NLP Utils

Standard tools for aiding natural language processing.

Preprocess

Tools for preprocessing text.

tokenize_document(text, remove_stops=False)

Preprocess a whole raw document.

Parameters:
  • text (str) – Raw string of text.
  • remove_stops (bool) – Flag to remove English stopwords.
Returns: List of preprocessed and tokenized documents
clean_and_tokenize(text, remove_stops)

Preprocess a raw string/sentence of text.

Parameters:
  • text (str) – Raw string of text.
  • remove_stops (bool) – Flag to remove English stopwords.
Returns: Preprocessed tokens.
Return type: tokens (list, str)
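
As a rough illustration of what a cleaning and tokenizing step of this kind involves, here is a minimal sketch. It is not the library's implementation: the name sketch_clean_and_tokenize, the regular expression, and the tiny EXAMPLE_STOPS set are assumptions made for this example, and the real tokenization rules and stopword list may differ.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the stopword list used by the documented function is not shown here.
EXAMPLE_STOPS = {"a", "an", "and", "the", "of", "to", "is", "in"}


def sketch_clean_and_tokenize(text, remove_stops):
    """Illustrative stand-in, not the documented clean_and_tokenize."""
    # Lowercase the text and keep only runs of letters and digits,
    # which discards punctuation and whitespace.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stops:
        # Drop any token found in the example stopword set.
        tokens = [t for t in tokens if t not in EXAMPLE_STOPS]
    return tokens


print(sketch_clean_and_tokenize("The cat sat in the hat!", True))
# prints ['cat', 'sat', 'hat']
```

Passing remove_stops=False in the sketch keeps every token, including "the" and "in".
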
filter_by_idf(documents, lower_idf_limit, upper_idf_limit)

Remove terms from documents according to a range of IDF values, with the limits given as percentiles.

Parameters:
  • documents (list) – Either a list of str or a list of list of str to be filtered.
  • lower_idf_limit (float) – Lower percentile (between 0 and 100) of the IDF values, used as a cutoff for excluding terms.
  • upper_idf_limit (float) – Upper percentile (between 0 and 100) of the IDF values, used as a cutoff for excluding terms.
Returns: Filtered documents
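
To make the percentile-based filtering concrete, here is a minimal sketch in plain Python. It is not the library's implementation, and it rests on several assumptions: documents are already tokenized (a list of list of str), IDF is the unsmoothed log(N / df), percentiles use linear interpolation, and terms whose IDF lies below the lower cutoff or above the upper cutoff are the ones removed. The documented function may differ on any of these points. All function names below are hypothetical.

```python
import math
from collections import Counter


def idf_scores(documents):
    """Unsmoothed IDF, log(N / df), for a list of token lists."""
    n_docs = len(documents)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def percentile(values, q):
    """Percentile with linear interpolation, q between 0 and 100."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def sketch_filter_by_idf(documents, lower_idf_limit, upper_idf_limit):
    """Illustrative stand-in, not the documented filter_by_idf."""
    idf = idf_scores(documents)
    if not idf:
        # No terms at all, so there is nothing to filter.
        return [list(doc) for doc in documents]
    low = percentile(idf.values(), lower_idf_limit)
    high = percentile(idf.values(), upper_idf_limit)
    # Assumption: keep terms whose IDF lies within [low, high].
    return [[t for t in doc if low <= idf[t] <= high] for doc in documents]


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
print(sketch_filter_by_idf(docs, 10, 100))
# prints [['cat', 'sat'], ['dog', 'sat'], ['cat', 'ran'], ['bird', 'flew']]
```

In this toy corpus "the" appears in every document, so its IDF is 0 and it falls below the 10th-percentile cutoff. Lowering the upper limit instead would trim the rarest terms.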