Natural Language Processing

Batchable utilities for NLP. Note that modules prefixed with [AutoML] are designed to be launched by AutoMLTask, and those marked with an additional * (i.e. [AutoML*]) are designed to be the final task in an AutoMLTask chain (i.e. they provide a ‘loss’).

[AutoML*] run.py (corex_topic_model)

Generate topics based on the CorEx algorithm. Loss is calculated from the total correlation explained.

run()[source]

[AutoML] run.py (ngrammer)

Find and replace n-grams in a body of text, based on Wiktionary N-Grams. Whilst at it, the ngrammer also tokenizes and removes stop words (unless they occur within an n-gram).
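The find-and-replace behaviour described above can be sketched roughly as follows. This is an illustrative reimplementation, not the module's actual code: `NGRAMS` and `STOP_WORDS` are hypothetical stand-ins for the Wiktionary n-gram lookup and stop-word list the real task loads.

```python
# Hypothetical stand-ins for the Wiktionary n-gram lookup and stop-word list
NGRAMS = {("machine", "learning"): "machine_learning"}
STOP_WORDS = {"the", "a", "of"}

def ngrammer(text, ngrams=NGRAMS, stops=STOP_WORDS, n=2):
    """Tokenize, join known n-grams, and drop stop words that
    fall outside any n-gram."""
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        window = tuple(tokens[i:i + n])
        if window in ngrams:
            out.append(ngrams[window])  # keep the joined n-gram whole
            i += n
        else:
            if tokens[i] not in stops:  # stop words removed outside n-grams
                out.append(tokens[i])
            i += 1
    return out
```

For example, `ngrammer("The power of machine learning")` yields `["power", "machine_learning"]`: the stop words are dropped, but the n-gram is joined rather than filtered.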

run()[source]

[AutoML] run.py (tfidf)

Applies TF-IDF cuts to a dataset via the environment variables lower_tfidf_percentile and upper_tfidf_percentile.
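A percentile cut of this kind might look like the following sketch. It is an assumption about the mechanics (the real module reads the bounds from the environment variables named above); here they are passed as plain arguments and the matrix is assumed dense.

```python
import numpy as np

def tfidf_cut(tfidf, lower_pct, upper_pct):
    """Keep only the terms whose summed TF-IDF weight falls between
    the lower and upper percentiles (illustrative sketch)."""
    totals = tfidf.sum(axis=0)                 # total weight per term
    lo = np.percentile(totals, lower_pct)
    hi = np.percentile(totals, upper_pct)
    keep = (totals >= lo) & (totals <= hi)     # boolean mask over columns
    return tfidf[:, keep], keep
```

In this sketch the extremes are removed per term (column), so both very rare and very dominant terms are cut in one pass.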

chunker(_transformed, n_chunks)[source]

Yield chunks from a numpy array.

Parameters:
  • _transformed (np.array) – Array to split into chunks.
  • n_chunks (int) – Number of chunks to split the array into.
Yields:

chunk (np.array)
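An equivalent generator can be sketched with numpy's own splitting helper; this is an illustrative reimplementation of the signature documented above, not the module's source.

```python
import numpy as np

def chunker(_transformed, n_chunks):
    """Yield n_chunks successive chunks from a numpy array.
    np.array_split handles arrays that do not divide evenly."""
    for chunk in np.array_split(_transformed, n_chunks):
        yield chunk
```

For example, splitting a length-10 array into 3 chunks yields pieces of sizes 4, 3 and 3, preserving order.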

run()[source]

[AutoML] run.py (vectorizer)

Vectorizes (counts or binary) text data, and applies basic filtering of extreme term/document frequencies.
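Count/binary vectorization with document-frequency filtering can be sketched as follows. The `min_df`/`max_df` arguments are hypothetical stand-ins for the task's extreme term/document frequency cuts.

```python
from collections import Counter

def vectorize(docs, binary=False, min_df=1, max_df=None):
    """Sketch: build a vocabulary filtered by document frequency,
    then map each tokenised document to {term index: count}."""
    # document frequency: in how many docs each term appears
    df = Counter(t for doc in docs for t in set(doc))
    max_df = max_df if max_df is not None else len(docs)
    vocab = sorted(t for t, d in df.items() if min_df <= d <= max_df)
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for doc in docs:
        counts = Counter(t for t in doc if t in index)
        rows.append({index[t]: (1 if binary else c) for t, c in counts.items()})
    return rows, vocab
```

With `binary=True` every present term maps to 1 regardless of how often it occurs, matching the "counts or binary" choice described above.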

term_counts(dct, row, binary=False)[source]

Convert a single document to term counts via a gensim dictionary.

Parameters:
  • dct (Dictionary) – Gensim dictionary.
  • row (str) – A document.
  • binary (bool) – Return binary (0/1) indicators rather than total counts?
Returns:

dict of term id (from the Dictionary) to term count.
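A minimal sketch of this mapping, assuming only that `dct` exposes a `token2id` mapping (as gensim's Dictionary does) and that the document is whitespace-tokenised:

```python
from collections import Counter

def term_counts(dct, row, binary=False):
    """Sketch: map a document string to {term id: count}, keeping
    only tokens known to the dictionary."""
    counts = Counter(dct.token2id[tok] for tok in row.split()
                     if tok in dct.token2id)
    if binary:
        return {i: 1 for i in counts}   # presence only, not totals
    return dict(counts)
```

Any object with a `token2id` dict works for illustration, so the sketch does not require gensim itself.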

optional(name, default)[source]

Define an optional environment variable field with a default value.
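One plausible one-liner for this helper, sketched under the assumption that it simply falls back to the default when the variable is unset:

```python
import os

def optional(name, default):
    """Sketch: read an optional environment variable, returning
    the given default when it is not set."""
    return os.environ.get(name, default)
```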

merge_lists(list_of_lists)[source]

Join a list of lists into a single list. Returns an empty list if the input is not a list, which is expected to happen (from the ngrammer) if no long text was found.
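The behaviour above, including the non-list guard, can be sketched as:

```python
def merge_lists(list_of_lists):
    """Sketch: flatten one level of nesting; non-list input
    (e.g. no long text found by the ngrammer) yields []."""
    if not isinstance(list_of_lists, list):
        return []
    return [item for sublist in list_of_lists for item in sublist]
```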

run()[source]