Natural Language Processing

Batchable utilities for NLP. Note that modules prefixed with [AutoML] are designed to be launched by AutoMLTask, and those marked with an additional * (i.e. [AutoML*]) are designed to be the final task in an AutoMLTask chain (i.e. they provide a ‘loss’).

[AutoML*] (corex_topic_model)

Generate topics based on the CorEx algorithm. Loss is calculated from the total correlation explained.


[AutoML] (ngrammer)

Find and replace n-grams in a body of text, based on Wiktionary N-Grams. In the same pass, the ngrammer also tokenizes the text and removes stop words (unless they occur within an n-gram).


[AutoML] (tfidf)

Applies TFIDF cuts to a dataset via the environment variables lower_tfidf_percentile and upper_tfidf_percentile.
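As a rough illustration of a percentile-based cut, the sketch below keeps only the terms whose score falls between a lower and an upper percentile. The helper name tfidf_cut_mask, the use of one score per term, and the default percentiles (0 and 100) are assumptions made for this example, not the module's actual behaviour; only the two environment variable names come from the description above.

```python
import os

import numpy as np


def tfidf_cut_mask(scores, lower_pct, upper_pct):
    """Return a boolean mask of scores lying within the percentile band.

    Hypothetical helper: assumes one TFIDF score per term.
    """
    scores = np.asarray(scores, dtype=float)
    # Convert the percentile settings into absolute score thresholds
    lower, upper = np.percentile(scores, [lower_pct, upper_pct])
    return (scores >= lower) & (scores <= upper)


# Read the cuts from the environment; the fallback values are assumptions
lower_pct = float(os.environ.get("lower_tfidf_percentile", 0))
upper_pct = float(os.environ.get("upper_tfidf_percentile", 100))

scores = np.arange(101, dtype=float)
mask = tfidf_cut_mask(scores, lower_pct, upper_pct)
kept_scores = scores[mask]
```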

chunker(_transformed, n_chunks)[source]

Yield chunks from a numpy array.

  • _transformed (np.array) – Array to split into chunks.
  • n_chunks (int) – Number of chunks to split the array into.

Yields: chunk (np.array)
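A minimal sketch of this kind of generator is shown below. It assumes that np.array_split is an acceptable way to divide the array (it tolerates lengths that are not a multiple of n_chunks); the name chunker_sketch is hypothetical and the real implementation may split differently.

```python
import numpy as np


def chunker_sketch(_transformed, n_chunks):
    """Yield n_chunks roughly equal pieces of a numpy array, in order."""
    # array_split (unlike split) allows uneven chunk sizes
    for chunk in np.array_split(_transformed, n_chunks):
        yield chunk


chunks = list(chunker_sketch(np.arange(10), 3))
sizes = [len(chunk) for chunk in chunks]  # [4, 3, 3]
```

Because the function is a generator, each chunk can be processed and discarded before the next one is requested.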


[AutoML] (vectorizer)

Vectorizes (counts or binary) text data, and applies basic filtering of extreme term/document frequencies.

term_counts(dct, row, binary=False)[source]

Convert a single document to term counts via a gensim dictionary.

  • dct (Dictionary) – Gensim dictionary.
  • row (str) – A document.
  • binary (bool) – Binary rather than total count?

Returns: dict of term id (from the Dictionary) to term count.
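The idea can be sketched without gensim by standing in a plain token-to-id mapping for the Dictionary (a gensim Dictionary exposes a comparable mapping as token2id, and its doc2bow method produces similar id and count pairs). The function name term_counts_sketch and the whitespace tokenisation are assumptions made for this example.

```python
from collections import Counter


def term_counts_sketch(token2id, row, binary=False):
    """Map a document to {term id: count}, ignoring unknown tokens.

    token2id is a plain dict standing in for a gensim Dictionary.
    """
    tokens = row.split()  # assumption: the document is whitespace-separated
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    if binary:
        # Record presence only, rather than the total count
        return {term_id: 1 for term_id in counts}
    return dict(counts)


token2id = {"cat": 0, "sat": 1}
totals = term_counts_sketch(token2id, "cat sat cat mat")
presence = term_counts_sketch(token2id, "cat sat cat mat", binary=True)
```

Here "mat" is dropped because it is not in the mapping, so totals holds a count of 2 for id 0 and 1 for id 1, while presence holds 1 for both.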

optional(name, default)[source]

Defines optional env fields with default values.
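One plausible reading of this helper is a thin wrapper around an environment lookup, sketched below. The name optional_sketch and the variable name DEMO_OPTIONAL_FIELD are hypothetical, and the real function may wrap the value in another type or attach it to a task rather than returning it.

```python
import os


def optional_sketch(name, default):
    """Return the environment value for name, or default if it is unset."""
    # Values read from the environment are always strings
    return os.environ.get(name, default)


os.environ["DEMO_OPTIONAL_FIELD"] = "3"
found = optional_sketch("DEMO_OPTIONAL_FIELD", 10)
missing = optional_sketch("DEMO_OPTIONAL_FIELD_UNSET_XYZ", 10)
```

Note that found is the string "3", not an integer, so callers of a helper like this would need to convert numeric settings themselves.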


Join a list of lists into a single list. Returns an empty list if the input is not a list, which is expected to happen (from the ngrammer) if no long text was found.
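The behaviour described above can be sketched as follows. The name flatten_sketch is hypothetical, since the original signature is not shown here, and the sketch flattens exactly one level of nesting.

```python
from itertools import chain


def flatten_sketch(nested):
    """Join a list of lists into one list; non-list input gives []."""
    if not isinstance(nested, list):
        # e.g. None or NaN where no long text was found upstream
        return []
    return list(chain.from_iterable(nested))


joined = flatten_sketch([["new", "york"], ["city"]])
empty = flatten_sketch(None)
```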