Natural Language Processing¶
Batchable utilities for NLP. Note that modules prefixed with [AutoML] are
designed to be launched by
AutoMLTask, and those with the additional * (i.e.
[AutoML*]) are designed to be the final task in an [AutoML] run
(i.e. they provide a ‘loss’).
[AutoML*] run.py (corex_topic_model)¶
Generate topics based on the CorEx algorithm. Loss is calculated from the total correlation explained.
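As background for the loss described above, total correlation (the quantity CorEx maximizes) can be computed directly for small discrete examples. This is an illustrative sketch of the information-theoretic definition, not the module's implementation, and the function names are chosen here for clarity:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def total_correlation(columns):
    """Total correlation TC = sum of marginal entropies minus the
    joint entropy, for a list of equal-length label sequences
    (one sequence per variable)."""
    joint = list(zip(*columns))  # one tuple per observation
    return sum(entropy(col) for col in columns) - entropy(joint)
```

Two perfectly correlated variables yield TC = ln 2, while independent variables yield TC = 0; CorEx searches for latent topics that explain as much of this dependence as possible.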
[AutoML] run.py (ngrammer)¶
Find and replace ngrams in a body of text, based on Wiktionary N-Grams. In the process, the ngrammer also tokenizes the text and removes stop words (unless they occur within an n-gram).
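The behaviour described above can be sketched as follows. This is a minimal illustration, not the module's code: the stop-word set and n-gram vocabulary below are hypothetical stand-ins for the Wiktionary data, and the function name is invented for the example.

```python
# Hypothetical stop words and n-gram vocabulary, for illustration only.
STOP_WORDS = {"the", "of", "a", "in"}
NGRAMS = {("machine", "learning"), ("natural", "language", "processing")}

def find_and_replace_ngrams(text, ngrams=NGRAMS, stops=STOP_WORDS, max_n=3):
    """Tokenize text, join known n-grams into single tokens, and drop
    stop words that fall outside any n-gram."""
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        matched = False
        for n in range(max_n, 1, -1):  # prefer the longest n-gram match
            gram = tuple(tokens[i:i + n])
            if gram in ngrams:
                out.append("_".join(gram))  # n-gram becomes one token
                i += n
                matched = True
                break
        if not matched:
            if tokens[i] not in stops:  # stop words removed outside n-grams
                out.append(tokens[i])
            i += 1
    return out
```

Note that stop words inside a matched n-gram survive, since the n-gram is consumed as a whole before the stop-word check runs.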
[AutoML] run.py (tfidf)¶
Applies TFIDF cuts to a dataset via the environment variables lower_tfidf_percentile and upper_tfidf_percentile.
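A percentile cut of this kind can be sketched with numpy. This is an assumed implementation for illustration (the function name and the per-term aggregation are choices made here, not taken from the module), mirroring the use of the two environment variables named above:

```python
import os
import numpy as np

def tfidf_percentile_cut(tfidf, lower=None, upper=None):
    """Keep only terms whose aggregate TFIDF score lies between the
    given percentiles. `tfidf` is a dense (documents x terms) array.
    Percentile bounds default to the environment variables
    lower_tfidf_percentile and upper_tfidf_percentile."""
    if lower is None:
        lower = float(os.environ.get("lower_tfidf_percentile", 0))
    if upper is None:
        upper = float(os.environ.get("upper_tfidf_percentile", 100))
    scores = tfidf.sum(axis=0)  # one aggregate score per term
    lo, hi = np.percentile(scores, [lower, upper])
    keep = (scores >= lo) & (scores <= hi)
    return tfidf[:, keep], keep
```

Terms with extreme scores (very common or very rare) fall outside the percentile band and are dropped.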
Yield chunks from a numpy array.
- _transformed (np.array) – Array to split into chunks.
- n_chunks (int) – Number of chunks to split the array into.
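A generator matching the documented parameters might look like the following sketch (the function name is assumed; np.array_split is used so uneven splits are handled gracefully):

```python
import numpy as np

def chunker(_transformed, n_chunks):
    """Yield successive chunks from a numpy array.

    _transformed: array to split into chunks.
    n_chunks: number of chunks to split the array into.
    """
    for chunk in np.array_split(_transformed, n_chunks):
        yield chunk
```

When the array length is not divisible by n_chunks, the earlier chunks are one element longer, and concatenating the chunks recovers the original array.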
[AutoML] vectorizer (run.py)¶
Vectorizes (counts or binary) text data, and applies basic filtering of extreme term/document frequencies.
term_counts(dct, row, binary=False)¶
Convert a single document to term counts via a gensim dictionary.
- dct (Dictionary) – Gensim dictionary.
- row (str) – A document.
- binary (bool) – Binary rather than total count?
dict of term id (from the Dictionary) to term count.
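The mapping described above can be sketched without gensim. Here a plain dict stands in for the gensim Dictionary's token-to-id lookup (in the real module, dct is a gensim.corpora.Dictionary and the tokenization may differ):

```python
from collections import Counter

# Stand-in for a gensim Dictionary: token -> integer term id.
TOKEN2ID = {"nlp": 0, "topic": 1, "model": 2}

def term_counts(dct, row, binary=False):
    """Convert a single document (a string) to {term id: count},
    keeping only tokens known to the dictionary. With binary=True,
    every present term gets a count of 1."""
    counts = Counter(dct[tok] for tok in row.lower().split() if tok in dct)
    if binary:
        return {term_id: 1 for term_id in counts}
    return dict(counts)
```

Tokens absent from the dictionary are silently dropped, which matches the usual doc2bow-style behaviour of counting only known terms.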
Defines optional environment fields with default values.
Join a list of lists into a single list. Returns an empty list if the input is not a list, which is expected to happen (from the ngrammer) if no long text was found.
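The documented behaviour is small enough to sketch in full (the function name here is assumed, not taken from the module):

```python
def join_lists(nested):
    """Flatten a list of lists into a single list. Non-list input
    returns [], which is the expected outcome (from the ngrammer)
    when no long text was found."""
    if not isinstance(nested, list):
        return []
    return [item for sublist in nested for item in sublist]
```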