Natural Language Processing¶
Batchable utilities for NLP. Note that modules prefixed with [AutoML] are designed to be launched by AutoMLTask, and those with an additional * (i.e. [AutoML*]) are designed to be the final task in an AutoMLTask chain (i.e. they provide a ‘loss’).
[AutoML*] run.py (corex_topic_model)¶
Generate topics based on the CorEx algorithm. Loss is calculated from the total correlation explained.
[AutoML] run.py (ngrammer)¶
Find and replace n-grams in a body of text, based on Wiktionary N-Grams. While doing so, the ngrammer also tokenizes the text and removes stop words (unless they occur within an n-gram).
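As a rough illustration of the behaviour described above, the sketch below greedily merges known n-grams and then drops stop words that are not part of a merged n-gram. The `NGRAMS` and `STOP_WORDS` sets and the `ngram_tokens` function are hypothetical stand-ins for illustration only; they are not the module's Wiktionary-derived data or its actual interface.

```python
import re

# Hypothetical stand-ins for the Wiktionary n-gram list and a stop word list.
NGRAMS = {("state", "of", "the", "art"), ("machine", "learning")}
STOP_WORDS = {"the", "of", "is", "a", "in"}
MAX_N = max(len(ng) for ng in NGRAMS)


def ngram_tokens(text):
    """Tokenize, join known n-grams with underscores, then drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate n-gram first so longer matches win.
        for n in range(min(MAX_N, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in NGRAMS:
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No n-gram starts here: keep the token unless it is a stop word.
            if tokens[i] not in STOP_WORDS:
                out.append(tokens[i])
            i += 1
    return out
```

Matching n-grams before filtering is what lets stop words such as "of" and "the" survive inside "state_of_the_art" while being removed elsewhere.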
[AutoML] run.py (tfidf)¶
Applies TF-IDF cuts to a dataset via the environment variables lower_tfidf_percentile and upper_tfidf_percentile.
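One way such percentile cuts could work is sketched below: score every term in every document, find the score values at the lower and upper percentiles, and keep only terms whose score lies in that window. The exact cut rule, the nearest-rank percentile, the default values and the function names are all assumptions made for illustration; only the two environment variable names come from the description above.

```python
import math
import os
from collections import Counter


def percentile(sorted_vals, pct):
    """Nearest-rank percentile of an ascending list (pct in 0..100)."""
    if not sorted_vals:
        raise ValueError("no values")
    rank = max(1, math.ceil(pct / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def tfidf_cut(docs, lower_pct, upper_pct):
    """Drop terms whose TF-IDF score falls outside the percentile window."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        # Term frequency (share of the document) times log inverse document frequency.
        scores.append({t: (c / len(doc)) * math.log(n_docs / doc_freq[t])
                       for t, c in counts.items()})
    all_scores = sorted(s for doc_scores in scores for s in doc_scores.values())
    lo = percentile(all_scores, lower_pct)
    hi = percentile(all_scores, upper_pct)
    return [[t for t in doc if lo <= doc_scores[t] <= hi]
            for doc, doc_scores in zip(docs, scores)]


# Read the cut points from the environment; the defaults here are hypothetical.
lower = float(os.environ.get("lower_tfidf_percentile", 5))
upper = float(os.environ.get("upper_tfidf_percentile", 95))
```

With tokenized documents in hand, `tfidf_cut(docs, lower, upper)` would then return the same documents with out-of-window terms removed.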
[AutoML] vectorizer (run.py)¶
Vectorizes (counts or binary) text data, and applies basic filtering of extreme term/document frequencies.
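The following is a minimal sketch of count or binary vectorization with document-frequency filtering, assuming documents arrive as lists of tokens. The `vectorize` function and its `min_df` and `max_df_frac` parameters are hypothetical names chosen for this example and are not taken from the module.

```python
from collections import Counter


def vectorize(docs, min_df=1, max_df_frac=1.0, binary=False):
    """Vectorize tokenized docs, dropping terms with extreme document frequency."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = {t for t, df in doc_freq.items()
             if df >= min_df and df / n_docs <= max_df_frac}
    vectors = []
    for doc in docs:
        counts = Counter(t for t in doc if t in vocab)
        # Binary mode records presence only; otherwise keep the raw count.
        vectors.append({t: (1 if binary else c) for t, c in counts.items()})
    return vectors
```

Filtering on document frequency removes both very rare terms, which add noise, and near-ubiquitous terms, which carry little signal.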
term_counts(dct, row, binary=False)¶
Convert a single document to term counts via a gensim dictionary.
Parameters:
- dct (Dictionary) – Gensim dictionary.
- row (str) – A document.
- binary (bool) – If True, record term presence (1) rather than the total count.
Returns: dict of term id (from the Dictionary) to term count.
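To make the return value concrete, here is a dependency-free sketch of the same idea. It substitutes a plain `token2id` mapping for the gensim Dictionary (gensim's `Dictionary` exposes a mapping of that name), and it assumes the document is a whitespace-separated string. The function `term_counts_sketch` and the example vocabulary are hypothetical; this is not the module's implementation.

```python
from collections import Counter


def term_counts_sketch(token2id, row, binary=False):
    """Map a document to {term id: count}, ignoring out-of-vocabulary tokens."""
    # Assumption: the document is a whitespace-separated string.
    ids = [token2id[tok] for tok in row.split() if tok in token2id]
    counts = Counter(ids)
    if binary:
        return {term_id: 1 for term_id in counts}
    return dict(counts)


token2id = {"data": 0, "science": 1, "policy": 2}  # hypothetical vocabulary
print(term_counts_sketch(token2id, "data science data ethics"))
# {0: 2, 1: 1}
print(term_counts_sketch(token2id, "data science data ethics", binary=True))
# {0: 1, 1: 1}
```

Tokens missing from the vocabulary ("ethics" here) are simply skipped, so the result only ever contains ids known to the dictionary.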