NiH data (health research)

Batchables for the collection and processing of NiH data. As documented under packages and routines, the pipeline is executed in the order given below. The per-file documentation that follows is not very informative; the packages and routines documentation is the better reference.

The data is collected from official data dumps, parsed into MySQL (tier 0) and then, after processing, piped into Elasticsearch (tier 1).

nih_collect_data

Collect NiH table from the official data dump, based on the name of the table. The data is piped into the MySQL database.

run()

nih_process_data

Geocode NiH data (from MySQL) and pipe into Elasticsearch.

run()

nih_abstract_mesh_data

Retrieve NiH abstracts from MySQL, assign pre-calculated MeSH terms for each abstract, and pipe data into Elasticsearch. Exact abstract duplicates are removed at this stage.


Removes multiple spaces, tabs and newlines.

Parameters: abstract (str) – text to be cleaned
Returns: (str) cleaned text
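
The whitespace cleaning described above can be illustrated with a short sketch. The function name `clean_whitespace` is hypothetical (the name of the original function is not shown here), and stripping leading and trailing whitespace is an assumption rather than documented behaviour.

```python
import re


def clean_whitespace(abstract: str) -> str:
    """Collapse runs of spaces, tabs and newlines into single spaces.

    Hypothetical helper: the name and the final strip() are assumptions,
    not taken from the batchable itself.
    """
    # \s+ matches any run of whitespace, including tabs and newlines
    return re.sub(r"\s+", " ", abstract).strip()
```

For example, `clean_whitespace("a \t\n b  c\n")` collapses every run of whitespace between the words to a single space.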
run()

nih_dedupe

Deduplicate NiH articles based on similarity scores using Elasticsearch’s document similarity API. Similarity is calculated based on the description of the project, the project abstract and the title of the project. Funding information is aggregated (summed) across all deduplicated articles, for both the total and the annual funds.
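
The funding aggregation described above could look roughly like the following. This is a minimal sketch, not the batchable's actual code: the function name `aggregate_funding` and the field names `total_cost` and `year` are hypothetical, and each duplicate is assumed to be a plain dictionary.

```python
from collections import defaultdict


def aggregate_funding(projects):
    """Sum funding across a group of duplicate projects.

    Hypothetical sketch: 'total_cost' and 'year' are assumed field names.
    Returns the overall total and a {year: funds} mapping.
    """
    total = 0
    yearly = defaultdict(int)
    for project in projects:
        # Treat a missing or null cost as zero funding
        cost = project.get("total_cost") or 0
        total += cost
        year = project.get("year")
        if year is not None:
            yearly[year] += cost
    return total, dict(sorted(yearly.items()))
```

Given a list of duplicates, the function returns one total and one per-year breakdown, which would then be stored on the single surviving document.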

get_value(obj, key)

Retrieve a value by key if it exists, else return None.
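
A helper with this signature might be written as below. Only the signature and the documented behaviour come from the text above; the body, including the choice of which exceptions to catch, is an assumption.

```python
def get_value(obj, key):
    """Return obj[key] if it exists, else None.

    Sketch only: the exception handling is an assumption, chosen so that
    missing keys, out-of-range indexes and non-subscriptable objects
    (such as None) all yield None.
    """
    try:
        return obj[key]
    except (KeyError, IndexError, TypeError):
        return None
```

Catching `TypeError` means the helper can be chained over nested documents where an intermediate level may be absent, for example `get_value(get_value(doc, "a"), "b")`.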


Extract yearly funds