Health data

Initially for our project with the Robert Woods Johnson Foundation (RWJF), these procedures outline the data collection of health-specific data.

Collect NIH

Extract all of the NIH World RePORTER data via their static data dump. N_TABS outputs are produced in CSV format (concatenated across all years), where N_TABS correspondes to the number of tabs in the main table found at:

The data is transferred to the Nesta intermediate data bucket.


Get all CSV URLs from the tab_index`th tab of the main table found at :code:`TOP_URL.

Parameters:tab_index (int) – Tab number (0-indexed) of table to extract CSV URLs from.
Returns:Title of the tab in the table. hrefs (list): List of URLs pointing to data CSVs.
Return type:title (str)

Standardise inconsistently formatted field names, by replacing non-alphanums with single underscores and lowercasing.


Yield rows from the CSV (found at URL url) as JSON (well, dict objects).

Parameters:url (str) – The URL at which a zipped-up CSV is found.
Yields:dict object, representing one row of the CSV.


Data cleaning / wrangling before ingestion of raw data, specifically:

  • Systematically removing generic prefixes using very hard-coded logic.
  • Inferring how to correctly deal with mystery question marks, using very hard-coded logic.
  • Splitting of strings into arrays as indicated by JSON the ORM,
  • CAPS to Camel Case for any string field which isn’t VARCHAR(n) < 10
  • Dealing consistently with null values
  • explicit conversion to datetime of relevant fields

Return the column names in the ORM which are of JSON type


Return the column names in the ORM which are a text type, (i.e. TEXT or VARCHAR) and if a max length is specified, with max length > 10. The length requirement is because we don’t want to preprocess ID-like or code fields (e.g. ISO codes).


Return the column names in the ORM which are of JSON type

is_nih_null(value, nulls=('', []{}, 'N/A', 'Not Required', 'None'))[source]

Returns True if the value is listed in the nulls argument, or the value is NaN, null or None.


Expand GENERIC_PREFIXES to include integers, and then a large numbers of permutations of additional characters, upper case and title case. From tests, this covers 19 out of 20 generic prefixes from either abstract text or the “PHR” field.


Iteratively remove any of the generic terms in expand_prefix_list from the front of the text, until none remain.


Iteratively replace any large spaces or tabs with a single space, until none remain.


Somewhere in NiH’s backend, they have a unicode processing problem. From inspection, most of the ‘?’ symbols have quite an intuitive origin, and so this function contains the hard-coded logic for inferring what symbol used to be in the place of each ‘?’.


A lot of abstracts end with ‘!’ and then a bunch of spaces.

upper_to_title(text, force_title=False)[source]

Inconsistently, NiH has fields as all upper case. Convert to titlecase

clean_text(text, suffix_removal_length=100)[source]

Apply the full text-cleaning procedure.


Split values either by colons or commas. If there are more colons than commas (+1), then colons are used for splitting (this takes into account that NiH name fields are written as ‘last_name, first_name; next_last_name, next_first_name’). Otherwise NiH list fields are delimeted by commas.


Apply detect_and_split and then apply some general cleaning.


Convert to date, unless the value null, or poorly formatted.


In a very small number of cases, NiH has some pre-processed badly formatted unspecifed unicode characters. The following recipe seems to clean up all of the discovered cases.

preprocess_row(row, orm)[source]

Clean text, split values and standardise nulls, as required.

  • row (dict) – Row of data to clean, that should match the provided ORM.
  • orm (SqlAlchemy selectable) – ORM from which to infer JSON and text fields.

Process NIH

Data cleaning and processing procedures for the NIH World Reporter data. Specifically, a lat/lon is generated for each city/country; and the formatting of date fields is unified.