Ontologies and schemas

Tier 0

Raw data collections (“tier 0”) in the production system do not adhere to a fixed schema or ontology, but instead have a schema which stays very close to the raw data. Modifications to field names tend to be quite basic, such as lowercasing and replacing whitespace with a single underscore.

Tier 1

Processed data (“tier 1”) is intended for public consumption, using a common ontology. The convention we use is as follows:

  • Field names are composed of up to three terms: a firstName, middleName and lastName
  • Each term (e.g. firstName) is written in lowerCamelCase.
  • firstName terms correspond to a restricted set of basic quantities.
  • middleName terms correspond to a restricted set of modifiers (e.g. adjectives) which add nuance to the firstName term. Note that the special middleName term “of” is reserved as the default value, used in case no middleName is specified.
  • lastName terms correspond to a restricted set of entity types.

Valid examples are date_start_project and title_of_project.
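As a sketch, the convention above can be checked mechanically. The term vocabularies below are invented for illustration only; the real restricted sets are defined by the ontology, not this snippet:

```python
import re

# Hypothetical vocabularies, for illustration only.
FIRST_NAMES = {"date", "title", "placeName"}
MIDDLE_NAMES = {"start", "of", "city", "zipcode"}
LAST_NAMES = {"project", "organisation", "group"}

# Each term must be lowerCamelCase.
CAMEL = re.compile(r"^[a-z]+([A-Z][a-z]*)*$")

def is_valid_tier1_field(field):
    """Check a field name against the firstName_middleName_lastName convention."""
    terms = field.split("_")
    if len(terms) != 3:  # assumes the default "of" is always written out
        return False
    first, middle, last = terms
    return (all(CAMEL.match(term) for term in terms)
            and first in FIRST_NAMES
            and middle in MIDDLE_NAMES
            and last in LAST_NAMES)

print(is_valid_tier1_field("date_start_project"))  # True
print(is_valid_tier1_field("title_of_project"))    # True
```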

Tier 0 fields are implicitly excluded from tier 1 if they are missing from the schema_transformation file. Tier 1 schema field names are applied via nesta.packages.decorator.schema_transform.
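The real decorator lives in nesta.packages.decorator.schema_transform; the following is only an illustrative sketch of the kind of renaming it performs, with an invented field map and row. Note how the unmapped field is implicitly dropped:

```python
def schema_transform(field_map):
    """Sketch of a tier-0 -> tier-1 renaming decorator. Fields absent
    from field_map are dropped, mirroring the implicit exclusion of
    fields missing from the schema_transformation file."""
    def wrapper(func):
        def transformed(*args, **kwargs):
            rows = func(*args, **kwargs)
            return [{new: row[old]
                     for old, new in field_map.items()
                     if old in row}
                    for row in rows]
        return transformed
    return wrapper

@schema_transform({"Project Title": "title_of_project",
                   "Start Date": "date_start_project"})
def fetch_rows():
    # Invented tier-0 rows; "Internal ID" has no mapping, so it is dropped.
    return [{"Project Title": "DAPS", "Start Date": "2020-01-01",
             "Internal ID": 123}]

print(fetch_rows())
```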

Tier 2

Although not yet implemented, the tier 2 schema is reserved for future graph ontologies. Don’t expect any changes any time soon!

Elasticsearch mappings

Our methodology for constructing Elasticsearch mappings is described here. It is intended to minimise duplication of efforts and enforce standardisation when referring to a common dataset whilst being flexible to individual project needs. It is implied in our framework that a single dataset can be used across many projects, and each project is mapped to a single endpoint. It is useful to start by looking at the structure of the nesta/core/schemas/tier_1/mappings/ directory:

├── datasets
│   ├── arxiv_mapping.json
│   ├── companies_mapping.json
│   ├── cordis_mapping.json
│   ├── gtr_mapping.json
│   ├── meetup_mapping.json
│   ├── nih_mapping.json
│   └── patstat_mapping.json
├── defaults
│   └── defaults.json
└── endpoints
    ├── arxlive
    │   └── arxiv_mapping.json
    ├── eurito
    │   ├── arxiv_mapping.json
    │   ├── companies_mapping.json
    │   └── patstat_mapping.json
    └── health-scanner
        ├── aliases.json
        ├── config.yaml
        ├── nih_mapping.json
        └── nulls.json

Firstly, consider defaults/defaults.json, which should contain all default fields for all mappings - for example, standard analyzers and dynamic strictness. We might also consider putting global fields there.
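For illustration only, a defaults.json along these lines might look as follows (the analyzer name and settings here are hypothetical, not the actual contents of the file):

```json
{
    "mappings": {
        "_doc": {
            "dynamic": "strict"
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "terms_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    }
}
```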

Next consider the datasets subdirectory. Each mapping file in here should contain the complete mappings field for the respective dataset. The naming convention <dataset>_mapping.json is a hard requirement, as <dataset> will map to the index for this dataset at any given endpoint.

Finally, consider the endpoints subdirectory. Each sub-subdirectory here should map to any endpoint which requires changes beyond the defaults and datasets mappings. Each mapping file within each endpoint sub-subdirectory (e.g. arxlive or health-scanner) should satisfy the same naming convention (<dataset>_mapping.json). All conventions here are also consistent with the elasticsearch.yaml configuration file (to see this configuration, you will need to clone the repo and follow these steps to decrypt the config), which looks a little like this:

## The following assumes the AWS host endpoint naming convention:
## {scheme}://search-{endpoint}-{id}.{region}.es.amazonaws.com
  scheme: https
  port: 443
  region: eu-west-2
  type: _doc
  # -------------------------------
  # <AWS endpoint domain name>:
  #   id: <AWS endpoint UUID>
  #   <default override key>: <default override value>  ## e.g.: scheme, port, region, _type
  #   indexes:
  #     <index name>: <incremental version number>  ## Note: defaults to <index name>_dev in testing mode
  # -------------------------------
  arxlive:
    id: <this is a secret>
    indexes:
      arxiv: 4
  # -------------------------------
  health-scanner:
    id: <this is a secret>
    indexes:
      nih: 6
      companies: 5
      meetup: 4
... etc ...

Note that for the health-scanner endpoint, companies and meetup will be generated from the datasets mappings, as they are not specified under the endpoints/health-scanner subdirectory. Also note that endpoints sub-directories do not need to exist for each endpoint to be generated: the mappings will simply be generated from the dataset defaults. For example, a new endpoint general can be generated from the DAPS codebase using the above, even though there is no endpoints/general sub-subdirectory.
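The fallback rule described above can be sketched as a small path-resolution helper. The function name is hypothetical (the real logic lives in the DAPS codebase), but the directory layout matches the tree shown earlier:

```python
from pathlib import Path

MAPPINGS_DIR = Path("nesta/core/schemas/tier_1/mappings")

def find_mapping_path(dataset, endpoint, base=MAPPINGS_DIR):
    """Prefer an endpoint-specific mapping file if one exists;
    otherwise fall back to the shared mapping under datasets/."""
    override = base / "endpoints" / endpoint / f"{dataset}_mapping.json"
    if override.exists():
        return override
    return base / "datasets" / f"{dataset}_mapping.json"
```

For example, arxiv at the arxlive endpoint resolves to endpoints/arxlive/arxiv_mapping.json, whereas arxiv at a hypothetical general endpoint (with no sub-subdirectory) falls back to datasets/arxiv_mapping.json.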

Individual endpoints can also specify aliases.json which harmonises field names across datasets for specific endpoints. This uses a convention as follows:

    #...the convention is...
    "<new field name>": {
        "<dataset 1>": "<old field name 1>",
        "<dataset 2>": "<old field name 2>",
        "<dataset 3>": "<old field name 3>"
    },
    #...an example is...
    "city": {
        "companies": "placeName_city_organisation",
        "meetup": "placeName_city_group",
        "nih": "placeName_city_organisation"
    }

By default, this applies (what Joel calls) a “soft” alias, which is a regular Elasticsearch alias. By specifying hard-alias=true in config.yaml (see health-scanner above), the alias is instead applied directly: field names are physically replaced rather than aliased.
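A minimal sketch of the “hard” alias behaviour, using the aliases.json convention above (the helper name is hypothetical; the soft variant would instead register an Elasticsearch alias):

```python
def apply_hard_alias(properties, aliases, dataset):
    """Physically rename fields in a mapping's properties: move each
    old field's definition to its new (harmonised) name and drop the
    old key, as per the dataset's entries in aliases.json."""
    for new_name, per_dataset in aliases.items():
        old_name = per_dataset.get(dataset)
        if old_name in properties:
            properties[new_name] = properties.pop(old_name)
    return properties

aliases = {"city": {"meetup": "placeName_city_group",
                    "nih": "placeName_city_organisation"}}
props = {"placeName_city_group": {"type": "keyword"}}
print(apply_hard_alias(props, aliases, "meetup"))  # {'city': {'type': 'keyword'}}
```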

You will also notice the nulls.json file in the health-scanner endpoint. This is a relatively experimental feature for automatically nullifying values on ingestion through ElasticsearchPlus, in lieu of proper exploratory data analysis. The logic and format for this is documented here.

Mapping construction hierarchy

Each mapping is constructed by overriding nested fields using the defaults datasets and endpoints, in that order (i.e. endpoints override nested fields in datasets, and datasets override those in defaults). If you would like to “switch off” a field from the defaults or datasets mappings, you should set the value of the nested field to null. For example:

    "mappings": {
        "_doc": {
            "dynamic": "strict",
            "properties": {
                "placeName_zipcode_organisation": null

will simply “switch off” the field placeName_zipcode_organisation, which was specified in datasets.
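The override-then-prune behaviour can be sketched as follows. The helper names and sample mappings are invented for illustration; the actual logic lives in orms.orm_utils:

```python
def deep_override(base, override):
    """Recursively override nested fields: values in `override` win,
    and nested dicts are merged key-by-key rather than replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_override(merged[key], value)
        else:
            merged[key] = value
    return merged

def prune_nested(mapping):
    """Drop any keys set to None, "switching off" those fields."""
    return {key: (prune_nested(value) if isinstance(value, dict) else value)
            for key, value in mapping.items() if value is not None}

# Invented sample mappings, merged in defaults -> datasets -> endpoints order:
defaults = {"mappings": {"_doc": {"dynamic": "strict"}}}
dataset_mapping = {"mappings": {"_doc": {"properties": {
    "placeName_zipcode_organisation": {"type": "keyword"},
    "title_of_project": {"type": "text"}}}}}
endpoint_mapping = {"mappings": {"_doc": {"properties": {
    "placeName_zipcode_organisation": None}}}}

merged = deep_override(deep_override(defaults, dataset_mapping), endpoint_mapping)
final = prune_nested(merged)
```

Here the endpoint’s null entry removes placeName_zipcode_organisation from the final mapping, while dynamic strictness from the defaults and the remaining dataset fields survive.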

The logic for the mapping construction hierarchy is demonstrated in the respective orms.orm_utils.get_es_mapping function:

def get_es_mapping(dataset, endpoint):
    '''Load the ES mapping for this dataset and endpoint,
    including aliases.

    Args:
        dataset (str): Name of the dataset for the ES mapping.
        endpoint (str): Name of the AWS ES endpoint.
    '''
    mapping = _get_es_mapping(dataset, endpoint)
    _apply_alias(mapping, dataset, endpoint)
    _prune_nested(mapping)  # prunes any nested keys with null values
    return mapping

Integrated tests

The following pytest tests are run (and triggered on PR via Travis):

  • aliases.json files are checked for consistency with available datasets.
  • All mappings, for each dataset in datasets and endpoints, are fully generated and tested for compatibility with the schema transformations (which are, in turn, checked against the valid ontology in ontology.json).
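The first of these checks might look something like the following sketch (a pure-function illustration with invented names, not the actual test code):

```python
def check_aliases(aliases, known_datasets):
    """Illustrative consistency check: every dataset referenced in an
    aliases.json file must correspond to an available dataset mapping."""
    referenced = {dataset
                  for per_dataset in aliases.values()
                  for dataset in per_dataset}
    unknown = referenced - known_datasets
    assert not unknown, f"Unknown datasets in aliases: {unknown}"

check_aliases({"city": {"nih": "placeName_city_organisation"}},
              known_datasets={"nih", "companies", "meetup"})  # passes
```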

Features in DAPS2

  • The index version (e.g. 'arxiv': 4 in elasticsearch.yaml) will be automatically generated from semantic versioning and the git hash in DAPS2, therefore the indexes field will consolidate to an itemised list of indexes.
  • The mappings under datasets will be automatically generated from the open ontology which will be baked into the tier-0 schemas. This will render schema_transformations redundant.
  • Elasticsearch components will be factored out of orm_utils.