Ontologies and schemas¶
Tier 0¶
Raw data collections (“tier 0”) in the production system do not adhere to a fixed schema or ontology; instead, each has a schema very close to that of the raw data. Modifications to field names tend to be quite basic, such as lowercasing and replacing whitespace with a single underscore.
Tier 1¶
Processed data (“tier 1”) is intended for public consumption, using a common ontology. The convention we use is as follows:
- Field names are composed of up to three terms: a `firstName`, a `middleName` and a `lastName`.
- Each term (e.g. `firstName`) is written in lowerCamelCase.
- `firstName` terms correspond to a restricted set of basic quantities.
- `middleName` terms correspond to a restricted set of modifiers (e.g. adjectives) which add nuance to the `firstName` term. Note that the special `middleName` term `of` is reserved as the default value in case no `middleName` is specified.
- `lastName` terms correspond to a restricted set of entity types.

Valid examples are `date_start_project` and `title_of_project`.
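For illustration, the convention can be checked mechanically. In the sketch below the three term sets are hypothetical stand-ins for the actual restricted sets described above:

```python
# Hypothetical term sets: the real restricted sets live in the ontology.
FIRST_NAMES = {"date", "title", "placeName"}
MIDDLE_NAMES = {"of", "start", "city"}
LAST_NAMES = {"project", "organisation", "group"}

def is_valid_tier1_field(field_name):
    """Check a field name against the firstName_middleName_lastName convention."""
    terms = field_name.split("_")
    if len(terms) != 3:
        return False
    first, middle, last = terms
    return (first in FIRST_NAMES
            and middle in MIDDLE_NAMES
            and last in LAST_NAMES)
```

For example, `is_valid_tier1_field("date_start_project")` returns `True` under these term sets, whereas a tier-0 style name such as `"raw_field"` does not.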
Tier 0 fields are implicitly excluded from tier 1 if they are missing from the `schema_transformation` file. Tier 1 schema field names are applied via `nesta.packages.decorator.schema_transform`.
Tier 2¶
Although not yet implemented, the tier 2 schema is reserved for future graph ontologies. Don’t expect any changes any time soon!
Elasticsearch mappings¶
Our methodology for constructing Elasticsearch mappings is described here. It is intended to minimise duplication of effort and enforce standardisation when referring to a common dataset, whilst remaining flexible to individual project needs. Our framework assumes that a single `dataset` can be used across many projects, and that each project maps to a single `endpoint`. It is useful to start by looking at the structure of the `nesta/core/schemas/tier_1/mappings/` directory:
```
.
├── datasets
│   ├── arxiv_mapping.json
│   ├── companies_mapping.json
│   ├── cordis_mapping.json
│   ├── gtr_mapping.json
│   ├── meetup_mapping.json
│   ├── nih_mapping.json
│   └── patstat_mapping.json
├── defaults
│   └── defaults.json
└── endpoints
    ├── arxlive
    │   └── arxiv_mapping.json
    ├── eurito
    │   ├── arxiv_mapping.json
    │   ├── companies_mapping.json
    │   └── patstat_mapping.json
    └── health-scanner
        ├── aliases.json
        ├── config.yaml
        ├── nih_mapping.json
        └── nulls.json
```
Firstly we consider `defaults/defaults.json`, which should contain all default fields for all mappings - for example standard analyzers and dynamic strictness. We might also consider putting global fields there.
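For illustration, a minimal `defaults.json` enforcing dynamic strictness might look like the following sketch (the exact contents in the repository will differ, and any analyzer settings are omitted here):

```json
{
    "mappings": {
        "_doc": {
            "dynamic": "strict"
        }
    }
}
```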
Next consider the `datasets` subdirectory. Each mapping file in here should contain the complete `mappings` field for the respective dataset. The naming convention `<dataset>_mapping.json` is a hard requirement, as `<dataset>` will map to the index for this `dataset` at any given `endpoint`.
Finally consider the `endpoints` subdirectory. Each sub-subdirectory here should map to any `endpoint` which requires changes beyond the `defaults` and `datasets` mappings. Each mapping file within each `endpoint` sub-subdirectory (e.g. `arxlive` or `health-scanner`) should satisfy the same naming convention (`<dataset>_mapping.json`). All conventions here are also consistent with the `elasticsearch.yaml` configuration file (to see this configuration, you will need to clone the repo and follow these steps to decrypt the config), which looks a little like this:
```yaml
## The following assumes the AWS host endpoint naming convention:
## {scheme}://search-{endpoint}-{id}.{region}.es.amazonaws.com
defaults:
    scheme: https
    port: 443
    region: eu-west-2
    type: _doc
endpoints:
    # -------------------------------
    # <AWS endpoint domain name>:
    #     id: <AWS endpoint UUID>
    #     <default override key>: <default override value>  ## e.g.: scheme, port, region, _type
    #     indexes:
    #         <index name>: <incremental version number>  ## Note: defaults to <index name>_dev in testing mode
    # -------------------------------
    arxlive:
        id: <this is a secret>
        indexes:
            arxiv: 4
    # -------------------------------
    health-scanner:
        id: <this is a secret>
        indexes:
            nih: 6
            companies: 5
            meetup: 4
    # ... etc ...
```
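The host naming convention in the comment above means an endpoint's full URL can be assembled from the defaults plus any per-endpoint overrides. A minimal sketch (with a made-up `id`, since the real ones are secret):

```python
# Hypothetical in-memory version of elasticsearch.yaml
CONFIG = {
    "defaults": {"scheme": "https", "port": 443,
                 "region": "eu-west-2", "type": "_doc"},
    "endpoints": {
        "arxlive": {"id": "abc123", "indexes": {"arxiv": 4}},
    },
}

def es_host(endpoint, config=CONFIG):
    """Build {scheme}://search-{endpoint}-{id}.{region}.es.amazonaws.com."""
    settings = dict(config["defaults"])              # start from the defaults...
    settings.update(config["endpoints"][endpoint])   # ...then apply endpoint overrides
    return ("{scheme}://search-{endpoint}-{id}.{region}"
            ".es.amazonaws.com").format(endpoint=endpoint, **settings)
```

Under this sketch, `es_host("arxlive")` yields `https://search-arxlive-abc123.eu-west-2.es.amazonaws.com`.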
Note that for the `health-scanner` endpoint, `companies` and `meetup` will be generated from the `datasets` mappings, as they are not specified under the `endpoints/health-scanner` subdirectory. Also note that `endpoints` sub-directories do not need to exist for each `endpoint` to be generated: the mappings will simply be generated from the dataset defaults. For example, a new endpoint `general` can be generated from the DAPS codebase using the above, even though there is no `endpoints/general` sub-subdirectory.
Individual endpoints can also specify an `aliases.json` file, which harmonises field names across datasets for specific endpoints. This uses the following convention:
```
{
    #...the convention is...
    "<new field name>": {
        "<dataset 1>": "<old field name 1>",
        "<dataset 2>": "<old field name 2>",
        "<dataset 3>": "<old field name 3>"
    },
    #...an example is...
    "city": {
        "companies": "placeName_city_organisation",
        "meetup": "placeName_city_group",
        "nih": "placeName_city_organisation"
    },
    #...etc...#
}
```
By default, this applies (what Joel calls) a “soft” alias, which is an Elasticsearch alias; however, by specifying `hard-alias=true` in `config.yaml` (see `health-scanner` above), the alias is instead applied directly (i.e. field names are physically replaced, not aliased).
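A hard alias therefore amounts to a physical rename of fields at ingestion time. A minimal sketch of that idea (the function is illustrative, not the actual DAPS implementation):

```python
# The "city" example from aliases.json above
ALIASES = {
    "city": {
        "companies": "placeName_city_organisation",
        "meetup": "placeName_city_group",
        "nih": "placeName_city_organisation",
    }
}

def hard_alias(document, dataset, aliases=ALIASES):
    """Physically rename fields in a document according to the alias lookup."""
    renames = {old: new for new, lookup in aliases.items()
               for ds, old in lookup.items() if ds == dataset}
    return {renames.get(field, field): value
            for field, value in document.items()}
```

For a `meetup` document, `hard_alias({"placeName_city_group": "London", "id": 7}, "meetup")` yields `{"city": "London", "id": 7}`.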
You will also notice the `nulls.json` file in the `health-scanner` endpoint. This is a relatively experimental feature for automatically nullifying values on ingestion through `ElasticsearchPlus`, in lieu of proper exploratory data analysis. The logic and format for this is documented here.
Mapping construction hierarchy¶
Each mapping is constructed by overriding nested fields using the `defaults`, `datasets` and `endpoints` mappings, in that order (i.e. `endpoints` override nested fields in `datasets`, and `datasets` override those in `defaults`). If you would like to “switch off” a field from the `defaults` or `datasets` mappings, you should set the value of the nested field to `null`. For example:
```json
{
    "mappings": {
        "_doc": {
            "dynamic": "strict",
            "properties": {
                "placeName_zipcode_organisation": null
            }
        }
    }
}
```
will simply “switch off” the field `placeName_zipcode_organisation`, which was specified in `datasets`.
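The override-then-prune behaviour can be sketched as a recursive merge in which a `null` (Python `None`) value deletes the corresponding key. This is an illustration of the logic, not the DAPS code itself:

```python
def merge_mappings(base, override):
    """Recursively override nested fields in base; a None value removes the key."""
    merged = dict(base)
    for key, value in override.items():
        if value is None:
            merged.pop(key, None)  # "switch off" the field
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_mappings(merged[key], value)
        else:
            merged[key] = value
    return merged
```

Applying `merge_mappings(defaults, dataset)` and then merging the endpoint mapping on top of the result reproduces the hierarchy described above.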
The logic for the mapping construction hierarchy is implemented in the `orms.orm_utils.get_es_mapping` function:
```python
def get_es_mapping(dataset, endpoint):
    '''Load the ES mapping for this dataset and endpoint,
    including aliases.

    Args:
        dataset (str): Name of the dataset for the ES mapping.
        endpoint (str): Name of the AWS ES endpoint.
    Returns:
        :obj:`dict`
    '''
    mapping = _get_es_mapping(dataset, endpoint)
    _apply_alias(mapping, dataset, endpoint)
    _prune_nested(mapping)  # prunes any nested keys with null values
    return mapping
```
Integrated tests¶
The following `pytest` tests are made (and triggered on PR via Travis):

- `aliases.json` files are checked for consistency with the available `datasets`.
- All mappings for each entry in `datasets` and `endpoints` are fully generated, and tested for compatibility with the schema transformations (which are, in turn, checked against the valid ontology in `ontology.json`).
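The first of these checks can be sketched as a small `pytest` test; the helper and the data below are hypothetical, since the real test reads the mapping files from disk:

```python
def undefined_datasets(aliases, datasets):
    """Datasets referenced by an aliases.json with no <dataset>_mapping.json file."""
    referenced = {dataset for lookup in aliases.values() for dataset in lookup}
    return referenced - set(datasets)

def test_aliases_consistent_with_datasets():
    # Stand-in data: the real test would load aliases.json and list
    # the datasets/ directory.
    aliases = {"city": {"companies": "placeName_city_organisation",
                        "meetup": "placeName_city_group"}}
    available = ["arxiv", "companies", "meetup", "nih", "patstat"]
    assert undefined_datasets(aliases, available) == set()
```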
Features in DAPS2¶
- The index version (e.g. `'arxiv': 4` in `elasticsearch.yaml`) will be automatically generated from semantic versioning and the git hash in DAPS2; therefore the `indexes` field will consolidate to an itemised list of indexes.
- The mappings under `datasets` will be automatically generated from the open ontology, which will be baked into the tier-0 schemas. This will render `schema_transformations` redundant.
- Elasticsearch components will be factored out of `orm_utils`.