arXiv data (technical research)
Data collection and processing pipeline for arXiv data, principally for the arXlive platform. This pipeline orchestrates the collection of arXiv data, enrichment (via MAG and GRID), topic modelling, and novelty (lolvelty) measurement.
Collection task
Luigi routine to collect new data from the arXiv API and load it into MySQL.
class CollectNewTask(*args, **kwargs)
Bases: luigi.task.Task
Collect new data from the arXiv API and dump the data into the MySQL server.
Parameters:
- date (datetime) – Datetime used to label the outputs
- _routine_id (str) – String used to label the AWS task
- db_config_env (str) – Environment variable pointing to the db config file
- db_config_path (str) – Path to the output database configuration
- insert_batch_size (int) – Number of records to insert into the database at once
- articles_from_date (str) – New and updated articles from this date will be retrieved. Must be in YYYY-MM-DD format
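The two constraints in the parameter list above (a strict YYYY-MM-DD date string and a fixed insert batch size) can be illustrated with a minimal standard-library sketch. The helper names `parse_articles_from_date` and `batches` are hypothetical and are not taken from the pipeline's code; this only shows one way such parameters might be honoured.

```python
from datetime import date, datetime
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def parse_articles_from_date(value: str) -> date:
    """Parse a strict, zero-padded YYYY-MM-DD string (hypothetical helper)."""
    parsed = datetime.strptime(value, "%Y-%m-%d").date()
    # strptime tolerates unpadded fields such as "2019-3-1", so compare
    # against the canonical ISO form to enforce the documented format.
    if parsed.isoformat() != value:
        raise ValueError(f"{value!r} is not in YYYY-MM-DD format")
    return parsed


def batches(records: Iterable[T], insert_batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most `insert_batch_size` records, preserving order."""
    if insert_batch_size < 1:
        raise ValueError("insert_batch_size must be a positive integer")
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) == insert_batch_size:
            yield batch
            batch = []
    if batch:  # final batch may be smaller than insert_batch_size
        yield batch
```

Each yielded batch would then be handed to whatever database insert routine is in use; that part is deliberately left out because the source does not describe it.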
date = <luigi.parameter.DateParameter object>
test = <luigi.parameter.BoolParameter object>
db_config_env = <luigi.parameter.Parameter object>
db_config_path = <luigi.parameter.Parameter object>
insert_batch_size = <luigi.parameter.IntParameter object>
articles_from_date = <luigi.parameter.Parameter object>
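The `db_config_env` parameter names an environment variable that holds the location of the database config file. The sketch below shows that indirection using only the standard library. It assumes an INI-style file with a `mysqldb` section; both the file format and the section name are assumptions made for illustration, and `load_db_config` is a hypothetical helper rather than a function from this package.

```python
import configparser
import os


def load_db_config(db_config_env: str, section: str = "mysqldb") -> dict:
    """Read one section of the INI file whose path is stored in an environment variable."""
    path = os.environ.get(db_config_env)
    if path is None:
        raise KeyError(f"Environment variable {db_config_env!r} is not set")
    parser = configparser.ConfigParser()
    # ConfigParser.read() returns the list of files it managed to parse,
    # so an empty list means the file was missing or unreadable.
    if not parser.read(path):
        raise FileNotFoundError(path)
    return dict(parser[section])  # raises KeyError if the section is absent
```

Keeping the path in an environment variable means the same task definition can run against different databases without changing its parameters.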
Date task
Luigi wrapper to identify the date of the last iterative data collection.
class DateTask(*args, **kwargs)
Bases: luigi.task.WrapperTask
Collect new data from the arXiv API and dump the data into the MySQL server.
Parameters:
- date (datetime) – Datetime used to label the outputs
- _routine_id (str) – String used to label the AWS task
- db_config_env (str) – Environment variable pointing to the db config file
- db_config_path (str) – Path to the output database configuration
- insert_batch_size (int) – Number of records to insert into the database at once
- articles_from_date (str) – New and updated articles from this date will be retrieved. Must be in YYYY-MM-DD format
date = <luigi.parameter.DateParameter object>
test = <luigi.parameter.BoolParameter object>
db_config_path = <luigi.parameter.Parameter object>
db_config_env = <luigi.parameter.Parameter object>
insert_batch_size = <luigi.parameter.IntParameter object>
articles_from_date = <luigi.parameter.Parameter object>
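How the date of the last collection is actually determined is not described here. As a rough illustration only, the sketch below assumes the dates of earlier runs are available as a collection and picks the most recent one, falling back to a caller-supplied default for a first run. The function name and both arguments are hypothetical.

```python
from datetime import date
from typing import Iterable


def date_of_last_collection(previous_runs: Iterable[date], fallback: date) -> date:
    """Return the most recent previous run date, or `fallback` if there are none."""
    return max(previous_runs, default=fallback)


# The result can be formatted as the YYYY-MM-DD string that
# articles_from_date expects.
articles_from_date = date_of_last_collection(
    [date(2019, 1, 14), date(2019, 3, 1)], fallback=date(2000, 1, 1)
).isoformat()
```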
arXiv enriched with MAG (API)
Luigi routine to query the Microsoft Academic Graph for additional data and append it to the existing data in the database.
class QueryMagTask(*args, **kwargs)
Bases: luigi.task.Task
Query the MAG for additional data to append to the arXiv articles, primarily the fields of study.
Parameters:
- date (datetime) – Datetime used to label the outputs
- _routine_id (str) – String used to label the AWS task
- db_config_env (str) – Environment variable pointing to the db config file
- db_config_path (str) – Path to the output database configuration
- mag_config_path (str) – Microsoft Academic Graph API key configuration path
- insert_batch_size (int) – Number of records to insert into the database at once (not used in this task but passed down to others)
- articles_from_date (str) – New and updated articles from this date will be retrieved. Must be in YYYY-MM-DD format (not used in this task but passed down to others)
date = <luigi.parameter.DateParameter object>
test = <luigi.parameter.BoolParameter object>
db_config_env = <luigi.parameter.Parameter object>
db_config_path = <luigi.parameter.Parameter object>
mag_config_path = <luigi.parameter.Parameter object>
insert_batch_size = <luigi.parameter.IntParameter object>
articles_from_date = <luigi.parameter.Parameter object>
requires()
The Tasks that this Task depends on.
A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.
See Task.requires
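The rule that a task only runs once everything it requires has completed amounts to a depth-first walk of the dependency graph. The sketch below illustrates that idea in plain Python. It is not Luigi's scheduler, and the task names in the usage example are hypothetical placeholders rather than the real dependency graph of this pipeline.

```python
from typing import Dict, List, Set


def run_order(requires: Dict[str, List[str]], target: str) -> List[str]:
    """Return the tasks needed for `target`, each listed after everything it requires."""
    order: List[str] = []
    done: Set[str] = set()
    in_progress: Set[str] = set()

    def visit(task: str) -> None:
        if task in done:
            return
        if task in in_progress:
            raise ValueError(f"Circular dependency involving {task!r}")
        in_progress.add(task)
        for dependency in requires.get(task, []):
            visit(dependency)  # dependencies are scheduled first
        in_progress.discard(task)
        done.add(task)
        order.append(task)

    visit(target)
    return order


# Hypothetical three-step chain: geocode needs enrich, which needs collect.
example_order = run_order({"geocode": ["enrich"], "enrich": ["collect"]}, "geocode")
```

A task with no entry in the mapping has no requirements, which mirrors the case where `requires()` is not overridden.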
arXiv enriched with MAG (SPARQL)
Luigi routine to query the Microsoft Academic Graph for additional data and append it to the existing data in the database. This is to collect information which is difficult to retrieve via the MAG API.
class MagSparqlTask(*args, **kwargs)
Bases: luigi.task.Task
Query the MAG for additional data to append to the arXiv articles, primarily the fields of study.
Parameters:
- date (datetime) – Datetime used to label the outputs
- _routine_id (str) – String used to label the AWS task
- db_config_env (str) – Environment variable pointing to the db config file
- db_config_path (str) – Path to the output database configuration
- mag_config_path (str) – Microsoft Academic Graph API key configuration path
- insert_batch_size (int) – Number of records to insert into the database at once (not used in this task but passed down to others)
- articles_from_date (str) – New and updated articles from this date will be retrieved. Must be in YYYY-MM-DD format (not used in this task but passed down to others)
date = <luigi.parameter.DateParameter object>
test = <luigi.parameter.BoolParameter object>
db_config_env = <luigi.parameter.Parameter object>
db_config_path = <luigi.parameter.Parameter object>
mag_config_path = <luigi.parameter.Parameter object>
insert_batch_size = <luigi.parameter.IntParameter object>
articles_from_date = <luigi.parameter.Parameter object>
requires()
The Tasks that this Task depends on.
A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.
See Task.requires
arXiv enriched with GRID
Luigi routine to look up arXiv authors’ institutes via the GRID data, in order to “geocode” arXiv articles. Institute names are matched to GRID data via smart(ish) fuzzy matching, which gives a confidence score per match.
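The matching algorithm itself is not described here. As a stand-in, the sketch below uses `difflib.SequenceMatcher` from the standard library to show what fuzzy matching with a per-match confidence score can look like. The `normalise` and `best_match` helpers and the 0.8 threshold are assumptions for illustration, not the pipeline's own method.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not affect the score."""
    return " ".join(name.lower().split())


def best_match(
    institute: str, candidates: Iterable[str], threshold: float = 0.8
) -> Optional[Tuple[str, float]]:
    """Return (candidate, score) for the closest candidate, or None if below `threshold`."""
    best: Optional[Tuple[str, float]] = None
    for candidate in candidates:
        # ratio() is a similarity in [0, 1]; it serves as the confidence score here.
        score = SequenceMatcher(None, normalise(institute), normalise(candidate)).ratio()
        if best is None or score > best[1]:
            best = (candidate, score)
    if best is None or best[1] < threshold:
        return None
    return best
```

Returning the score alongside the match lets downstream code decide how much to trust each geocoded institute instead of treating all matches equally.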
class GridTask(*args, **kwargs)
Bases: luigi.task.Task
Join arXiv articles with GRID data for institute addresses and geocoding.
Parameters:
- date (datetime) – Datetime used to label the outputs
- _routine_id (str) – String used to label the AWS task
- db_config_env (str) – Environment variable pointing to the db config file
- db_config_path (str) – Path to the output database configuration
- mag_config_path (str) – Microsoft Academic Graph API key configuration path
- insert_batch_size (int) – Number of records to insert into the database at once (not used in this task but passed down to others)
- articles_from_date (str) – New and updated articles from this date will be retrieved. Must be in YYYY-MM-DD format (not used in this task but passed down to others)
date = <luigi.parameter.DateParameter object>
test = <luigi.parameter.BoolParameter object>
db_config_env = <luigi.parameter.Parameter object>
db_config_path = <luigi.parameter.Parameter object>
mag_config_path = <luigi.parameter.Parameter object>
insert_batch_size = <luigi.parameter.IntParameter object>
articles_from_date = <luigi.parameter.Parameter object>
requires()
The Tasks that this Task depends on.
A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.
See Task.requires
class GridRootTask(*args, **kwargs)
Bases: luigi.task.WrapperTask
date = <luigi.parameter.DateParameter object>
db_config_path = <luigi.parameter.Parameter object>
production = <luigi.parameter.BoolParameter object>
drop_and_recreate = <luigi.parameter.BoolParameter object>
articles_from_date = <luigi.parameter.Parameter object>
insert_batch_size = <luigi.parameter.IntParameter object>
debug = <luigi.parameter.BoolParameter object>
-