arXiv data (technical research)

Data collection and processing pipeline for arXiv data, principally for the arXlive platform. This pipeline orchestrates the collection of arXiv data, enrichment (via MAG and GRID), topic modelling, and novelty (lolvelty) measurement.

Collection task

Luigi routine to collect new data from the arXiv API and load it into MySQL.

class CollectNewTask(*args, **kwargs)[source]

Bases: luigi.task.Task

Collect new data from the arXiv API and load it into the MySQL server.

Parameters:
  • date (datetime) – Datetime used to label the outputs
  • _routine_id (str) – String used to label the AWS task
  • db_config_env (str) – Environment variable pointing to the db config file
  • db_config_path (str) – The output database configuration
  • insert_batch_size (int) – Number of records to insert into the database at once
  • articles_from_date (str) – New and updated articles from this date will be retrieved. Must be in YYYY-MM-DD format
date = <luigi.parameter.DateParameter object>
test = <luigi.parameter.BoolParameter object>
db_config_env = <luigi.parameter.Parameter object>
db_config_path = <luigi.parameter.Parameter object>
insert_batch_size = <luigi.parameter.IntParameter object>
articles_from_date = <luigi.parameter.Parameter object>
output()[source]

Points to the output database engine

run()[source]

The task run method, to be overridden in a subclass.

See Task.run
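
The two user-facing parameters above, articles_from_date and insert_batch_size, can be illustrated with a minimal standard-library sketch. It is not the task's actual implementation: the helper names parse_articles_from_date and batches are hypothetical, and the records are made-up placeholders.

```python
from datetime import date, datetime
from typing import Iterable, Iterator, List


def parse_articles_from_date(value: str) -> date:
    """Parse a YYYY-MM-DD string; raises ValueError if it cannot be parsed."""
    return datetime.strptime(value, "%Y-%m-%d").date()


def batches(records: Iterable[dict], insert_batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most insert_batch_size records, preserving order."""
    if insert_batch_size < 1:
        raise ValueError("insert_batch_size must be a positive integer")
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == insert_batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


# Hypothetical usage: seven placeholder records inserted three at a time
start = parse_articles_from_date("2019-03-01")
rows = [{"id": f"article-{i}"} for i in range(7)]
sizes = [len(b) for b in batches(rows, insert_batch_size=3)]
```

Batching in this way keeps each database insert bounded in size, which is presumably the purpose of insert_batch_size.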

Date task

Luigi wrapper to identify the date of the last iterative data collection.

class DateTask(*args, **kwargs)[source]

Bases: luigi.task.WrapperTask

Identify the date of the last successful update, then collect new data from the arXiv API and load it into the MySQL server.

Parameters:
  • date (datetime) – Datetime used to label the outputs
  • _routine_id (str) – String used to label the AWS task
  • db_config_env (str) – Environment variable pointing to the db config file
  • db_config_path (str) – The output database configuration
  • insert_batch_size (int) – Number of records to insert into the database at once
  • articles_from_date (str) – New and updated articles from this date will be retrieved. Must be in YYYY-MM-DD format
date = <luigi.parameter.DateParameter object>
test = <luigi.parameter.BoolParameter object>
db_config_path = <luigi.parameter.Parameter object>
db_config_env = <luigi.parameter.Parameter object>
insert_batch_size = <luigi.parameter.IntParameter object>
articles_from_date = <luigi.parameter.Parameter object>
requires()[source]

Collects the last date of successful update from the database and launches the iterative data collection task.
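
As a rough illustration of looking up the last update date, the sketch below uses an in-memory SQLite database as a stand-in for the MySQL server. The table name articles, the column updated, and the fallback date are all assumptions made for this example, not the pipeline's real schema.

```python
import sqlite3
from datetime import date, datetime


def last_successful_update(conn: sqlite3.Connection, default: str = "2000-01-01") -> date:
    """Return the most recent 'updated' date, or the default if the table is empty."""
    # ISO-formatted (YYYY-MM-DD) strings sort chronologically, so MAX() works on text
    row = conn.execute("SELECT MAX(updated) FROM articles").fetchone()
    latest = row[0] if row and row[0] is not None else default
    return datetime.strptime(latest, "%Y-%m-%d").date()


# Hypothetical data: two articles already stored by a previous run
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, updated TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("a", "2019-01-03"), ("b", "2019-02-11")],
)
latest = last_successful_update(conn)

# An empty table falls back to the default date
empty = sqlite3.connect(":memory:")
empty.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, updated TEXT)")
fallback = last_successful_update(empty)
```

The resulting date would then be formatted as YYYY-MM-DD and passed on as articles_from_date.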

arXiv enriched with MAG (API)

Luigi routine to query the Microsoft Academic Graph (MAG) for additional data and append it to the existing data in the database.

class QueryMagTask(*args, **kwargs)[source]

Bases: luigi.task.Task

Query the MAG for additional data to append to the arXiv articles, primarily the fields of study.

Parameters:
  • date (datetime) – Datetime used to label the outputs
  • _routine_id (str) – String used to label the AWS task
  • db_config_env (str) – Environment variable pointing to the db config file
  • db_config_path (str) – The output database configuration
  • mag_config_path (str) – Path to the Microsoft Academic Graph API key configuration
  • insert_batch_size (int) – Number of records to insert into the database at once (not used in this task but passed down to others)
  • articles_from_date (str) – New and updated articles from this date will be retrieved. Must be in YYYY-MM-DD format (not used in this task but passed down to others)
date = <luigi.parameter.DateParameter object>
test = <luigi.parameter.BoolParameter object>
db_config_env = <luigi.parameter.Parameter object>
db_config_path = <luigi.parameter.Parameter object>
mag_config_path = <luigi.parameter.Parameter object>
insert_batch_size = <luigi.parameter.IntParameter object>
articles_from_date = <luigi.parameter.Parameter object>
output()[source]

Points to the output database engine

requires()[source]

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires
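
The dependency rule described above can be sketched in plain Python. This is a toy model only: SketchTask and build are hypothetical names, this is not Luigi's scheduler, and the task names do not reflect the pipeline's real dependency order.

```python
from typing import List


class SketchTask:
    """Toy stand-in for a task; NOT luigi.Task."""

    def __init__(self, name: str, deps=None):
        self.name = name
        self._deps = deps
        self.done = False

    def requires(self):
        # May return None, a single task, a list of tasks, or a dict of tasks
        return self._deps

    def run(self, log: List[str]) -> None:
        log.append(self.name)
        self.done = True


def flatten(deps) -> List[SketchTask]:
    """Normalise the three allowed return shapes of requires() into a list."""
    if deps is None:
        return []
    if isinstance(deps, SketchTask):
        return [deps]
    if isinstance(deps, dict):
        return list(deps.values())
    return list(deps)


def build(task: SketchTask, log: List[str]) -> None:
    """Run every required task before the task itself, skipping completed ones."""
    for dep in flatten(task.requires()):
        if not dep.done:
            build(dep, log)
    if not task.done:
        task.run(log)


upstream = SketchTask("upstream")
middle = SketchTask("middle", deps=upstream)            # single task
downstream = SketchTask("downstream", deps={"m": middle})  # dict of tasks
order: List[str] = []
build(downstream, order)
```

Requesting only the final task is enough: everything it depends on runs first, and already completed tasks are not run again.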

run()[source]

The task run method, to be overridden in a subclass.

See Task.run

arXiv enriched with MAG (SPARQL)

Luigi routine to query the Microsoft Academic Graph (MAG) for additional data and append it to the existing data in the database. This collects information which is difficult to retrieve via the MAG API.

class MagSparqlTask(*args, **kwargs)[source]

Bases: luigi.task.Task

Query the MAG for additional data to append to the arXiv articles, primarily the fields of study.

Parameters:
  • date (datetime) – Datetime used to label the outputs
  • _routine_id (str) – String used to label the AWS task
  • db_config_env (str) – Environment variable pointing to the db config file
  • db_config_path (str) – The output database configuration
  • mag_config_path (str) – Path to the Microsoft Academic Graph API key configuration
  • insert_batch_size (int) – Number of records to insert into the database at once (not used in this task but passed down to others)
  • articles_from_date (str) – New and updated articles from this date will be retrieved. Must be in YYYY-MM-DD format (not used in this task but passed down to others)
date = <luigi.parameter.DateParameter object>
test = <luigi.parameter.BoolParameter object>
db_config_env = <luigi.parameter.Parameter object>
db_config_path = <luigi.parameter.Parameter object>
mag_config_path = <luigi.parameter.Parameter object>
insert_batch_size = <luigi.parameter.IntParameter object>
articles_from_date = <luigi.parameter.Parameter object>
output()[source]

Points to the output database engine

requires()[source]

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()[source]

The task run method, to be overridden in a subclass.

See Task.run

arXiv enriched with GRID

Luigi routine to look up arXiv authors’ institutes via the GRID data, in order to “geocode” arXiv articles. Institute names are matched to GRID data via smart(ish) fuzzy matching, which gives a confidence score per match.
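
To give a feel for fuzzy matching with a confidence score, here is a minimal sketch using difflib from the standard library. It is a swapped-in technique and not necessarily the matcher this routine uses; the function names and the GRID identifiers are hypothetical placeholders.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def normalise(name: str) -> str:
    """Lower-case the name, drop commas and collapse repeated whitespace."""
    return " ".join(name.lower().replace(",", " ").split())


def best_match(institute: str, grid_names: Dict[str, str]) -> Tuple[Optional[str], float]:
    """Return (grid_id, score) for the most similar name; score lies in [0, 1]."""
    target = normalise(institute)
    best_id, best_score = None, 0.0
    for grid_id, grid_name in grid_names.items():
        score = SequenceMatcher(None, target, normalise(grid_name)).ratio()
        if score > best_score:
            best_id, best_score = grid_id, score
    return best_id, best_score


# Placeholder identifiers, not real GRID ids
grid_names = {
    "grid.example.1": "University of Oxford",
    "grid.example.2": "University of Cambridge",
}
match_id, confidence = best_match("Univ. of Oxford", grid_names)
```

In practice a threshold on the score would decide whether a match is accepted or discarded.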

class GridTask(*args, **kwargs)[source]

Bases: luigi.task.Task

Join arXiv articles with GRID data for institute addresses and geocoding.

Parameters:
  • date (datetime) – Datetime used to label the outputs
  • _routine_id (str) – String used to label the AWS task
  • db_config_env (str) – Environment variable pointing to the db config file
  • db_config_path (str) – The output database configuration
  • mag_config_path (str) – Path to the Microsoft Academic Graph API key configuration
  • insert_batch_size (int) – Number of records to insert into the database at once (not used in this task but passed down to others)
  • articles_from_date (str) – New and updated articles from this date will be retrieved. Must be in YYYY-MM-DD format (not used in this task but passed down to others)
date = <luigi.parameter.DateParameter object>
test = <luigi.parameter.BoolParameter object>
db_config_env = <luigi.parameter.Parameter object>
db_config_path = <luigi.parameter.Parameter object>
mag_config_path = <luigi.parameter.Parameter object>
insert_batch_size = <luigi.parameter.IntParameter object>
articles_from_date = <luigi.parameter.Parameter object>
output()[source]

Points to the output database engine

requires()[source]

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()[source]

The task run method, to be overridden in a subclass.

See Task.run

class GridRootTask(*args, **kwargs)[source]

Bases: luigi.task.WrapperTask

date = <luigi.parameter.DateParameter object>
db_config_path = <luigi.parameter.Parameter object>
production = <luigi.parameter.BoolParameter object>
drop_and_recreate = <luigi.parameter.BoolParameter object>
articles_from_date = <luigi.parameter.Parameter object>
insert_batch_size = <luigi.parameter.IntParameter object>
debug = <luigi.parameter.BoolParameter object>
requires()[source]

Collects the database configurations and executes the central task.