9.2. Data harvesting: extracting metadata from the web

How does UnHIDE harvest data?

Data harvesting and mining for the knowledge graph is done by Harvester classes. For each interface a specific Harvester class should be implemented. All Harvester classes should inherit from existing Harvesters or the BaseHarvester, which currently specifies that:

  1. Each harvester needs a run method

  2. Each harvester can read from config.yml

  3. Each harvester reads the time it was last run from a <harvesterclass>.last_run file
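
The three requirements above can be illustrated with a minimal sketch. It is not the project's actual BaseHarvester: the class names BaseHarvesterSketch and DemoHarvester, the constructor arguments, the helper methods, and the ISO timestamp format in the .last_run file are all assumptions made for illustration.

```python
import tempfile
from abc import ABC, abstractmethod
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


class BaseHarvesterSketch(ABC):
    """Hypothetical stand-in for BaseHarvester, not the real implementation."""

    def __init__(self, config: dict, state_dir: Path):
        # Assumption: the parsed config.yml is a dict keyed by harvester class name.
        self.config = config.get(type(self).__name__, {})
        # Assumption: the timestamp file is named <harvesterclass>.last_run.
        self.last_run_file = state_dir / f"{type(self).__name__}.last_run"

    def get_last_run(self) -> Optional[datetime]:
        """Return the time of the previous run, or None if there was none."""
        if not self.last_run_file.exists():
            return None
        return datetime.fromisoformat(self.last_run_file.read_text().strip())

    def set_last_run(self, when: datetime) -> None:
        """Persist the time of the current run (ISO 8601 is an assumption)."""
        self.last_run_file.write_text(when.isoformat())

    @abstractmethod
    def run(self, since: Optional[datetime] = None) -> list:
        """Every concrete harvester has to implement run."""


class DemoHarvester(BaseHarvesterSketch):
    def run(self, since: Optional[datetime] = None) -> list:
        # A real harvester would fetch records here; this one echoes its sources.
        sources = list(self.config.get("sources", []))
        self.set_last_run(datetime.now(timezone.utc))
        return sources


state_dir = Path(tempfile.mkdtemp())
harvester = DemoHarvester({"DemoHarvester": {"sources": ["https://example.org/a"]}}, state_dir)
first_run_time = harvester.get_last_run()   # None before the first run
harvested = harvester.run()
```

Keeping the timestamp in a small per-class file lets a later run pass the previous time as `since` and fetch only newer records, where the interface supports it.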

Implemented harvester classes include:

| Name (CLI) | Class name | Interface | Comment |
|---|---|---|---|
| sitemap | SitemapHarvester | sitemaps | Selecting record links from the sitemap requires expression matching. Relies on the advertools lib. |
| oai | OAIHarvester | OAI-PMH | Relies on the oai lib. For the library providers, Dublin Core is converted to schema.org. |
| git | GitHarvester | Git, GitLab/GitHub API | Relies on codemetapy and codemeta-harvester as well as the GitLab/GitHub APIs. |
| datacite | DataciteHarvester | REST API & GraphQL endpoint | schema.org extracted through content negotiation. |
| feed | FeedHarvester | RSS & Atom feeds | Relies on the atoma library, and only works if schema.org metadata can be extracted from the landing pages. Can only get recent data; useful for event metadata. |
| indico | IndicoHarvester | Indico REST API | Directly extracts schema.org metadata through the API; requires an access token. |
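
To make the "expression matching" step of the sitemap harvester concrete, here is a minimal sketch. It uses only the Python standard library instead of advertools, and the function name select_record_links, the example sitemap, and the pattern are assumptions chosen for illustration.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def select_record_links(sitemap_xml: str, pattern: str) -> List[str]:
    """Return the <loc> URLs of a sitemap that match a regular expression."""
    root = ET.fromstring(sitemap_xml)
    regex = re.compile(pattern)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]
    return [url for url in urls if regex.search(url)]


# Hypothetical sitemap: two record pages and one page that is not a record.
example_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/record/1</loc></url>
  <url><loc>https://example.org/about</loc></url>
  <url><loc>https://example.org/record/2</loc></url>
</urlset>"""

record_links = select_record_links(example_sitemap, r"/record/\d+$")
```

Only the URLs that look like record landing pages survive the filter, so navigation and information pages are not visited during harvesting.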

JSON-LD metadata from the landing pages of records is extracted via the extruct library if it cannot be retrieved directly through a standardized interface.
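
The idea behind landing-page extraction can be sketched as follows. This is a simplified standard-library stand-in, not extruct itself: it only collects the contents of `<script type="application/ld+json">` elements. The names JsonLdParser and extract_jsonld and the example page are hypothetical.

```python
import json
from html.parser import HTMLParser
from typing import List


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of all JSON-LD script blocks in a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer: List[str] = []
        self.items: list = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_jsonld(html: str) -> list:
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical landing page with one embedded schema.org record.
page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset", "name": "Demo"}
</script>
</head><body></body></html>"""

records = extract_jsonld(page)
```

extruct covers further embedded syntaxes such as microdata and RDFa and is more tolerant of real-world markup, which is presumably why the harvesters rely on it rather than on a hand-written parser like this one.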

All harvesters are exposed on the hmc-unhide command-line interface. By default, they store the extracted metadata in the internal data model LinkedDataObject, which has a serialization containing some provenance information, the original source data, and the uplifted data, and which provides methods for validation.
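
As a rough illustration of such a data model, the sketch below bundles original data, uplifted data, and provenance in one serializable object. It is not the real LinkedDataObject: the class name LinkedDataObjectSketch, its field names, and the very simple validation rule are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LinkedDataObjectSketch:
    """Hypothetical container loosely modelled on the description above."""

    original: dict                                   # metadata exactly as harvested
    uplifted: dict = field(default_factory=dict)     # enriched or corrected copy
    provenance: dict = field(default_factory=dict)   # e.g. source URL, harvester name

    def validate(self) -> bool:
        # Assumption: a record counts as valid if it declares a context and a type.
        return "@context" in self.original and "@type" in self.original

    def serialize(self) -> str:
        """Serialize all three parts into one JSON document."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


record = LinkedDataObjectSketch(
    original={"@context": "https://schema.org", "@type": "Dataset", "name": "Demo"},
    provenance={"source": "https://example.org/record/1", "harvester": "SitemapHarvester"},
)
serialized = record.serialize()
```

Storing the original next to the uplifted data means any later enrichment step can be traced back to, or recomputed from, what the source actually delivered.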

A single central YAML configuration file called config.yml specifies, for each harvester class, the sources to harvest and any harvester- or source-specific configuration.
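
One way to picture this layout is as the nested mapping a YAML parser might return for config.yml. The sketch below is purely illustrative: the keys sources, name, url, match_pattern, and metadata_prefix, and the helper iter_sources, are assumptions and may not match the real file.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical parsed config.yml: one top-level section per harvester class.
config: Dict[str, dict] = {
    "SitemapHarvester": {
        "sources": [
            {
                "name": "example-center",
                "url": "https://example.org/sitemap.xml",
                "match_pattern": r"/record/\d+",   # source-specific setting
            },
        ],
    },
    "OAIHarvester": {
        "metadata_prefix": "oai_dc",               # harvester-wide setting
        "sources": [
            {"name": "example-library", "url": "https://example.org/oai"},
        ],
    },
}


def iter_sources(config: Dict[str, dict]) -> Iterator[Tuple[str, dict]]:
    """Yield (harvester class name, source settings) pairs."""
    for harvester_name, section in config.items():
        for source in section.get("sources", []):
            yield harvester_name, source


pairs = list(iter_sources(config))
```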