9.2. Data harvesting: extracting metadata from the web
How does UnHIDE harvest data?
Data harvesting and mining for the knowledge graph is done by Harvester classes.
For each interface a specific Harvester class should be implemented.
All Harvester classes should inherit from existing Harvesters or the `BaseHarvester`, which currently specifies that:
- Each harvester needs a `run` method.
- Each harvester can read from the `config.yml`.
- Each harvester reads the time it was last run from a `<harvesterclass>.last_run` file.
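The three points above can be sketched as a small abstract base class. This is a minimal illustration under stated assumptions, not the UnHIDE implementation: the names `BaseHarvesterSketch`, `EchoHarvester`, `get_last_run`, `set_last_run` and the `state_dir` argument are hypothetical, and `config` stands for the already parsed content of `config.yml`.

```python
from abc import ABC, abstractmethod
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


class BaseHarvesterSketch(ABC):
    """Hypothetical stand-in for BaseHarvester; not the UnHIDE code."""

    def __init__(self, config: dict, state_dir: Path) -> None:
        # Assumption: `config` is the parsed content of config.yml.
        self.config = config
        # <harvesterclass>.last_run, kept in an assumed state directory.
        self.last_run_file = state_dir / f"{type(self).__name__}.last_run"

    def get_last_run(self) -> Optional[datetime]:
        """Return the time of the previous run, or None on the first run."""
        if not self.last_run_file.exists():
            return None
        return datetime.fromisoformat(self.last_run_file.read_text().strip())

    def set_last_run(self, when: datetime) -> None:
        """Persist the time of this run as an ISO 8601 timestamp."""
        self.last_run_file.write_text(when.isoformat())

    @abstractmethod
    def run(self) -> None:
        """Harvest all configured sources; every subclass must implement this."""


class EchoHarvester(BaseHarvesterSketch):
    """Toy subclass that only records which sources it would harvest."""

    def run(self) -> None:
        since = self.get_last_run()
        # A real harvester would fetch records changed after `since` here.
        self.harvested = [(url, since) for url in self.config.get("sources", [])]
        self.set_last_run(datetime.now(timezone.utc))
```

Storing the timestamp per class lets a later run restrict itself to records that changed since the previous one, provided the interface supports such a filter.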
Implemented harvester classes include:
Name (CLI) | Class name | Interface | Comment |
---|---|---|---|
sitemap | SitemapHarvester | sitemaps | Selecting record links from the sitemap requires expression matching. Relies on the advertools lib. |
oai | OAIHarvester | OAI-PMH | Relies on the oai lib. For the library providers, Dublin Core is converted to schema.org. |
git | GitHarvester | Git, Gitlab/Github API | Relies on codemetapy and codemeta-harvester as well as the Gitlab/Github APIs. |
datacite | DataciteHarvester | REST API & GraphQL endpoint | schema.org metadata is extracted through content negotiation. |
feed | FeedHarvester | RSS & Atom feeds | Relies on the atoma library and only works if schema.org metadata can be extracted from the landing pages. Can only get recent data, which is useful for event metadata. |
indico | IndicoHarvester | Indico REST API | Extracts schema.org metadata directly through the API. Requires an access token. |
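To illustrate the expression matching mentioned for the sitemap harvester, the following sketch selects record links from a sitemap with a regular expression. It uses only the Python standard library instead of advertools, and the function name `select_record_links`, the example URLs and the pattern are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def select_record_links(sitemap_xml: str, pattern: str) -> List[str]:
    """Return all <loc> URLs of a sitemap that match the given regex."""
    root = ET.fromstring(sitemap_xml)
    regex = re.compile(pattern)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if regex.search(url)]


# Illustrative sitemap: one page that is not a record, two that are.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/about</loc></url>
  <url><loc>https://example.org/records/1</loc></url>
  <url><loc>https://example.org/records/2</loc></url>
</urlset>"""

record_links = select_record_links(EXAMPLE_SITEMAP, r"/records/\d+$")
```

The pattern has to be chosen per source, because each site uses its own URL scheme for record landing pages.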
JSON-LD metadata from the landing pages of records is extracted via the `extruct` library if it cannot be retrieved directly through a standardized interface.
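What this extraction step amounts to can be sketched with the standard library alone. The code below is a simplified stand-in for what extruct does for JSON-LD, not a use of extruct itself; the names `JsonLdParser` and `extract_jsonld` and the example page are hypothetical.

```python
import json
from html.parser import HTMLParser
from typing import List


class JsonLdParser(HTMLParser):
    """Collects the content of <script type="application/ld+json"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: List[str] = []
        self.documents: list = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_jsonld(html: str) -> list:
    """Return every JSON-LD document embedded in the given HTML page."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.documents


# Illustrative landing page with one embedded schema.org description.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body></body></html>"""

documents = extract_jsonld(PAGE)
```

A landing page without such a script element yields an empty list, which is the case where a harvester has nothing to store for that record.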
All harvesters are exposed on the `hmc-unhide` command line interface.
By default, they store the extracted metadata in the internal data model `LinkedDataObject`, which has a serialization containing some provenance information, the original source data and the uplifted data, and provides methods for validation.
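A data model of this kind could look roughly like the sketch below. It is not the actual `LinkedDataObject`: the class name `LinkedDataRecord`, its field names, the serialization keys and the validation rule are all assumptions chosen to mirror the description above.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LinkedDataRecord:
    """Hypothetical container; names are assumptions, not LinkedDataObject's."""

    original: dict  # metadata exactly as harvested
    uplifted: dict  # metadata after uplifting
    provenance: dict = field(default_factory=dict)

    @classmethod
    def from_harvest(cls, data: dict, source_url: str, harvester: str) -> "LinkedDataRecord":
        provenance = {
            "source": source_url,
            "harvester": harvester,
            "harvested_at": datetime.now(timezone.utc).isoformat(),
        }
        # No uplifting has happened yet, so start from a copy of the original.
        return cls(original=data, uplifted=dict(data), provenance=provenance)

    def validate(self) -> bool:
        """Rough check only: the record should name a context and a type."""
        return "@context" in self.uplifted and "@type" in self.uplifted

    def serialize(self) -> str:
        """Serialize provenance, original and uplifted data into one JSON document."""
        return json.dumps(
            {
                "metadata": self.provenance,
                "original": self.original,
                "derived": self.uplifted,
            },
            indent=2,
            sort_keys=True,
        )


record = LinkedDataRecord.from_harvest(
    {"@context": "https://schema.org", "@type": "Dataset", "name": "Example dataset"},
    source_url="https://example.org/records/1",
    harvester="sitemap",
)
```

Keeping the original next to the uplifted data means an uplifting step can be rerun or audited later without harvesting the source again.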
A single central YAML configuration file called `config.yml` specifies, for each harvester class, the sources to harvest as well as harvester-specific or source-specific configuration.
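The per-class layout described here might look as follows once the YAML file has been parsed into Python objects. The structure, all keys (`sources`, `url`, `match_pattern`, `metadata_prefix`), the provider names and the helper `sources_for` are illustrative assumptions; the real `config.yml` may be organized differently.

```python
from typing import Dict, List

# Assumed shape of config.yml after parsing; keys and URLs are illustrative.
CONFIG: Dict[str, dict] = {
    "SitemapHarvester": {
        "sources": {
            "example_center": [
                {
                    "url": "https://example.org/sitemap.xml",
                    "match_pattern": r"/records/\d+$",
                },
            ],
        },
    },
    "OAIHarvester": {
        "sources": {
            "example_library": [
                {"url": "https://example.org/oai", "metadata_prefix": "oai_dc"},
            ],
        },
    },
}


def sources_for(config: Dict[str, dict], harvester_class: str) -> List[dict]:
    """Flatten the per-provider source entries of one harvester class."""
    section = config.get(harvester_class, {})
    flat = []
    for provider, entries in section.get("sources", {}).items():
        for entry in entries:
            flat.append({"provider": provider, **entry})
    return flat


oai_sources = sources_for(CONFIG, "OAIHarvester")
```

Grouping the entries under the harvester class name keeps source-specific options, such as a match pattern for one sitemap, next to the source they apply to.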