9.2. Data harvesting: extracting metadata from the web
How does UnHIDE harvest data?
Data harvesting and mining for the knowledge graph is done by Harvester classes.
For each interface a specific Harvester class should be implemented.
All Harvester classes should inherit from existing Harvesters or the BaseHarvester, which currently specifies that:
- Each harvester needs a `run` method.
- Each harvester can read from `config.yml`.
- Each harvester reads the time it was last run from a `<harvesterclass>.last_run` file.
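The contract listed above can be pictured with a minimal sketch. This is not the project's BaseHarvester: the class name `BaseHarvesterSketch`, the constructor arguments, the helper methods, the ISO timestamp format of the last-run file, and the `sources` key are all assumptions made for illustration. Only the `run` method, the use of the configuration, and the `<harvesterclass>.last_run` file name come from the description.

```python
from abc import ABC, abstractmethod
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


class BaseHarvesterSketch(ABC):
    """Hypothetical stand-in for BaseHarvester; not the project's actual class."""

    def __init__(self, config: dict, state_dir: Path):
        # Assumption: `config` is the already parsed content of config.yml,
        # with one section per harvester class name.
        self.config = config.get(type(self).__name__, {})
        # File name follows the <harvesterclass>.last_run convention.
        self.last_run_file = Path(state_dir) / f"{type(self).__name__}.last_run"

    def get_last_run(self) -> Optional[datetime]:
        """Return the time of the previous run, or None if there was none."""
        if not self.last_run_file.exists():
            return None
        return datetime.fromisoformat(self.last_run_file.read_text().strip())

    def set_last_run(self, when: datetime) -> None:
        """Persist the run time (assumed here to be an ISO 8601 string)."""
        self.last_run_file.write_text(when.isoformat())

    @abstractmethod
    def run(self, since: Optional[datetime] = None) -> list:
        """Harvest records, optionally only those changed after `since`."""


class DummyHarvester(BaseHarvesterSketch):
    """Toy subclass that fabricates one record link per configured source."""

    def run(self, since: Optional[datetime] = None) -> list:
        records = [f"{url}/record" for url in self.config.get("sources", [])]
        self.set_last_run(datetime.now(timezone.utc))
        return records
```

A subclass for a real interface would replace the body of `run` with calls to that interface and could pass `get_last_run()` as `since` to harvest incrementally.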
Implemented harvester classes include:
| Name (Cli) | Class Name | Interface | Comment | 
|---|---|---|---|
| sitemap | SitemapHarvester | sitemaps | Selecting record links from the sitemap requires expression matching. Relies on the advertools lib. | 
| oai | OAIHarvester | OAI-PMH | Relies on the oai lib. For the library providers, Dublin Core is converted to schema.org. | 
| git | GitHarvester | Git, Gitlab/Github API | Relies on codemetapy and codemeta-harvester as well as gitlab/github APIs. | 
| datacite | DataciteHarvester | REST API & GraphQL endpoint | schema.org extracted through content negotiation. | 
| feed | FeedHarvester | RSS & Atom Feeds | Relies on the atoma library and only works if schema.org metadata can be extracted from the landing pages. Only retrieves recent data, which makes it useful for event metadata. | 
| indico | IndicoHarvester | Indico REST API | Extracts schema.org metadata directly through the API. Requires an access token. | 
JSON-LD metadata is extracted from the landing pages of records via the extruct library if it cannot be retrieved directly through a standardized interface.
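To illustrate what extracting JSON-LD from a landing page involves, the following sketch collects the content of `<script type="application/ld+json">` elements using only the Python standard library. It is a simplified stand-in for what the extruct library does, not the code UnHIDE uses; the names `JsonLdParser` and `extract_jsonld` and the sample page are made up for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed content of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD blocks instead of failing


def extract_jsonld(html: str) -> list:
    """Return all JSON-LD documents embedded in an HTML page."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.documents


# Made-up landing page used only to exercise the function.
landing_page = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body></body></html>
"""
records = extract_jsonld(landing_page)
```

Pages without embedded JSON-LD simply yield an empty list, which is the case where a harvester would have nothing to store for that record.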
All harvesters are exposed through the hmc-unhide command-line interface.
By default, they store the extracted metadata in the internal data model LinkedDataObject, which has a serialization containing some provenance information, the original source data, and the uplifted data, and which provides methods for validation.
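The shape of such a record can be sketched as a small dataclass. This is only an illustration of the idea of keeping original data, uplifted data, and provenance together; the class name `LinkedDataRecordSketch`, its field names, and the very rough validation rule are assumptions and do not reflect the actual LinkedDataObject implementation.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class LinkedDataRecordSketch:
    """Hypothetical stand-in for LinkedDataObject; field names are assumptions."""

    original: dict  # source data exactly as harvested
    derived: dict   # uplifted data after processing
    provenance: dict = field(default_factory=dict)

    def validate(self) -> bool:
        """Rough check: both documents declare a JSON-LD context and type."""
        required = ("@context", "@type")
        return all(key in doc for doc in (self.original, self.derived) for key in required)

    def serialize(self) -> str:
        """Serialize all three parts into one JSON document."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


harvested = {"@context": "https://schema.org", "@type": "Dataset", "name": "example"}
record = LinkedDataRecordSketch(
    original=harvested,
    # Pretend uplifting added an identifier to the harvested data.
    derived=dict(harvested, identifier="https://example.org/record/1"),
    provenance={
        "harvester": "SitemapHarvester",
        "harvested_at": datetime.now(timezone.utc).isoformat(),
    },
)
restored = json.loads(record.serialize())
```

Keeping the original next to the uplifted data means later processing steps can be rerun or audited without harvesting the source again.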
A single central YAML configuration file called config.yml specifies, for each harvester class, the sources to harvest and any harvester- or source-specific configuration.
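One way to picture the interplay of harvester-wide and source-specific settings is the sketch below. The dictionary stands in for the parsed content of config.yml; every key in it (`sources`, `url`, `metadata_prefix`, the source names) and the helper `effective_settings` are hypothetical and chosen only for illustration, so the real file layout may differ.

```python
# Stand-in for the parsed content of config.yml; all keys are hypothetical.
config = {
    "SitemapHarvester": {
        "sources": {
            "center_a": {"url": "https://example.org/sitemap.xml"},
        },
    },
    "OAIHarvester": {
        "metadata_prefix": "oai_dc",  # harvester-wide default
        "sources": {
            "center_b": {"url": "https://example.org/oai"},
            "center_c": {"url": "https://example.net/oai", "metadata_prefix": "datacite"},
        },
    },
}


def effective_settings(config: dict, harvester: str, source: str) -> dict:
    """Merge harvester-wide options with the options of one source."""
    section = config.get(harvester, {})
    # Harvester-wide options are everything except the "sources" mapping.
    settings = {key: value for key, value in section.items() if key != "sources"}
    # Source-specific options override harvester-wide ones.
    settings.update(section.get("sources", {}).get(source, {}))
    return settings
```

With this layout, `effective_settings(config, "OAIHarvester", "center_c")` would pick up the source's own `metadata_prefix`, while `center_b` would fall back to the harvester-wide value.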
