Data Harvesting
Metadata is harvested via established standards such as OAI-PMH or sitemaps whenever possible. Harvesters for custom APIs are only employed and maintained where an API offers large quantities of data, or semantically richer data, compared to an alternative OAI-PMH endpoint or sitemap URL. Harvesting methods are described below; a full list of data sources and the methods used to harvest them can be found [here](docs/DataProv/Data Sources.mdx). Metadata is always harvested and stored in its original form, and transformations are only carried out after harvesting to preserve metadata provenance. The implementation of the harvesting methods can be found in our harvester repository.
Harvesting Methods
OAI-PMH-Endpoints:
The OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) is a widely used standard for exposing structured metadata of digital repositories, libraries and data catalogues. OAI-PMH provides a uniform interface that allows harvesters to retrieve metadata records in bulk through standardized requests in various metadata formats.
Details
Our OAI-PMH pipelines are currently primarily used to harvest metadata from Helmholtz library systems, which means that the majority of records collected through this method describe scholarly documents such as journal articles, books, theses, and other publication-related materials. At present, the KG harvests several metadata formats exposed through OAI-PMH, including OAI-DC (Dublin Core) and MARCXML, both of which are commonly supported by library infrastructures.
In the future, support for OAI-DataCite and OAI-OpenAire is planned in order to integrate richer metadata for research outputs that are registered through DOI-based infrastructures.
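A minimal OAI-PMH harvesting step can be sketched as follows. The endpoint URL, helper names, and sample response below are illustrative only and do not reflect the actual harvester implementation:

```python
import urllib.parse
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def build_list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # A resumptionToken is exclusive: it replaces all other arguments.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urllib.parse.urlencode(params)

def parse_list_records(xml_text):
    """Extract (identifier, title) pairs and the resumptionToken from a response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        ident = rec.findtext(".//oai:identifier", namespaces=NS)
        title = rec.findtext(".//dc:title", namespaces=NS)
        records.append((ident, title))
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, token

# Illustrative response fragment, as an OAI-DC endpoint might return it.
SAMPLE = """<?xml version='1.0'?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <ListRecords>
  <record>
   <header><identifier>oai:example:1</identifier></header>
   <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>Example Record</dc:title>
    </oai_dc:dc>
   </metadata>
  </record>
  <resumptionToken>tok123</resumptionToken>
 </ListRecords>
</OAI-PMH>"""
```

A harvester loops over `build_list_records_url` with the returned resumption token until the endpoint stops emitting one, which is how OAI-PMH paginates bulk retrieval.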
DataCite-API:
DataCite records are retrieved directly through the DataCite REST API. DataCite provides rich metadata for research outputs associated with DOIs, including datasets, software, publications, and other research products. The data is provided in the DataCite schema as well as in Schema.org. Records associated with Helmholtz centers are identified based on the presence of their corresponding ROR (Research Organization Registry) identifier.
Details
The DataCite harvester currently does not rely on DOI prefixes to identify records associated with specific Helmholtz infrastructures, as these prefixes are not comprehensively documented or centrally maintained. Instead, the harvester queries DataCite's GraphQL API for records that reference a curated list of ROR identifiers, ensuring that only assets explicitly linked to Helmholtz organizations are integrated into the graph. This approach provides a clear and reliable scope for harvesting while avoiding the risk of unintentionally ingesting unrelated records.
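Whichever API flavor is used, the core of the scoping step is a search query constrained to a curated ROR list. A REST-style sketch follows; the affiliation field path and query syntax are assumptions that should be checked against DataCite's API documentation, and the ROR identifiers are placeholders:

```python
import urllib.parse

DATACITE_API = "https://api.datacite.org/dois"

# Placeholder IDs standing in for the curated Helmholtz ROR list.
HELMHOLTZ_RORS = ["https://ror.org/012abc345", "https://ror.org/067def890"]

def build_ror_scoped_query(ror_ids, page_size=100):
    """Build a DOI search URL restricted to records that reference one of the
    given ROR identifiers in an affiliation field. The field path below is an
    assumption; consult the DataCite REST API docs for the exact syntax."""
    clauses = [f'creators.affiliation.affiliationIdentifier:"{r}"' for r in ror_ids]
    params = {"query": " OR ".join(clauses), "page[size]": page_size}
    return DATACITE_API + "?" + urllib.parse.urlencode(params)
```

Scoping by explicit ROR references, rather than DOI prefixes, keeps the query self-documenting: the curated list is the single place where the harvest boundary is defined.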
While accessing DataCite metadata allows the Helmholtz KG to integrate widely used, DOI-based metadata that is already curated and maintained by repositories and data providers, we recognize that our current methods are restrictive: many records may reference Helmholtz centers in textual affiliation fields without including a corresponding ROR identifier and are therefore currently not captured. Future development of the DataCite harvester will aim to refine this strategy by incorporating additional knowledge about Helmholtz-related infrastructures, repositories, and DOI prefixes. These improvements will allow the pipeline to identify relevant records more comprehensively while maintaining a well-defined scope for the Helmholtz KG.
DataCite provides a native Schema.org mapping via the Bolognese mapping tool. Instead of defining a custom mapping, the KG includes this mapped version in the harvesting process.
ROR-API:
The KG harvests organization metadata from the ROR API based on a curated list of Helmholtz ROR identifiers. Records are retrieved in the native ROR schema and include standardized information on institutions and their relationships. A custom mapping layer transforms this data into the internal KG model.
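The retrieval step can be sketched as below. The v2 API route and the `ror_display` name type follow the public ROR API; the helper names and the sample record are illustrative:

```python
ROR_API = "https://api.ror.org/v2/organizations"

def ror_record_url(ror_id):
    """Map a ROR identifier (full URL or bare ID) to its v2 API record URL."""
    bare = ror_id.rsplit("/", 1)[-1]
    return f"{ROR_API}/{bare}"

def extract_display_name(record):
    """Pull the display name from a ROR v2 record; in the v2 schema each entry
    in 'names' is tagged with one or more types such as 'ror_display'."""
    for name in record.get("names", []):
        if "ror_display" in name.get("types", []):
            return name.get("value")
    return None
```

Fetching each URL and feeding the JSON through a mapping layer like `extract_display_name` (extended to relationships, addresses, etc.) is the shape of the custom transformation into the internal KG model.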
Indico-API:
Event metadata is harvested from Indico instances used across Helmholtz via their REST APIs. The retrieved records follow the native Indico data structure, describing events such as conferences, workshops, and seminars. This metadata is integrated into the Helmholtz KG through a custom mapping process, aligning event information with the internal data model and linking it to related entities.
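A category export request against Indico's HTTP export API can be sketched as follows. The `/export/categ/<id>.json` route is part of that API, but the exact parameters vary by instance and version; the base URL, helper names, and sample payload are illustrative:

```python
import urllib.parse

def build_category_export_url(base_url, category_id, from_date=None, to_date=None):
    """Build an Indico HTTP export request for all events in a category.
    Check the parameters against the target instance's API documentation."""
    params = {}
    if from_date:
        params["from"] = from_date
    if to_date:
        params["to"] = to_date
    url = f"{base_url.rstrip('/')}/export/categ/{category_id}.json"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url

def extract_events(payload):
    """Keep a minimal (id, title, start date) view of each event record;
    the real mapping would cover far more of the native Indico structure."""
    return [(e.get("id"), e.get("title"), (e.get("startDate") or {}).get("date"))
            for e in payload.get("results", [])]
```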
Sitemap URLs:
In addition to harvesting metadata from APIs and repository endpoints, the Helmholtz Knowledge Graph harvests metadata from the web by crawling website sitemaps. This approach enables the KG to harvest metadata from infrastructures that publish structured web metadata but do not expose dedicated APIs, while relying on widely adopted semantic web standards designed for interoperability and machine readability.
We recommend sitemaps as the designated way to connect repositories to the Helmholtz KG because: (A) exposing metadata embedded in HTML `<script>` headers increases individual visibility on the web, since it will also be found by Google and other web crawlers independently of the integration and representation by the Helmholtz KG; and (B) while crawling is typically slower and less controlled, it is generally more stable and independent of changes and updates to the endpoint itself.
Details
How does sitemap crawling of web-metadata work? XML sitemaps provide structured lists of URLs that belong to a website and are typically intended for search engines to discover pages efficiently. The crawling pipeline retrieves these sitemap files, iterates over the listed URLs, and visits the corresponding pages — most often the landing pages of digital assets such as datasets, publications, or software records.
During this process, the crawler scans the HTML of each page for embedded Schema.org metadata expressed as JSON-LD.
This metadata is typically included in a `<script type="application/ld+json">` tag within the page source. The JSON-LD block contains a structured description of the resource using Schema.org vocabularies, including properties such as identifiers, creators, licenses, and links to related entities. When such metadata is detected, the crawler extracts the JSON-LD content, which can then be validated and converted into RDF representations that can be integrated into the Helmholtz KG.
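The two steps described above — listing URLs from a sitemap and extracting embedded JSON-LD from each page — can be sketched with the standard library alone. The class and function names are illustrative, not the crawler's actual API:

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """List the <loc> entries of a standard XML sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iterfind(".//sm:loc", SITEMAP_NS)]

class JSONLDExtractor(HTMLParser):
    """Collect parsed <script type="application/ld+json"> blocks from a page."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.blocks.append(json.loads(data))

def extract_jsonld(html_text):
    """Return every JSON-LD block embedded in the given HTML page."""
    parser = JSONLDExtractor()
    parser.feed(html_text)
    return parser.blocks
```

In the real pipeline each URL from `sitemap_urls` would be fetched over HTTP and its HTML passed to `extract_jsonld`; the resulting JSON-LD can then be validated and converted to RDF.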
Data Storage (RAW Records)
Following the harvesting stage, all records are stored within a PostgreSQL database to preserve record provenance and allow future re-mapping (e.g. when extending the data model). Every record is assigned a PURL within our designated namespace, purls.helmholtz-metadaten.de/helmholtzkg_api, to ensure that it can be globally uniquely identified. The raw data is available to the public through the Data Storage API.
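The storage step can be sketched as below; sqlite3 stands in for PostgreSQL, and both the table layout and the PURL-minting scheme are illustrative assumptions, not the KG's documented implementation:

```python
import hashlib
import sqlite3  # stand-in for PostgreSQL in this sketch

PURL_NAMESPACE = "https://purls.helmholtz-metadaten.de/helmholtzkg_api"

def mint_purl(source, source_record_id):
    """Derive a stable, globally unique PURL for a raw record. Hashing the
    source name plus the native record ID is an illustrative choice only."""
    digest = hashlib.sha256(f"{source}:{source_record_id}".encode()).hexdigest()[:16]
    return f"{PURL_NAMESPACE}/{digest}"

def store_raw_record(conn, purl, source, payload):
    """Store the record exactly as harvested, so later re-mappings can always
    start from the original form (preserving provenance)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_records ("
        "purl TEXT PRIMARY KEY, source TEXT, payload TEXT)"
    )
    conn.execute("INSERT OR REPLACE INTO raw_records VALUES (?, ?, ?)",
                 (purl, source, payload))
    conn.commit()
```

Keeping the payload column opaque (original XML or JSON, untransformed) is what makes post-hoc re-mapping possible when the data model is extended.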
