Data Pipelines
The Helmholtz KG harvests semantically heterogeneous metadata from various sources using a wide range of methods. Once harvested into a storage layer, data processing, including mapping, validation, and enrichment, ensures semantic integration into the Helmholtz KG: metadata from heterogeneous sources is consistently structured, reliably identified, and semantically aligned, enabling meaningful connections between entities such as datasets, publications, organizations, and researchers. In a final step, this integrated data is injected into a graph database as RDF triples, which form the ground-truth source of knowledge for the Helmholtz KG. The graph setup and the provision of the various access interfaces are described in Graph Setup & Interfaces.
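To illustrate what semantic integration produces, the following sketch builds a few RDF triples with Python's rdflib library. The namespace, identifiers, and property choices are hypothetical placeholders for illustration, not the vocabulary actually used by the Helmholtz KG.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, DCTERMS

# Hypothetical namespace for illustration only; the Helmholtz KG's
# actual vocabulary and identifier scheme may differ.
EX = Namespace("https://example.org/kg/")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("ex", EX)

dataset = URIRef("https://example.org/dataset/42")         # placeholder dataset ID
publication = URIRef("https://example.org/publication/7")  # placeholder publication ID
org = URIRef("https://example.org/organization/xyz")       # placeholder organization ID

# Consistently structured, reliably identified statements linking entities.
g.add((dataset, RDF.type, EX.Dataset))
g.add((dataset, DCTERMS.title, Literal("Example observation dataset")))
g.add((dataset, DCTERMS.publisher, org))
g.add((publication, DCTERMS.references, dataset))

print(g.serialize(format="turtle"))
```

In practice, reliable identification would rely on persistent identifiers (for example DOIs for datasets and ROR IDs for organizations) rather than the example.org URIs used here.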
The Helmholtz KG operates a fully automated weekly pipeline managed by Apache Airflow. Processes are deployed for the following stages (a minimal DAG sketch follows the list):
- Harvesting publicly available metadata from [Helmholtz sources](docs/DataProv/Data Sources.mdx),
- Processing metadata within our internal system (map, validate & enrich source data),
- Assembling the Knowledge Graph by injecting triples into our graph database and indexing them to optimize search performance.
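The sketch below shows how these three stages could be wired together as an Airflow DAG on a weekly schedule. Only the weekly cadence and the harvest → process → assemble ordering come from the description above; the DAG id, task names, and task bodies are hypothetical placeholders (assuming Airflow 2.x).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def harvest():
    """Pull publicly available metadata from the configured sources (placeholder)."""

def process():
    """Map, validate, and enrich the harvested metadata (placeholder)."""

def assemble():
    """Inject RDF triples into the graph database and refresh indexes (placeholder)."""

with DAG(
    dag_id="helmholtz_kg_pipeline",  # hypothetical DAG id
    schedule="@weekly",              # the pipeline runs on a weekly cadence
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    harvest_task = PythonOperator(task_id="harvest", python_callable=harvest)
    process_task = PythonOperator(task_id="process", python_callable=process)
    assemble_task = PythonOperator(task_id="assemble", python_callable=assemble)

    # Stages run strictly in sequence: harvest -> process -> assemble.
    harvest_task >> process_task >> assemble_task
```

Modeling each stage as a separate task keeps failures isolated: a failed enrichment run can be retried without re-harvesting the sources.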
The following diagram shows how metadata flows through each transformation stage, from external sources to the RDF Graph:
