Data Model

The HKG (Helmholtz Knowledge Graph) internal data model is designed to ingest, validate, and transform heterogeneous metadata into a consistent, interoperable RDF representation. It is built around Pydantic models, uses the schema.org vocabulary, and produces valid JSON-LD output, enabling semantic interoperability and integration into linked data ecosystems.

Implementation with Pydantic Models

At the core of the HKG data model are Pydantic models, which serve as the primary mechanism for:

Validating incoming data
Enforcing structural consistency
Ensuring a minimum level of data quality

Key Characteristics:

Strict typing: Each field is defined with explicit types (e.g., strings, URLs, nested objects).
Custom validation: Advanced constraints ensure semantic correctness.
Error handling:
- Invalid fields are dropped selectively, not globally (i.e., the remaining parts of the record are still processed).
- All validation issues are logged, allowing traceability without losing valid data.

This approach ensures that usable data is preserved, even when parts of the input are malformed.
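
The selective-drop behaviour described above can be sketched as follows. This is a minimal illustration, not the HKG implementation: the `DatasetRecord` model, its fields, the `validate_leniently` helper, and the logger name are all hypothetical.

```python
import logging
from typing import Optional

from pydantic import BaseModel, HttpUrl, ValidationError

logger = logging.getLogger("hkg.validation")  # hypothetical logger name


class DatasetRecord(BaseModel):
    """Hypothetical model; the fields are illustrative, not the HKG schema."""

    name: str
    url: Optional[HttpUrl] = None
    version: Optional[str] = None


def validate_leniently(raw: dict) -> Optional[DatasetRecord]:
    """Drop only the fields that fail validation, log each issue, keep the rest."""
    data = dict(raw)
    while True:
        try:
            return DatasetRecord(**data)
        except ValidationError as exc:
            errors = exc.errors()
            for err in errors:
                logger.warning("validation issue at %s: %s", err["loc"], err["msg"])
            # Top-level field names that failed and are still present in the input.
            bad = {err["loc"][0] for err in errors if err["loc"]}
            removable = [key for key in bad if key in data]
            if not removable:
                # Nothing left to drop (e.g. a required field is missing).
                return None
            # Each pass removes at least one key, so the loop terminates.
            for key in removable:
                del data[key]
```

Called with `{"name": "x", "url": "not a url", "version": "1.0"}`, this sketch logs the URL problem, drops `url`, and still returns a record with `name` and `version` intact.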

Semantics

Schema.org

The HKG data model is primarily based on the schema.org vocabulary, a widely adopted standard for structured data on the web.

But why schema.org?

Web standard: Supported by major search engines and platforms
Broad adoption: Used across domains, increasing interoperability
Graph-based model: Naturally compatible with knowledge graphs and JSON-LD
Extensibility: Can be combined with other ontologies

Using schema.org helps make HKG data FAIR (Findable, Accessible, Interoperable, Reusable) and compatible with external systems.
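
To make the JSON-LD connection concrete, here is a small sketch of how a validated record could be serialized with a schema.org context. The `to_jsonld` helper and the sample record are assumptions for illustration; the actual HKG serializer may work differently.

```python
import json


def to_jsonld(record: dict, schema_type: str) -> dict:
    """Wrap a flat record in a minimal schema.org JSON-LD envelope."""
    doc = {"@context": "https://schema.org", "@type": schema_type}
    # Skip empty values so the output only carries real statements.
    doc.update({key: value for key, value in record.items() if value is not None})
    return doc


# Hypothetical record; the property names are schema.org terms.
record = {"name": "Ocean temperature series", "license": None, "version": "1.0"}
print(json.dumps(to_jsonld(record, "Dataset"), indent=2))
```

Because the property names are plain schema.org terms, any JSON-LD processor can expand them to full IRIs without HKG-specific knowledge.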

Semantic Extensions

To cover domain-specific requirements beyond schema.org, the model integrates:

PROV-O (Provenance Ontology)
→ For representing provenance, derivation, and lineage of data
Helmholtz Digitisation Ontology (HDO)
→ For Helmholtz-specific extensions and metadata
DDI Discovery and DataCite
→ For instrument types, to ensure domain-specific precision

These additional vocabularies allow modeling of specialized attributes not available in schema.org while maintaining semantic rigor.
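
One way additional vocabularies can sit alongside schema.org in JSON-LD is through prefixed terms in the `@context`. The sketch below assumes this approach; the `with_provenance` helper is hypothetical, and the `hdo` IRI is a placeholder rather than the real HDO namespace. The PROV-O namespace and `prov:wasDerivedFrom` are the standard W3C ones.

```python
import json

CONTEXT = [
    "https://schema.org",
    {
        "prov": "http://www.w3.org/ns/prov#",  # W3C PROV-O namespace
        "hdo": "https://example.org/hdo#",  # placeholder, NOT the real HDO IRI
    },
]


def with_provenance(node: dict, source_iri: str) -> dict:
    """Return a copy of the node that records which source it was derived from."""
    enriched = dict(node)
    enriched["@context"] = CONTEXT
    enriched["prov:wasDerivedFrom"] = {"@id": source_iri}
    return enriched


node = {"@type": "Dataset", "name": "Ocean temperature series"}
print(json.dumps(with_provenance(node, "https://example.org/source/42"), indent=2))
```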

Typing

The HKG data model may aggregate multiple schema.org types into a single internal representation.

For instance, the SoftwareModel combines:

schema.org/SoftwareApplication
schema.org/SoftwareSourceCode

Note:

While most model types are derived only from schema.org, the Instrument model uses specialized external vocabularies to ensure domain-specific precision (see Semantic Extensions).

Types unknown to the HKG model are preserved and added to the graph as-is.
This ensures no information loss, even for unsupported or future extensions.
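
The type handling described in this section can be pictured as a lookup from schema.org types to internal models, with a pass-through path for everything else. This is only a sketch: the mapping table, the `DatasetModel` name, and the helper functions are hypothetical, and models are referred to by name to keep the example self-contained.

```python
from typing import List, Optional, Tuple

# Several schema.org types can map onto one internal model.
MODEL_FOR_TYPE = {
    "SoftwareApplication": "SoftwareModel",
    "SoftwareSourceCode": "SoftwareModel",
    "Dataset": "DatasetModel",  # hypothetical model name
}


def as_type_list(node: dict) -> List[str]:
    """JSON-LD allows @type to be a single string or a list of strings."""
    raw = node.get("@type", [])
    return [raw] if isinstance(raw, str) else list(raw)


def pick_model(node: dict) -> Optional[str]:
    """Return the internal model name for a node, or None if the type is unknown."""
    for type_name in as_type_list(node):
        if type_name in MODEL_FOR_TYPE:
            return MODEL_FOR_TYPE[type_name]
    return None


def ingest(nodes: List[dict]) -> Tuple[List[Tuple[str, dict]], List[dict]]:
    """Split nodes into those routed to a model and those preserved as-is."""
    validated, passthrough = [], []
    for node in nodes:
        model = pick_model(node)
        if model is None:
            passthrough.append(node)  # unknown type: kept unchanged
        else:
            validated.append((model, node))
    return validated, passthrough
```

In this sketch, a node typed `SoftwareSourceCode` and one typed `SoftwareApplication` both route to `SoftwareModel`, while a node with an unrecognized type lands in `passthrough` unmodified.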

JSON Schema Generation

The data model is formalized as JSON Schema documents, one per model, which define the expected structure and constraints of valid input data.

Schemas are automatically generated from the Pydantic models and stored in our codebase in order to:

Provide a machine-readable contract for input data
Enable external validation before ingestion
Support integration with tools and pipelines
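
As a rough illustration of schema generation, Pydantic can emit a JSON Schema directly from a model class. The sketch assumes Pydantic v2, where the call is `model_json_schema()` (Pydantic v1 exposes `schema()` instead). The fields shown on `SoftwareModel` are illustrative and not the actual HKG definition.

```python
import json
from typing import Optional

from pydantic import BaseModel, HttpUrl


class SoftwareModel(BaseModel):
    """Illustrative fields only; the real HKG model is more extensive."""

    name: str
    codeRepository: Optional[HttpUrl] = None


# Pydantic v2 API; returns a plain dict that can be written to disk.
schema = SoftwareModel.model_json_schema()
print(json.dumps(schema, indent=2))
```

The resulting dictionary lists `name` as required and describes `codeRepository` as an optional URI, which is the kind of contract an external validator can check before ingestion.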

Note:

Example JSON records that validate against our JSON Schema documents and Pydantic pipeline can be found in the Record Example section.
