Data Processing
During internal processing, the source metadata is transformed into the Helmholtz Knowledge Graph (HKG) data model through a dedicated mapping and validation stage, which improves the consistency and connectivity of the ingested data.
The internal model is implemented using Pydantic models and JSON Schema, and is largely derived from Schema.org types. This ensures both structural consistency and interoperability with widely adopted standards. A custom Python-based mapping engine retrieves the harvested records from the data storage layer and converts them into this unified representation as described in the following.
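To illustrate the approach, a schema.org-derived Pydantic model might look like the following sketch. The class and field definitions here are illustrative only, not the actual HKG model code:

```python
from typing import Optional
from pydantic import BaseModel, AnyHttpUrl

# Illustrative sketch: fields mirror schema.org properties, but these
# are NOT the actual HKG model definitions.
class Organization(BaseModel):
    id: Optional[AnyHttpUrl] = None  # serialized as "@id" in JSON-LD
    name: str

class Person(BaseModel):
    id: Optional[AnyHttpUrl] = None
    name: str
    givenName: Optional[str] = None
    familyName: Optional[str] = None
    affiliation: Optional[Organization] = None

# Pydantic enforces structural consistency on instantiation.
p = Person(name="Ada Lovelace", affiliation=Organization(name="HZB"))
```

Because validation happens at construction time, malformed records are rejected (or repaired) before they ever reach the graph.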
Mapping
The HKG data model is built on schema.org-based Pydantic models, which means that all harvested data needs to be transformed into a schema.org-compliant representation. Ideally, incoming data already follows the schema.org vocabulary and can be ingested into the data model without significant transformation. In practice, however, even data that is formally based on schema.org (e.g. data obtained via the Sitemap harvester) often requires additional normalization to meet the stricter validation requirements of the Pydantic models.
When data originates from other schemas, a best-effort mapping is applied to translate the source structures into schema.org and thus the Helmholtz KG data model. This mapping is typically tailored to the specific data source and aims to find the closest semantic equivalents, even if some approximation is necessary. Looking ahead, this process will be further standardized through the use of SSSOM mapping files in upcoming updates.
The following detailed mappings are currently in use:
DataCite
DataCite metadata is already available in schema.org format, so only minor modifications are needed here. Since there is no class type `Instrument` in schema.org, the DataCite field `resourceTypeGeneral` is checked for the value `Instrument` so that instruments can still be identified.
Details
- DataCite provides keywords as a comma-separated string. This string is split at `,` and transformed into a real list.
- For `@id` properties of `author` entities, the proper domain for ORCIDs and RORs is added if missing.
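These two normalizations can be sketched as follows; the helper names and the ORCID/ROR detection patterns are illustrative assumptions, not the actual mapping-engine code:

```python
import re

def split_keywords(keywords):
    """Split a DataCite comma-separated keyword string into a real list."""
    if isinstance(keywords, str):
        return [k.strip() for k in keywords.split(",") if k.strip()]
    return keywords

def ensure_pid_domain(identifier):
    """Prefix bare ORCID / ROR identifiers with their resolver domain.

    Heuristic sketch: a bare ORCID looks like 0000-0002-1825-0097,
    a bare ROR id like 04t3en479. Real detection may differ.
    """
    if identifier.startswith("http"):
        return identifier  # already a full URL
    if re.fullmatch(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]", identifier):
        return f"https://orcid.org/{identifier}"
    if re.fullmatch(r"0[a-z0-9]{8}", identifier):
        return f"https://ror.org/{identifier}"
    return identifier
```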
Indico
Indico delivers data in its own bespoke schema. Consequently, incoming data needs to be mapped to the schema.org-based Helmholtz KG data model.
Every entity found via Indico is assumed to be of class type `Event`.
Details
- `title` is mapped to schema.org property `name`.
- `address` is mapped to schema.org property `location`.
- `chairs` and `creator` are mapped to schema.org property `contributor`:
  - All entities are assumed to be of class type `Person`.
  - `fullName` is mapped to schema.org property `name`.
  - `last_name` is mapped to schema.org property `familyName`.
  - `first_name` is mapped to schema.org property `givenName`.
  - `affiliation` is mapped to a schema.org entity of type `Organization` with the given string as its name.
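A best-effort sketch of this mapping in Python; the function names and the exact shape of the Indico payload (e.g. `chairs` and `creator` as lists of dicts) are assumptions:

```python
def map_indico_person(raw):
    """Map one Indico chair/creator record onto schema.org keys (sketch)."""
    person = {"@type": "Person"}
    if "fullName" in raw:
        person["name"] = raw["fullName"]
    if "last_name" in raw:
        person["familyName"] = raw["last_name"]
    if "first_name" in raw:
        person["givenName"] = raw["first_name"]
    if raw.get("affiliation"):
        # A plain affiliation string becomes an Organization entity.
        person["affiliation"] = {"@type": "Organization",
                                 "name": raw["affiliation"]}
    return person

def map_indico_event(raw):
    """Map an Indico record; every Indico entity is assumed an Event."""
    event = {"@type": "Event"}
    if "title" in raw:
        event["name"] = raw["title"]
    if "address" in raw:
        event["location"] = raw["address"]
    people = raw.get("chairs", []) + raw.get("creator", [])
    contributors = [map_indico_person(p) for p in people]
    if contributors:
        event["contributor"] = contributors
    return event
```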
OAI
The mapping for data found via OAI-PMH depends heavily on the provided metadata schema.
For now, only Dublin Core and Marc21 are available. Incoming data is mapped to the schema.org-based Helmholtz KG data model as described in the following.
If no class type can be inferred, `CreativeWork` is assumed.
Details
Marc21:
- Fields 100a, 110a, 111a, and 700a are mapped to schema.org property `author` as a `name` of an entity of class type `Person`.
- Fields 260c and 264c are mapped to schema.org property `dateCreated`.
- Fields 245a and 245b are concatenated and mapped to schema.org property `name`.
- Fields 260b and 264b are mapped to schema.org property `publisher`.
- Fields 24 are mapped to schema.org property `identifier`.
- Fields 540f are mapped to schema.org property `license`.
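The Marc21 field mappings above can be expressed as a lookup table. This sketch assumes a flattened `{field: value}` record and covers only a subset of the listed fields:

```python
# Sketch: MARC field/subfield -> schema.org property, per the list above.
MARC21_FIELD_MAP = {
    "100a": "author", "110a": "author", "111a": "author", "700a": "author",
    "260c": "dateCreated", "264c": "dateCreated",
    "245a": "name", "245b": "name",            # 245a + 245b are concatenated
    "260b": "publisher", "264b": "publisher",
    "540f": "license",
}

def map_marc21(record):
    """Map a flat {field: value} Marc21 record onto schema.org keys (sketch)."""
    out = {}
    for field, value in record.items():
        prop = MARC21_FIELD_MAP.get(field)
        if prop is None:
            continue  # unmapped fields are skipped
        if prop == "name" and "name" in out:
            out["name"] = f"{out['name']} {value}"   # concatenate title parts
        elif prop == "author":
            out.setdefault("author", []).append(
                {"@type": "Person", "name": value})
        else:
            out[prop] = value
    return out
```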
Dublin Core:
- `type` entries are used to infer a schema.org class type via a mapping table.
- `creator` entries are mapped to schema.org property `author` as a `name` of an entity of class type `Person`.
- `date` entries are mapped to schema.org property `dateCreated`.
- `publisher` entries are mapped to schema.org property `publisher`.
- `identifier` entries are mapped to schema.org property `identifier`.
- `license` entries are mapped to schema.org property `license`.
- `description` entries are concatenated and mapped to schema.org property `description`.
- `subject` entries are mapped to schema.org property `keywords`.
ROR
ROR provides data using its own bespoke schema, which needs to be mapped to the schema.org-based Helmholtz KG data model.
Details
- `types` entries are mapped to a schema.org class type via a mapping table.
- `names` entries having the attribute `ror_display` are taken as schema.org property `name`. Remark: the first such entry is also taken as HDO property `prefName`.
- `names` entries having at least one of the attributes `label`, `alias`, or `acronym` are taken as schema.org property `alternateName`.
- `external_ids` entries are mapped to schema.org property `identifier` as an entity of class type `PropertyValue`.
- `relationships` entries having the attribute `parent` are mapped to schema.org property `parentOrganization` as an entity of class type `Organization`.
- `relationships` entries having the attribute `child` are mapped to schema.org property `subOrganization` as an entity of class type `Organization`.
- `links` entries having the type `website` are taken as schema.org property `url`.
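The handling of `names` entries can be sketched as follows, assuming the ROR v2 record layout in which each entry carries a `value` and a list of `types`; this is an illustration, not the actual mapping code:

```python
def map_ror_names(names):
    """Distribute ROR `names` entries onto schema.org properties (sketch).

    Assumes ROR v2 entries of the form {"value": ..., "types": [...]}.
    """
    out = {}
    alternates = []
    for entry in names:
        types = set(entry.get("types", []))
        if "ror_display" in types and "name" not in out:
            out["name"] = entry["value"]      # first ror_display entry wins
        if types & {"label", "alias", "acronym"}:
            alternates.append(entry["value"])
    if alternates:
        out["alternateName"] = alternates
    return out
```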
Schema.org (Sitemap)
Sitemap metadata is already available in schema.org format, so only minor modifications are required for integration.
Details
- If `description` contains a list of descriptions, these are joined into one single description.
- If keywords are provided as a comma-separated string, it is split at `,` and transformed into a real list.
- Some data sources provide people's affiliation as a plain string, which is not compliant with schema.org. In that case the string is used as the name of a proper affiliation object.
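A sketch of these sitemap normalizations; the function name and the exact record layout are assumptions:

```python
def normalize_sitemap_record(record):
    """Apply the minor schema.org normalizations described above (sketch)."""
    rec = dict(record)
    # A list of descriptions is joined into one single description.
    if isinstance(rec.get("description"), list):
        rec["description"] = " ".join(rec["description"])
    # A comma-separated keyword string becomes a real list.
    if isinstance(rec.get("keywords"), str):
        rec["keywords"] = [k.strip() for k in rec["keywords"].split(",")
                           if k.strip()]
    # A plain-string affiliation becomes a proper Organization object.
    # (For brevity this mutates the nested author entries in place.)
    for person in rec.get("author", []):
        if isinstance(person.get("affiliation"), str):
            person["affiliation"] = {"@type": "Organization",
                                     "name": person["affiliation"]}
    return rec
```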
Data Validation
Input data handling
Any JSON-LD document can be ingested for the mapping and validation process. However, it is preferred to use JSON-LD which conforms to the defined JSON Schemas.
Handling @id
When mapping entities, the HKG data model enforces globally unique and valid identifiers using the following logic:
| Type of ID | Handling |
|---|---|
| 1. Absolute URL provided | If the input contains a valid absolute URL (validated via Pydantic's `AnyHttpUrl`): → it is used directly as `@id` after further normalization. |
| 2. Relative identifier provided | If a non-URL identifier is given: → it is prefixed with the HKG Graph Prefix (`purls.helmholtz-metadaten.de/helmholztkg/`). → Result: a globally unique, resolvable identifier. |
| 3. No identifier provided | Fallback: → a persistent identifier (PID) is generated automatically (see ID Generation). |
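The three cases can be sketched as follows. The real implementation validates via Pydantic's `AnyHttpUrl`; this sketch uses only the standard library, and the `https://` scheme on the prefix is an assumption:

```python
from urllib.parse import urlparse

HKG_PREFIX = "https://purls.helmholtz-metadaten.de/helmholztkg/"

def resolve_id(raw_id):
    """Sketch of the three-way @id logic from the table above."""
    if raw_id is None:
        return None  # case 3: caller falls back to automatic ID generation
    parsed = urlparse(raw_id)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return raw_id              # case 1: valid absolute URL (then normalized)
    return HKG_PREFIX + raw_id     # case 2: prefix with the HKG Graph Prefix
```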
ID Generation
The goal of ID generation is:
- Same entities → same ID
- Different entities → different ID
However, this is inherently challenging: without an existing identifier,
it is difficult to determine whether two entities are identical. The HKG model therefore implements a best-effort deterministic approach in which an ID is derived by hashing and made dereferenceable as an IRI under our HKG Graph Prefix (`purls.helmholtz-metadaten.de/helmholztkg/`).
- ID generation is always based on the input data of the entity
- For nested entities:
- The PID also incorporates the first parent element that has a valid PID
- If no parent has an ID: → The entire input document is used as the basis
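A minimal sketch of such deterministic ID generation; the hash algorithm, digest length, and key layout are illustrative, not the actual HKG algorithm:

```python
import hashlib
import json

HKG_PREFIX = "https://purls.helmholtz-metadaten.de/helmholztkg/"

def generate_id(entity, parent_id=None):
    """Best-effort deterministic ID: same input -> same ID (sketch).

    The hash basis covers the entity's own input data plus, for nested
    entities, the first parent element with a valid PID.
    """
    basis = {"entity": entity, "parent": parent_id}
    digest = hashlib.sha256(
        json.dumps(basis, sort_keys=True, ensure_ascii=False).encode("utf-8")
    ).hexdigest()
    return HKG_PREFIX + digest[:16]
```

Serializing with `sort_keys=True` makes the hash independent of key order, so identical input data always yields the same IRI.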
ID generation can be customized per model type.
For these models:

- `DataCatalogModel`
- `InstitutionModel`

the ID generation is based only on the entity's own input data. This means the parent context is ignored.
PID normalization
Where valid absolute URLs are provided, they are used directly as the @id of a node.
Due to variations in upstream data sources, known persistent identifiers (PIDs) do not always enter the Helmholtz KG in
a harmonized form. To prevent unintended duplication of entities, we apply normalization steps to these identifiers
during processing. These steps are restricted to well-defined PID systems (e.g., DOI, ROR, ORCID), where normalization
does not alter the canonical meaning of the identifier. For general IRIs that may include non-ASCII characters,
such transformations are avoided to preserve their original semantics.
The following normalization steps are carried out:
- Enforcement of HTTPS. Example: `http://doi.org/10.XXXX/any_identifier` → `https://doi.org/10.XXXX/any_identifier`
- Removal of the `www.` subdomain. Example: `https://www.doi.org/10.XXXX/any_identifier` → `https://doi.org/10.XXXX/any_identifier`
- Normalization of capitalization per PID system:

  | PID type | Capitalization |
  |---|---|
  | DOI | lowercase |
  | ROR | lowercase |
  | ORCID | uppercase |

  Example: `https://doi.org/10.XXXX/Any_Identifier` → `https://doi.org/10.XXXX/any_identifier`
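These three steps can be sketched as follows; the function name and the string-based PID detection are illustrative assumptions:

```python
import re

def normalize_pid(url):
    """Apply the three normalization steps for known PID systems (sketch)."""
    url = re.sub(r"^http://", "https://", url)       # 1. enforce HTTPS
    url = url.replace("https://www.", "https://")    # 2. drop www. subdomain
    # 3. normalize capitalization per PID system
    if url.startswith(("https://doi.org/", "https://ror.org/")):
        url = url.lower()
    elif url.startswith("https://orcid.org/"):
        _, _, path = url.partition("https://orcid.org/")
        url = "https://orcid.org/" + path.upper()
    return url
```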
Invalid Data
If a specific attribute value is invalid, only that value is discarded. The rest of the entity remains intact, and all issues are logged for traceability.
➡️ This ensures robustness and prevents unnecessary data loss.
Data Enrichment
The mapping process applies lightweight semantic enrichment. This includes automated type inference, where missing entity
types are derived from the usage of specific attributes based on Schema.org domain and range definitions.
For example, the presence of an attribute such as `affiliation` may lead to the classification of an entity as a `Person`, since this property is unique to that class type.
This improves the overall structure and queryability of the graph.
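Such attribute-based type inference can be sketched as follows; the property-to-type table shown here is a tiny illustrative subset of schema.org's domain definitions, not the actual HKG mapping table:

```python
# Sketch: infer a missing @type from properties whose schema.org domain
# is unambiguous. Illustrative subset only.
TYPE_HINTS = {
    "affiliation": "Person",            # schema.org domainIncludes: Person
    "familyName": "Person",
    "parentOrganization": "Organization",
}

def infer_type(entity):
    """Return a copy with an @type added if missing and a hint is present."""
    if "@type" not in entity:
        for prop, inferred in TYPE_HINTS.items():
            if prop in entity:
                return {**entity, "@type": inferred}
    return entity
```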
