Data Processing

During our internal processing, the source metadata is transformed into the Helmholtz Knowledge Graph (HKG) data model through a dedicated mapping and validation stage, which improves the consistency and connectivity of the ingested data.

The internal model is implemented using Pydantic models and JSON Schema, and is largely derived from Schema.org types. This ensures both structural consistency and interoperability with widely adopted standards. A custom Python-based mapping engine retrieves the harvested records from the data storage layer and converts them into this unified representation as described in the following.
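
As a rough illustration of this approach, a schema.org-derived Pydantic model might look like the sketch below. The class and field names are a minimal invented subset for illustration, not the actual HKG models:

```python
# Illustrative sketch only; the real HKG models are more extensive.
from typing import Optional, Union

from pydantic import AnyHttpUrl, BaseModel, Field


class PersonModel(BaseModel):
    """Minimal schema.org Person (hypothetical field subset)."""
    id: Optional[Union[AnyHttpUrl, str]] = Field(default=None, alias="@id")
    type: str = Field(default="Person", alias="@type")
    name: Optional[str] = None
    givenName: Optional[str] = None
    familyName: Optional[str] = None


class CreativeWorkModel(BaseModel):
    """Minimal schema.org CreativeWork (hypothetical field subset)."""
    id: Optional[Union[AnyHttpUrl, str]] = Field(default=None, alias="@id")
    type: str = Field(default="CreativeWork", alias="@type")
    name: Optional[str] = None
    author: list[PersonModel] = Field(default_factory=list)
    keywords: list[str] = Field(default_factory=list)


# Validation enforces structural consistency at ingest time.
record = CreativeWorkModel.model_validate(
    {"@id": "https://doi.org/10.1234/example", "name": "Example dataset"}
)
print(record.model_dump(by_alias=True, exclude_none=True))
```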

Mapping

The HKG data model is built on schema.org-based Pydantic models, which means that all harvested data needs to be transformed into a schema.org-compliant representation. In the ideal case, incoming data already follows the schema.org vocabulary and can be ingested into the data model without significant transformation. In practice, however, even data that is formally based on schema.org (e.g., records obtained via the Sitemap harvester) often requires additional normalization to meet the stricter validation requirements of the Pydantic models.

When data originates from other schemas, a best-effort mapping approach is applied to translate the source structures into schema.org and thus into the Helmholtz KG data model. This mapping is typically tailored to the specific data source and aims to find the closest semantic equivalents, even if some approximation is necessary. Looking ahead, this process will be further standardized through the use of SSSOM mapping files in upcoming updates.

The following detailed mappings are currently in use:

DataCite

DataCite metadata is already available in schema.org format, so only minor modifications are needed here. Since schema.org does not define a class type Instrument, the DataCite field resourceTypeGeneral is checked for the value Instrument so that instruments can still be identified.

Details
  1. DataCite provides keywords as a comma-separated string. This string is split at each comma and transformed into a proper list.
  2. For @id properties of author entities, the proper domain for ORCIDs and RORs is added if missing.
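
A minimal sketch of these two normalizations; the function names and the ORCID/ROR heuristic are illustrative assumptions, not the engine's actual code:

```python
# Hypothetical helpers for the two DataCite normalizations listed above.

def split_keywords(keywords: str) -> list[str]:
    """Split a comma-separated keyword string into a proper list."""
    return [kw.strip() for kw in keywords.split(",") if kw.strip()]


def expand_author_id(raw_id: str) -> str:
    """Prefix bare ORCID/ROR values with their resolver domain if missing."""
    if raw_id.startswith("http"):
        return raw_id  # already a full URL
    # Heuristic (assumption): ORCIDs consist of four dash-separated blocks,
    # e.g. 0000-0002-1825-0097; bare ROR IDs contain no dashes.
    if len(raw_id.split("-")) == 4:
        return f"https://orcid.org/{raw_id}"
    return f"https://ror.org/{raw_id}"


print(split_keywords("climate, ocean , remote sensing"))
# -> ['climate', 'ocean', 'remote sensing']
print(expand_author_id("0000-0002-1825-0097"))
# -> https://orcid.org/0000-0002-1825-0097
```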

Indico

Indico delivers data in its own bespoke schema. Consequently, incoming data needs to be mapped to schema.org / the Helmholtz KG data model. Every entity found via Indico is assumed to be of class type Event.

Details
  1. title is mapped to schema.org property name.
  2. address is mapped to schema.org property location.
  3. chairs and creator are mapped to schema.org property contributor:
    1. All entities are assumed to be of class type Person.
    2. fullName is mapped to schema.org property name.
    3. last_name is mapped to schema.org property familyName.
    4. first_name is mapped to schema.org property givenName.
    5. affiliation is mapped to a schema.org entity of type Organization with the given string as its name.
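
A sketch of this mapping, assuming chairs and creator arrive as lists of person records; the function name is hypothetical:

```python
# Hypothetical sketch of the Indico → schema.org mapping described above.

def map_indico_event(record: dict) -> dict:
    """Map a bespoke Indico record to a schema.org Event node."""
    contributors = []
    # Assumption: chairs and creator are lists of person dicts.
    for person in record.get("chairs", []) + record.get("creator", []):
        entry = {
            "@type": "Person",  # every contributor is assumed to be a Person
            "name": person.get("fullName"),
            "familyName": person.get("last_name"),
            "givenName": person.get("first_name"),
        }
        if person.get("affiliation"):
            # Affiliation strings become proper Organization entities.
            entry["affiliation"] = {
                "@type": "Organization",
                "name": person["affiliation"],
            }
        contributors.append(entry)
    return {
        "@type": "Event",  # every Indico entity is assumed to be an Event
        "name": record.get("title"),
        "location": record.get("address"),
        "contributor": contributors,
    }
```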

OAI

The mapping for data found via OAI-PMH depends heavily on the provided metadata schema.
For now, only Dublin Core and Marc21 are supported. Incoming data is mapped into schema.org / the Helmholtz KG data model as described below. If no class type can be inferred, CreativeWork is assumed.

Details

Marc21:

  1. Fields 100a, 110a, 111a, and 700a are mapped to schema.org property author as a name of an entity of class type Person.
  2. Fields 260c and 264c are mapped to schema.org property dateCreated.
  3. Fields 245a and 245b are concatenated and mapped to schema.org property name.
  4. Fields 260b and 264b are mapped to schema.org property publisher.
  5. Fields 024 are mapped to schema.org property identifier.
  6. Fields 540f are mapped to schema.org property license.
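
These field rules could be expressed as a simple lookup table; the layout below is illustrative, not the mapping engine's actual data structure:

```python
# Illustrative lookup table mirroring the Marc21 rules above.
MARC21_TO_SCHEMA_ORG = {
    "100a": "author", "110a": "author", "111a": "author", "700a": "author",
    "260c": "dateCreated", "264c": "dateCreated",
    "245a": "name", "245b": "name",  # 245a and 245b are concatenated
    "260b": "publisher", "264b": "publisher",
    "024": "identifier",
    "540f": "license",
}
```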

Dublin Core:

  1. type entries are used to infer a schema.org class type via a mapping table.
  2. creator entries are mapped to schema.org property author as a name of an entity of class type Person.
  3. date entries are mapped to schema.org property dateCreated.
  4. publisher entries are mapped to schema.org property publisher.
  5. identifier entries are mapped to schema.org property identifier.
  6. license entries are mapped to schema.org property license.
  7. description entries are concatenated and mapped to schema.org property description.
  8. subject entries are mapped to schema.org property keywords.
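
A hedged sketch of the Dublin Core mapping, assuming each element arrives as a list of strings; the type-table excerpt and the function name are illustrative assumptions:

```python
# Hypothetical sketch of the Dublin Core mapping; the table below is an
# invented two-entry excerpt standing in for the real mapping table.
DC_TYPE_TO_SCHEMA_ORG = {"dataset": "Dataset", "text": "CreativeWork"}


def map_dublin_core(record: dict) -> dict:
    """Map a Dublin Core record (each element a list of strings)."""
    node = {"@type": "CreativeWork"}  # default when no type can be inferred
    for entry in record.get("type", []):
        if entry.lower() in DC_TYPE_TO_SCHEMA_ORG:
            node["@type"] = DC_TYPE_TO_SCHEMA_ORG[entry.lower()]
            break
    node["author"] = [
        {"@type": "Person", "name": creator}
        for creator in record.get("creator", [])
    ]
    node["dateCreated"] = record.get("date")
    node["publisher"] = record.get("publisher")
    node["identifier"] = record.get("identifier")
    node["license"] = record.get("license")
    # Multiple descriptions are concatenated into one.
    node["description"] = " ".join(record.get("description", []))
    node["keywords"] = record.get("subject", [])
    return node
```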

ROR

ROR provides data using its own bespoke schema which needs to be mapped to schema.org / the Helmholtz KG data model.

Details
  1. types entries are mapped to a schema.org class type via a mapping table.
  2. names entries having the attribute ror_display are taken as schema.org property name. Remark: the first such entry is also taken as HDO property prefName.
  3. names entries having at least one of the attributes 'label', 'alias' or 'acronym' are taken as schema.org property alternateName.
  4. external_ids entries are mapped to schema.org property identifier as an entity of class type PropertyValue.
  5. relationships entries having the attribute parent are mapped to schema.org property parentOrganization as an entity of class type Organization.
  6. relationships entries having the attribute child are mapped to schema.org property subOrganization as an entity of class type Organization.
  7. links entries having the type website are taken as schema.org property url.
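
A sketch of the name handling in steps 2 and 3, assuming ROR's v2 record layout where each names entry carries a value and a list of types; the function name is hypothetical:

```python
# Hypothetical sketch of steps 2 and 3; assumes ROR v2 names entries of the
# form {"value": ..., "types": ["ror_display", "label", ...]}.

def map_ror_names(names: list[dict]) -> dict:
    node: dict = {"alternateName": []}
    for entry in names:
        types = entry.get("types", [])
        if "ror_display" in types and "name" not in node:
            # First ror_display entry: schema.org name (and HDO prefName).
            node["name"] = entry["value"]
        if {"label", "alias", "acronym"} & set(types):
            node["alternateName"].append(entry["value"])
    return node
```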

Schema.org (Sitemap)

Sitemap metadata is already available in schema.org format, so only minor modifications are required for integration.

Details
  1. If description contains a list of descriptions, these are joined into one single description.
  2. If keywords are provided as a comma-separated string, it is split at each comma and transformed into a proper list.
  3. Some data sources provide people's affiliation as a plain string, which is not compliant with schema.org. In that case, the string is used as the name of a proper affiliation object.
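
A minimal sketch bundling these three fixes; the function name is hypothetical:

```python
# Hypothetical helper for the three sitemap normalizations listed above.

def normalize_sitemap_node(node: dict) -> dict:
    # 1. Join a list of descriptions into a single description.
    if isinstance(node.get("description"), list):
        node["description"] = " ".join(node["description"])
    # 2. Split comma-separated keyword strings into a proper list.
    if isinstance(node.get("keywords"), str):
        node["keywords"] = [kw.strip() for kw in node["keywords"].split(",")]
    # 3. Wrap bare affiliation strings in a proper Organization object.
    # Assumption: author is a list of person dicts.
    for author in node.get("author", []):
        if isinstance(author.get("affiliation"), str):
            author["affiliation"] = {
                "@type": "Organization",
                "name": author["affiliation"],
            }
    return node
```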

Data Validation

Input data handling

Any JSON-LD document can be ingested for the mapping and validation process. However, it is preferred to use JSON-LD which conforms to the defined JSON Schemas.

Handling @id

When mapping entities, the HKG data model enforces globally unique and valid identifiers using the following logic:

Type of ID and handling:

  1. Absolute URL provided: If the input contains a valid absolute URL (validated via Pydantic’s AnyHttpUrl), it is used directly as @id after further normalization.
  2. Relative identifier provided: If a non-URL identifier is given, it is prefixed with the HKG Graph Prefix (purls.helmholtz-metadaten.de/helmholztkg/), resulting in a globally unique, resolvable identifier.
  3. No identifier provided: As a fallback, a Persistent Identifier (PID) is generated automatically (see ID Generation).
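
The following sketch applies the three cases in order, assuming Pydantic v2; the helper name resolve_id, the https:// scheme on the prefix, and the use of SHA-256 for case 3 are illustrative assumptions (the parent-aware variant is sketched under ID Generation below):

```python
import hashlib
import json

from pydantic import AnyHttpUrl, TypeAdapter, ValidationError

# Prefix quoted in the text; the https:// scheme is added for illustration.
HKG_PREFIX = "https://purls.helmholtz-metadaten.de/helmholztkg/"
_url_adapter = TypeAdapter(AnyHttpUrl)


def resolve_id(raw_id: str | None, entity: dict) -> str:
    """Apply the three @id cases in order."""
    if raw_id:
        try:
            _url_adapter.validate_python(raw_id)
            return raw_id  # case 1: valid absolute URL (normalized further below)
        except ValidationError:
            return HKG_PREFIX + raw_id  # case 2: prefix a relative identifier
    # Case 3: no identifier at all -- hash the entity's input data
    # (simplified; the parent-aware variant is shown under ID Generation).
    digest = hashlib.sha256(json.dumps(entity, sort_keys=True).encode()).hexdigest()
    return HKG_PREFIX + digest
```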

ID Generation

The goal of ID generation is:

  • Same entities → same ID
  • Different entities → different ID

However, this is inherently challenging: without an existing identifier, it is difficult to determine whether two entities are identical. For this, the HKG model implements a best-effort deterministic approach in which an ID is derived by hashing and made dereferenceable as an IRI within our HKG Graph Prefix (purls.helmholtz-metadaten.de/helmholztkg/).

  • ID generation is always based on the input data of the entity
  • For nested entities:
    • The PID also incorporates the first parent element that has a valid PID
    • If no parent has an ID: → The entire input document is used as the basis

ID generation can be customized per model type.
For the following models:

  • DataCatalogModel
  • InstitutionModel

the ID generation is based only on the entity’s own input data.
This means the parent context is ignored.
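
A hedged sketch of this behavior; the hashing scheme, function signature, and model-name check are assumptions, not the exact HKG algorithm:

```python
import hashlib
import json

GRAPH_PREFIX = "https://purls.helmholtz-metadaten.de/helmholztkg/"

# Models whose IDs depend only on their own input data (per the list above).
SELF_CONTAINED_MODELS = {"DataCatalogModel", "InstitutionModel"}


def generate_id(entity: dict, model_name: str,
                parent_pid: str | None, document: dict) -> str:
    """Derive a deterministic, dereferenceable IRI from the input data."""
    basis: list = [entity]
    if model_name not in SELF_CONTAINED_MODELS:
        # Nested entities incorporate the first parent with a valid PID;
        # if no parent has one, the entire input document is the basis.
        basis.append(parent_pid if parent_pid else document)
    digest = hashlib.sha256(
        json.dumps(basis, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return GRAPH_PREFIX + digest
```

Because the hash is computed over canonicalized input data, identical entities reproduce the same IRI across harvesting runs, which is exactly the "same entities → same ID" goal stated above.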

PID normalization

Where valid absolute URLs are provided, they are used directly as the @id of a node. Due to variations in upstream data sources, known persistent identifiers (PIDs) do not always enter the Helmholtz KG in a harmonized form. To prevent unintended duplication of entities, we apply normalization steps to these identifiers during processing. These steps are restricted to well-defined PID systems (e.g., DOI, ROR, ORCID), where normalization does not alter the canonical meaning of the identifier. For general IRIs that may include non-ASCII characters, such transformations are avoided to preserve their original semantics.

The following normalization steps are carried out:

  1. Enforcement of HTTPS
     Example: http://doi.org/10.XXXX/any_identifier → https://doi.org/10.XXXX/any_identifier

  2. Removal of the www. subdomain
     Example: https://www.doi.org/10.XXXX/any_identifier → https://doi.org/10.XXXX/any_identifier

  3. Normalization of capitalization per PID type:

     PID type   Capitalization
     DOI        lowercase
     ROR        lowercase
     ORCID      uppercase

     Example: https://doi.org/10.XXXX/Any_Identifier → https://doi.org/10.XXXX/any_identifier
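
A minimal sketch of these three steps in Python; the host lists and the normalize_pid name are illustrative assumptions derived from the examples above, not the exact HKG implementation:

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed host lists, derived from the capitalization table above.
LOWERCASE_HOSTS = {"doi.org", "ror.org"}
UPPERCASE_HOSTS = {"orcid.org"}


def normalize_pid(pid: str) -> str:
    """Normalize well-known PIDs; leave general IRIs untouched."""
    parts = urlsplit(pid)
    host = parts.netloc.removeprefix("www.")  # step 2: drop the www. subdomain
    path = parts.path
    if host in LOWERCASE_HOSTS:
        path = path.lower()  # step 3: DOI/ROR paths are lowercased
    elif host in UPPERCASE_HOSTS:
        path = path.upper()  # step 3: ORCID checksum letters are uppercased
    else:
        return pid  # not a known PID system: preserve original semantics
    # Step 1: enforce HTTPS while reassembling the identifier.
    return urlunsplit(("https", host, path, parts.query, parts.fragment))


print(normalize_pid("http://www.doi.org/10.1234/Any_Identifier"))
# -> https://doi.org/10.1234/any_identifier
```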

Invalid Data

If a specific attribute value is invalid, only that value is discarded. The rest of the entity remains intact, and all issues are logged for traceability.

➡️ This ensures robustness and prevents unnecessary data loss.

Data Enrichment

The mapping process applies lightweight semantic enrichment. This includes automated type inference, where missing entity types are derived from the usage of specific attributes based on Schema.org domain and range definitions. For example, the presence of an attribute such as affiliation may lead to the classification of an entity as a Person, since that property's domain is unique to the Person type. This improves the overall structure and queryability of the graph.
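
As a small illustration of this inference step (the property-to-type table below is a minimal invented excerpt, not the full set of Schema.org domain definitions):

```python
# Invented two-entry excerpt; the real inference draws on Schema.org's
# full domain/range definitions.
DISTINCTIVE_PROPERTIES = {
    "affiliation": "Person",              # affiliation's domain is Person
    "parentOrganization": "Organization",
}


def infer_type(node: dict) -> dict:
    """Fill in a missing @type from a distinctive property, if any."""
    if "@type" not in node:
        for prop, inferred_type in DISTINCTIVE_PROPERTIES.items():
            if prop in node:
                node["@type"] = inferred_type
                break
    return node


print(infer_type({"name": "Jane Doe", "affiliation": "Example Centre"}))
# -> {'name': 'Jane Doe', 'affiliation': 'Example Centre', '@type': 'Person'}
```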