
Graph Setup & User Interfaces

Once the data has run through the entire pipeline and passed all harvesting, mapping, and processing steps, it is ingested into the Helmholtz KG. The following diagram gives an overview of the setup that is used to store this data and provide user access.

RDF graph

After harvesting and mapping, all metadata is stored in the OpenLink Virtuoso graph database, which serves as the single source of truth for every graph interface. The graph can be accessed directly from this database via its Virtuoso SPARQL endpoint.

As the Helmholtz Knowledge Graph is growing rapidly and currently comprises more than 250 million triples, an additional graph index is built using QLever to improve query performance. The QLever SPARQL endpoint provides significantly faster query execution and should therefore be used as the primary endpoint for most use cases.

Please note that the hardware resources of the SPARQL endpoints are limited. Complex queries that produce large Cartesian products, for example through extensive use of UNION statements, may result in out-of-memory errors. See SPARQL: Querying the Helmholtz Knowledge Graph for further details.
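As a minimal sketch of querying a SPARQL endpoint from Python, the following uses only the standard library and the standard SPARQL 1.1 Protocol (a GET request with a `query` parameter). The endpoint URL is a placeholder, not the real address, and the `schema:` namespace and property names in the query are assumptions about how the graph is modelled. The `LIMIT` clause keeps the result small, which helps on endpoints with limited resources.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder (assumption): replace with the actual QLever endpoint URL.
ENDPOINT = "https://example.org/sparql"

# Assumption: types and names use the http://schema.org/ namespace.
QUERY = """
PREFIX schema: <http://schema.org/>
SELECT ?dataset ?name WHERE {
  ?dataset a schema:Dataset ;
           schema:name ?name .
}
LIMIT 10
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(endpoint: str, query: str, timeout: float = 30.0) -> list:
    """Send the query and return the list of result bindings (network I/O)."""
    with urlopen(build_sparql_request(endpoint, query), timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]
```

Calling `run_query(ENDPOINT, QUERY)` against a real endpoint would return one dictionary per result row. Setting a client-side timeout and starting with a small `LIMIT` is a cautious default when exploring an unfamiliar graph.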

Web UI

The graph can also be explored via a graphical user interface. The UI supports free-text search and categorical filtering, enabling intuitive data exploration without requiring knowledge of a specific query language. This search functionality is backed by an OpenSearch index that is built directly from the graph. In a nutshell, entities are grouped according to the categories defined in the data model and indexed in OpenSearch together with their directly connected entities.

This makes it possible to search not only for entities themselves but also for their immediate relationships. For example, looking up a Person will reveal not only core information like name and identifier, but also related entities such as documents authored by that Person or affiliated Institutions. While this approach preserves the connected nature of the underlying graph, only direct relationships can be identified through this interface. More complex queries require the use of SPARQL. For further details, see the graphical user interface documentation.

Indexing of RDF data

The final step bridges the gap between a complex graph and a fast, searchable web interface.

  • The QLever Mirror: Before indexing begins, an automated job dumps the Virtuoso graph into QLever. This provides a high-performance environment for the intensive SPARQL queries required for the next stage.
  • Graph-to-Relational Mapping: A specialized repository queries QLever (instead of Virtuoso) to leverage its superior read performance. It uses the internal data model to flatten complex graph connections (e.g., links between Documents, People, and Organizations) into an OpenSearch index.
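The flattening step described above can be illustrated with a small, self-contained sketch. It is not the actual indexer: the triples, identifiers, and the `flatten_entity` helper are hypothetical, and the real repository reads its data from QLever rather than from an in-memory list. The sketch only shows the idea of turning an entity plus its one-hop neighbours into a single flat document of the kind a search index could store.

```python
from collections import defaultdict

# Toy (subject, predicate, object) triples; all identifiers are made up.
TRIPLES = [
    ("ex:person1", "rdf:type", "schema:Person"),
    ("ex:person1", "schema:name", "Ada Example"),
    ("ex:person1", "schema:affiliation", "ex:org1"),
    ("ex:org1", "rdf:type", "schema:Organization"),
    ("ex:org1", "schema:name", "Example Institute"),
    ("ex:doc1", "rdf:type", "schema:Article"),
    ("ex:doc1", "schema:name", "An Example Article"),
    ("ex:doc1", "schema:author", "ex:person1"),
]


def flatten_entity(entity, triples):
    """Return one flat document: the entity's own property values plus a
    short summary of every entity exactly one hop away (either direction)."""
    subjects = {s for s, _, _ in triples}
    props = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        props[s][p].append(o)

    def summary(node):
        # Keep only the identifier and the first name of a neighbour.
        return {"id": node, "name": props[node].get("schema:name", [None])[0]}

    doc = {"id": entity, "related": []}
    for p, values in props[entity].items():
        for o in values:
            if o in subjects:
                # The object is itself an entity: outgoing relationship.
                doc["related"].append({"via": p, **summary(o)})
            else:
                # Plain value (literal or type): keep it as a field.
                doc.setdefault(p, []).append(o)
    for s, p, o in triples:
        if o == entity and s != entity:
            # Another entity points at this one: incoming relationship.
            doc["related"].append({"via": p, **summary(s)})
    return doc


person_doc = flatten_entity("ex:person1", TRIPLES)
```

Here `person_doc` holds the person's own name and type together with the affiliated organization and the authored article, but nothing two or more hops away, which mirrors the limitation of the UI to direct relationships.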

Categorization Logic

To keep the UI organized, the indexer filters graph nodes into 8 primary categories based on their internal model types. If a graph node does not match one of these types, it remains in the Knowledge Graph but is not indexed for the UI.

| Category | Model Types / Vocabularies |
| --- | --- |
| Datasets | Dataset |
| Documents | Article, ScholarlyArticle, Book, Chapter, Text, Periodical, Thesis, Report |
| Software | SoftwareApplication, SoftwareSourceCode |
| Organizations | Organization |
| Institutions | Organization, ResearchOrganization, EducationalOrganization, MedicalOrganization, FundingAgency, Corporation, ArchiveOrganization, NGO, GovernmentOrganization |
| Experts | Person |
| Events | Event |
| Instruments | Specialized URIs: SociologicalInstrument (DDI-Discovery) and PhysicalInstrument (DataCite) |

Note: While most categories are derived from Schema.org, the Instruments category uses specialized external vocabularies (e.g., DDI Discovery and DataCite) to ensure domain-specific precision. Types like CreativeWork or Intangible exist within the Knowledge Graph but are not currently indexed for the UI.
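The categorization rule can be sketched as a simple lookup. The mapping below copies the category and type names from the table; using bare type names instead of full vocabulary URIs is a simplification, and the `categories_for` helper is a hypothetical name, not part of the actual indexer.

```python
# Category -> model types, transcribed from the table above.
# Simplification (assumption): bare type names stand in for full URIs.
CATEGORY_TYPES = {
    "Datasets": {"Dataset"},
    "Documents": {
        "Article", "ScholarlyArticle", "Book", "Chapter",
        "Text", "Periodical", "Thesis", "Report",
    },
    "Software": {"SoftwareApplication", "SoftwareSourceCode"},
    "Organizations": {"Organization"},
    "Institutions": {
        "Organization", "ResearchOrganization", "EducationalOrganization",
        "MedicalOrganization", "FundingAgency", "Corporation",
        "ArchiveOrganization", "NGO", "GovernmentOrganization",
    },
    "Experts": {"Person"},
    "Events": {"Event"},
    "Instruments": {"SociologicalInstrument", "PhysicalInstrument"},
}


def categories_for(model_type):
    """Return every UI category whose type list contains model_type.

    An empty list means the node stays in the Knowledge Graph but is
    not indexed for the UI.
    """
    return [cat for cat, types in CATEGORY_TYPES.items() if model_type in types]


print(categories_for("Person"))        # ['Experts']
print(categories_for("CreativeWork"))  # []
```

Returning a list rather than a single category reflects the table as written: Organization appears under both Organizations and Institutions, so a lookup for that type yields two categories.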