Semantic Integration

Data in the oil and gas domain typically resides in many different sources and can have vastly different forms and access methods. To ensure optimal decision making, all of this data must be taken into account: an end user needs to be able to view and understand all of it.

Accessing the data in its legacy formats requires in-depth, low-level knowledge of how it is stored, which is a considerable challenge for end users. By integrating all data under a common ontology, users can view and explore it in a language they understand. This research program aims to address the issues that arise during this integration process, in particular by designing and developing scalable infrastructure for integrating multiple large data sets with large-scale ontologies.

The Semantic Integration research program designs and develops scalable infrastructure that supports semantic integration using large ontologies (with many thousands of classes) and massive data sets (many billions of tuples). It will demonstrate the efficacy of these tools through deployment in the beacon projects. Specifically, we work with ontology reasoners capable of supporting the development of large-scale ontologies, and with semantic data stores that answer realistic ontology-based queries over massive data sets.

Digitization of oil and gas depends on integrating data from different sources with different forms and access methods. For example, some data owners may make data available in a raw format, while others may make it available only through their custom Application Programming Interfaces (APIs). The end user wants a uniform view of the data, without having to understand the underlying, often low-level methods needed to retrieve it.

Most of the data in the oil and gas industry resides in traditional database management systems (DBMSs) and is usually accessed using queries written in SQL. However, SQL is challenging for non-experts to use, especially when they want to access and combine heterogeneous data. We do not want to expose this data directly to the end user. Instead, we want to let users ask for data in terms of their own description of reality. This is the vision behind semantic integration. SIRIUS is working on two possible ways of providing semantic integration: query rewriting and materialization.

Query rewriting. The most efficient way of accessing data depends on its representation format. For example, data in a DBMS is stored in tables, and its query language, SQL, therefore expects tabular input and produces tabular output. However, one can design other ways of representing data and define a mapping from one representation to another. For instance, an RDF graph can be stored in a DBMS by serializing it edge-wise and storing the edges as a table. Queries in SPARQL, the graph query language designed for RDF, must then be translated into SQL. This process is called query rewriting or mapping.

The advantage of query rewriting is that it gives the flexibility to use any storage and representation format at the backend, and any other format for querying, as long as the query language can be formally mapped to the data storage format.
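
To make the idea concrete, here is a minimal sketch in Python using the standard sqlite3 module. The triples table, the predicates and the equipment identifiers are all invented for illustration: an RDF graph is stored edge-wise as rows of a single table, and a SPARQL-style graph pattern is rewritten (by hand, in this sketch) into an equivalent SQL self-join. A real rewriting engine such as ONTOP generates such SQL automatically from declarative mappings.

    # Minimal query-rewriting sketch: an RDF graph stored edge-wise in a
    # relational table, queried via SQL. All names are invented examples.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    conn.executemany(
        "INSERT INTO triples VALUES (?, ?, ?)",
        [
            (":compressor1", "rdf:type",     ":Compressor"),
            (":compressor1", ":installedOn", ":platformA"),
            (":pump7",       "rdf:type",     ":Pump"),
            (":pump7",       ":installedOn", ":platformA"),
        ],
    )

    # SPARQL-style pattern:  ?x rdf:type :Compressor . ?x :installedOn ?site
    # Rewritten (by hand, in this sketch) into a SQL self-join over the edge table:
    sql = """
        SELECT t1.s AS x, t2.o AS site
        FROM triples t1 JOIN triples t2 ON t1.s = t2.s
        WHERE t1.p = 'rdf:type' AND t1.o = ':Compressor'
          AND t2.p = ':installedOn'
    """
    for row in conn.execute(sql):
        print(row)   # (':compressor1', ':platformA')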

This approach was used successfully in the Optique EU project, where queries were written in a semantic form and then mapped to SQL databases owned by Statoil (now Equinor) and Siemens. The tool used for query rewriting was ONTOP, a query rewriting engine for the OWL 2 QL profile. The ONTOP project is hosted by the Free University of Bolzano, and SIRIUS is continuing to contribute to the development of this tool. We plan to extend ONTOP with mechanisms that support aggregation queries and queries for analytics.

Materialization. An alternative to query rewriting is materialization. This involves copying, or materializing, the data needed for a query into a format that makes the query efficient and allows ontology-based reasoning. We can then interpret the backend data and infer additional facts from it by applying rules. RDF and its associated ontologies provide a rich resource for doing this: an ontology lets you define rules for interpreting data about data, so that additional facts can be generated, giving more insight into the existing data. Consider a simple example: a compressor is labelled in one database using a NORSOK-format tag and in another using a serial number. An ontology can state that two such names refer to the same equipment, say that compressor1 is the same as compressor2 and compressor2 is the same as compressor3. The materialization process then infers, by transitivity, that compressor1 is also the same as compressor3. Together, query rewriting and materialization allow flexible access to, and interpretation of, data stored in any format.
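
The following sketch, in plain Python with invented identifiers, illustrates the forward-chaining idea behind materialization: the symmetry and transitivity rules for owl:sameAs are applied to a handful of facts until no new facts appear (a fixpoint). Production systems implement this far more efficiently and for much richer rule sets, but the principle is the same.

    # Minimal materialization sketch: forward chaining of the owl:sameAs
    # symmetry and transitivity rules over invented facts, to a fixpoint.
    facts = {
        ("compressor1", "owl:sameAs", "compressor2"),   # e.g. from database A
        ("compressor2", "owl:sameAs", "compressor3"),   # e.g. from database B
    }

    def materialize(triples):
        derived = set(triples)
        while True:
            same = [(s, o) for s, p, o in derived if p == "owl:sameAs"]
            new = {(o, "owl:sameAs", s) for s, o in same}            # symmetry
            new |= {(a, "owl:sameAs", d)                             # transitivity
                    for a, b in same for c, d in same if b == c}
            if new <= derived:       # nothing new: the closure is complete
                return derived
            derived |= new

    closure = materialize(facts)
    print(("compressor1", "owl:sameAs", "compressor3") in closure)   # True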

Materialization can be done effectively using RDFox, a state-of-the-art triple store (graph database). RDFox is uniquely capable of answering queries over more than 10 billion facts, where the answers also account for the knowledge represented in large ontologies. RDFox supports the OWL 2 RL profile for ontologies and the SPARQL query language. Additional features include non-tree-shaped rules, arithmetic and aggregation functions, stratified negation as failure, and incremental reasoning.
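
As a rough illustration of what incremental reasoning means in practice, the sketch below (again plain Python with invented identifiers, handling only the owl:sameAs transitivity rule) shows how the consequences of a single newly arriving fact can be derived without recomputing the whole materialization. This is only a toy stand-in for the incremental algorithms implemented in a system such as RDFox.

    # Minimal incremental-reasoning sketch: derive only the consequences of
    # one new fact, given an already-materialized (toy) closure.
    closure = {
        ("compressor1", "owl:sameAs", "compressor2"),
        ("compressor2", "owl:sameAs", "compressor3"),
        ("compressor1", "owl:sameAs", "compressor3"),
    }

    def add_incrementally(closure, fact):
        agenda = [fact]
        while agenda:
            s, p, o = agenda.pop()
            if (s, p, o) in closure:
                continue              # already known, nothing new to derive
            closure.add((s, p, o))
            if p != "owl:sameAs":
                continue              # no rule applies to other predicates
            # transitivity: join the new edge with the existing sameAs edges
            agenda += [(x, "owl:sameAs", o)      # x ~ s and s ~ o  =>  x ~ o
                       for x, q, y in closure if q == "owl:sameAs" and y == s]
            agenda += [(s, "owl:sameAs", y)      # s ~ o and o ~ y  =>  s ~ y
                       for x, q, y in closure if q == "owl:sameAs" and x == o]

    # A newly arriving fact (e.g. from a stream) extends the closure in place:
    add_incrementally(closure, ("compressor3", "owl:sameAs", "compressor4"))
    print(("compressor1", "owl:sameAs", "compressor4") in closure)   # True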

Work on RDFox started in 2014, and it is still under active development. In SIRIUS we have been investigating how to extend RDFox to support streaming data, and how to distribute its reasoning over a cluster. Streaming support is important in settings where streaming data interacts with complex domain ontologies; this interaction makes it difficult to use traditional window-based methods of dealing with data streams. Distribution is important for very large data sets, since RDFox stores data in memory for efficient reasoning; we aim to overcome the limits imposed by this main-memory design by completing the development and evaluation of a fully distributed version. We are also improving optimisations such as query planning.

In 2018, we also designed and implemented a new ontology reasoner, called Sequoia. It applies, for the first time, consequence-based reasoning to the entire OWL language; until now, consequence-based reasoning was only applicable to subsets of OWL. Sequoia already outperforms state-of-the-art reasoners on hard ontologies. In 2019 we will further develop our algorithms and prototype implementation into a fully-fledged OWL reasoner that significantly advances the state of the art.
