Domain-Adapted Data Science

Industry-relevant data comes in many forms, from structured sources (e.g., databases) to unstructured sources (e.g., natural language documents intended to be read by humans). Access to all this data is only as useful as the methods available for evaluating it and using it to make decisions.

 

This research program focuses in particular on developing approaches that combine the use of structured knowledge with learning from data in the machine learning process.

Both within and beyond the artificial intelligence (AI) research community, there is interest in developing strong AI, that is, intelligent machines that are indistinguishable from the human mind or that go beyond human-level intelligence (superintelligence). However, despite the impressive progress in the field over the last decades, we still do not know how to achieve strong AI.

 

As an example of human capability, imagine a child who has seen the usual animals that live on Norwegian farms, such as sheep and horses, but has never seen a giraffe, neither in person nor in pictures. If the child has sufficient language skills, then you can tell them, before going to a zoo, that a giraffe is an animal that looks like a horse but with a very long neck. Then, at the zoo, most likely the child will correctly identify a giraffe as a giraffe with ease.

This example shows that humans can combine what they have learned from experience (in our example, what they have seen before) with declarative statements (the description of the similarities and differences between giraffes and horses). Machines are not yet good at that.

 

In the AI community, there has been a long-standing divide, often framed as the discussion about «symbolic vs. sub-symbolic AI». In symbolic AI, information is structured in ontologies and deductions are made via reasoning. In sub-symbolic AI, information is obtained from data, and deductions are made via machine learning (ML). One hypothesis is that machines will become better at learning if we can combine these two types of information in the learning process, that is, if we combine information expressed as declarative statements and ontologies with information encoded in data and made available through statistics and machine learning.

 

The focus of the SIRIUS Domain-Adapted Data Science program is precisely to develop approaches that combine the use of structured knowledge with learning from data in the machine learning process. On the one hand, this means that we try to bridge the traditional divide between symbolic and sub-symbolic learning, developing what we refer to as hybrid approaches. On the other hand, as it turns out, hybrid approaches yield improved machine learning results, especially on «not-so-big data». Combined with the fact that symbolically represented knowledge can often be very small and concise, this is a powerful tool that makes machine learning available on datasets that are too small, or otherwise unfit, for classical machine learning tasks. The overarching goal is a general methodology for how data science tasks can be enhanced through the combined use of symbolic and sub-symbolic knowledge.

 

Another intriguing feature of hybrid approaches is that the presence of symbolic knowledge in the machine learning process may lead to more explainable predictions. Within our research program, we also develop novel hybrid approaches that identify and exploit these capabilities.

One term that we use to refer to a particular class of hybrid approaches is domain-adapted approaches. Very often in machine learning and data science tasks, data in the form of textual documents, images, or tables is processed, where making use of domain knowledge, for example in the form of an ontology, can improve the results.

Furthermore, we carry out research related to explainable AI (XAI). Current approaches and technologies in XAI mostly focus on shedding light on the behavior of black-box machine learning models (such as deep neural networks) by explaining their decisions to users. However, little work has been done on systematically using the information provided by the explanations to improve the accuracy, fairness, and robustness of the models. In SIRIUS, we have studied this research area and devised explanation-based frameworks for investigating the accuracy and robustness of black-box ML classification models [1].

Challenges

  • How can we combine machine learning and machine knowing such that we can realize approaches that learn from experience (data) and from knowledge (e.g., ontologies)?
  • How can we combine symbolically represented knowledge (e.g., as in ontologies) and distributionally represented knowledge (e.g., as in vector space embeddings)?
  • How can we combine symbolic processing (e.g., via rules) and neural processing (e.g., via deep neural networks)?
  • How can we build approaches that make use of both unstructured data (e.g., texts) and structured data (e.g., ontologies), such that all available knowledge can be taken into account for machine learning approaches?
  • How can we make machine learning approaches more explainable, e.g., by incorporating declarative domain knowledge?

Approach

Erik Bryhn Myklebust, Ernesto Jimenez-Ruiz, Jiaoyan Chen and colleagues have shown in the context of ecotoxicological effect prediction that the accuracy of predictions can be improved when domain knowledge is incorporated into the prediction model [2]. At ISWC 2019 they received the Best In-Use Student Paper Award. The disputation of Erik’s PhD thesis «Ecotoxicological Effect Prediction using a Tailored Knowledge Graph» took place in October 2022.

Ole Magnus Holter and Basil Ell develop approaches that make use of domain knowledge in the context of semantic parsing of textual requirements. Their goal is to formally represent (parts of) the meaning of textual requirements, so that the meaning of requirements, such as “Shell boilers with a shell diameter of 1400 mm or greater shall be designed to permit entry of a person and shall be provided with a manhole for this purpose”, becomes more accessible to machines and the management of requirements can be improved [3]. To this end, they make use of rule-based approaches, pre-trained large language models, declarative domain knowledge, and linguistic knowledge.

Egor V. Kostylev and colleagues study theoretical and practical connections between graph neural networks (GNNs), a modern structure-aware machine learning architecture, and classic logic-based knowledge representation formalisms. In particular, they designed a family of monotonic GNNs that allow for an efficient translation to the logic-based language Datalog, and developed the efficient INDIGO system for knowledge graph completion [4,5]. These connections may be further developed into more general and tighter connections between logical reasoning and statistical inference, and may also be applied in specific use cases.

In the context of a task relevant for the oil and gas industry, namely reservoir analogue identification, Summaya Mumtaz and Martin Giese have shown that a similarity measure based on the combination of domain knowledge (in the form of a taxonomy) with classical frequency-based features leads to significantly better results [6]. The disputation of Summaya’s PhD thesis «Hierarchy-based Similarity Measures and Embeddings Supporting Machine Learning by Knowledge» took place in November 2021.
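To make the idea of combining a taxonomy with frequency-based features more concrete, the following is a minimal sketch and not the measure from [6]. It assumes a hypothetical toy taxonomy of rock types, a Wu-Palmer-style taxonomy similarity, an occurrence-frequency-style similarity for categorical values, and a simple weighted average controlled by a hypothetical parameter alpha.

```python
import math

# Hypothetical toy taxonomy (child -> parent); not taken from the cited work.
PARENT = {
    "sandstone": "clastic",
    "shale": "clastic",
    "limestone": "carbonate",
    "dolomite": "carbonate",
    "clastic": "rock",
    "carbonate": "rock",
}


def ancestors(node):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def taxonomy_similarity(a, b):
    """Wu-Palmer-style score: 2 * depth(lca) / (depth(a) + depth(b)); the root has depth 1."""
    path_a, path_b = ancestors(a), ancestors(b)
    in_b = set(path_b)
    # The first ancestor of a that is also an ancestor of b is the lowest common ancestor.
    lca = next(n for n in path_a if n in in_b)
    return 2 * len(ancestors(lca)) / (len(path_a) + len(path_b))


def frequency_similarity(a, b, counts):
    """Occurrence-frequency-style score: mismatches between rare values are penalized more."""
    if a == b:
        return 1.0
    total = sum(counts.values())
    return 1.0 / (1.0 + math.log(total / counts[a]) * math.log(total / counts[b]))


def combined_similarity(a, b, counts, alpha=0.5):
    """Weighted average of the knowledge-based and the frequency-based score (assumed scheme)."""
    return alpha * taxonomy_similarity(a, b) + (1 - alpha) * frequency_similarity(a, b, counts)
```

With value counts such as {"sandstone": 5, "shale": 3, "limestone": 2}, the combined score rates sandstone as closer to shale than to limestone, because the two share the parent "clastic" in the toy taxonomy.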

In the context of classification, based on a use case that is relevant to the oil and gas industry, namely that of excess inventory reduction, Daniel Bakkelund has developed theory and methodology for improved classification of interchangeable equipment, by integrating equipment structure awareness into classical methods for unsupervised machine learning [7]. The disputation of Daniel’s PhD thesis «Order Preserving Hierarchical Clustering» took place in July 2022.

Jiaoyan Chen, Ernesto Jimenez-Ruiz, Ole Magnus Holter and colleagues have developed an ontology embedding framework named OWL2Vec* that can embed symbolic knowledge in an OWL ontology into a vector space, so that the information can be consumed by machine learning algorithms. OWL2Vec* can be directly applied to ontology completion tasks such as subsumption prediction as well as to help address machine learning challenges, such as sample shortage, by injecting symbolic knowledge [8,9].
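One common way to make graph-structured knowledge consumable by a word-embedding model is to turn the graph into token sequences via random walks. The sketch below illustrates only this general idea on a hypothetical toy graph; it is not the OWL2Vec* implementation, and the edge list, function names, and parameters are assumptions made for illustration.

```python
import random

# Hypothetical toy ontology projected to a directed graph: (subject, relation, object).
EDGES = [
    ("Giraffe", "subClassOf", "Mammal"),
    ("Horse", "subClassOf", "Mammal"),
    ("Mammal", "subClassOf", "Animal"),
    ("Giraffe", "hasPart", "LongNeck"),
]


def build_adjacency(edges):
    """Map each node to its outgoing (relation, target) pairs."""
    adjacency = {}
    for subj, rel, obj in edges:
        adjacency.setdefault(subj, []).append((rel, obj))
    return adjacency


def random_walk(adjacency, start, max_hops, rng):
    """Return one walk as a token sequence: node, relation, node, ..."""
    walk = [start]
    node = start
    for _ in range(max_hops):
        if node not in adjacency:
            break  # dead end: the node has no outgoing edges
        rel, node = rng.choice(adjacency[node])
        walk.extend([rel, node])
    return walk


def generate_corpus(edges, walks_per_node=3, max_hops=2, seed=0):
    """Generate walks from every node that has outgoing edges."""
    rng = random.Random(seed)
    adjacency = build_adjacency(edges)
    return [
        random_walk(adjacency, start, max_hops, rng)
        for start in adjacency
        for _ in range(walks_per_node)
    ]
```

Each walk can then be treated as a sentence and passed to a word-embedding model of choice, so that entities appearing in similar graph contexts receive similar vectors.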

Actionable recourse (AR) techniques are a popular class of post-hoc interpretability approaches that help users of ML models obtain a desired decision from the model. Given an individual’s preferences, an AR technique recommends feasible changes to the individual’s input that lead to the desired model outcome. To generate realistic ARs, it is important to capture and exploit domain information and user preferences in the explanation process. Peyman Rasouli and Ingrid Chieh Yu are working on a model-agnostic framework that combines user/domain-level knowledge with model/data-level information to create plausible ARs that can guide individuals to obtain their desired decision from any ML classification or regression model in a simple and efficient manner [10].
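As a rough illustration of what an actionable recourse looks like, the sketch below performs a brute-force search over small changes to the features a user is able and willing to change. It is not the framework from [10]; the scoring model, feature names, step sizes, and cost function are all hypothetical.

```python
from itertools import product


def approve(applicant):
    """Hypothetical black-box model: approves when an integer score reaches a threshold."""
    score = 2 * applicant["income"] + 5 * applicant["years_employed"] - applicant["age"]
    return score >= 100


def find_recourse(applicant, model, actionable, max_steps=5):
    """Search changes to actionable features only and return the cheapest one that flips the decision.

    actionable maps a feature name to its step size. Features that are not listed,
    such as age, are treated as immutable. The cost of a change is its number of steps.
    """
    names = list(actionable)
    best, best_cost = None, float("inf")
    for steps in product(range(max_steps + 1), repeat=len(names)):
        candidate = dict(applicant)
        for name, k in zip(names, steps):
            candidate[name] += k * actionable[name]
        cost = sum(steps)
        if model(candidate) and cost < best_cost:
            # Report only the features that actually changed.
            best = {n: candidate[n] - applicant[n] for n in names if candidate[n] != applicant[n]}
            best_cost = cost
    return best
```

For example, calling find_recourse({"income": 40, "years_employed": 2, "age": 35}, approve, {"income": 5, "years_employed": 1}) returns a set of increases to income and years of employment that would lead to approval, or None if no change within the step budget suffices. Restricting the search to the actionable features is one simple way of encoding user preferences and domain constraints.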

Current explainable artificial intelligence (XAI) techniques rely only on observational data to analyze and explain the behavior of machine learning models. To increase the comprehensibility and faithfulness of explanations of ML models, it is therefore essential to exploit domain knowledge that bridges the gap between the models and human concepts. Peyman Rasouli and Ingrid Chieh Yu aim to integrate domain knowledge (in the form of knowledge graphs and taxonomies) with structured/tabular data to provide more feasible, comprehensible, and faithful explanations. Peyman will submit his PhD thesis in November 2022.

Roxana Pop is investigating neural methods for temporal prediction on structured data, with an emphasis on extensions of graph neural networks. Her overarching objective is to combine neural methods with symbolic AI approaches and to extract logical rules from the trained neural networks that can be used for making the predictions. In this way, the predictions can be understood by human experts, while the extraction remains data-driven, making the best of the two AI paradigms.

Gong Cheng, Evgeny Kharlamov and colleagues investigated keyword-based exploration of knowledge graphs [11,12] and proposed a novel method to generate smart snippets or summaries of large-scale knowledge graphs. Baifan Zhou, Evgeny Kharlamov and colleagues from SIRIUS showed how to facilitate the development of ML models using semantic technologies [13]. They then investigated several practical aspects of knowledge graph management in connection with analytics and machine learning, motivated by applications from Industry 4.0 [14,15]. Specifically, they showed how to scale the usability of ML analytics and how to reshape industrial knowledge graphs. Moreover, Baifan and Evgeny consolidated a number of research directions into SIndAIS4, an advanced SIRIUS project that aims at Scaling Industrial AI with Semantics in four directions: human, data, methods, and applications. Within this project and together with Ahmet Soylu, they selected several Bosch-funded interns (students of Ahmet), thus strengthening the Bosch-SIRIUS collaboration and disseminating it at two large Norwegian universities: NTNU and OsloMet.

In 2018, Farhad Nooralahzadeh and colleagues demonstrated a system for extracting relations from scientific text [18]. The system was built with convolutional neural networks using semi-supervised learning and structured domain knowledge. It was a top performer in the SemEval-2018 international shared task, Task 7 on Semantic Relation Extraction and Classification in Scientific Papers, where it ranked third out of 28 participants. The disputation of Farhad’s PhD thesis «Low-Resource Adaptation of Neural NLP Models» took place in October 2020.

Basil Ell develops approaches to align symbolic data (i.e., ontologies) with sub-symbolic data (e.g., texts or tables). The alignment enables labeled training data to be generated via distant supervision for approaches such as information extraction (IE) for ontology population or natural language generation (NLG). Having symbolic and sub-symbolic data aligned means obtaining hybrid data that can be processed by hybrid approaches. He received a best paper award at LDK 2021, the 3rd Conference on Language, Data and Knowledge, for his work on mining association rules that help to bridge between text and data [16]. Recently, he investigated the benefits of taking into account literals (i.e., unstructured data) for the prediction of links in knowledge graphs [17]. Furthermore, he develops statistical approaches that are applied to symbolic data (knowledge graphs) for the purposes of identifying regularities and anomalies, predicting missing facts, evaluating the structural plausibility of facts, bridging between structured and unstructured data (as in IE, question answering, and NLG), and structurally classifying regions within graphs (which is similar to sequence labeling, but on graphs).
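The basic mechanism of distant supervision can be sketched as follows. This is a deliberately naive illustration under stated assumptions, not the alignment approach described above: the triples and sentences are hypothetical, and a sentence is labeled with a relation whenever it mentions both the subject and the object of a known triple.

```python
def distant_supervision(triples, sentences):
    """Label sentences with relations from known triples via simple string matching.

    triples: iterable of (subject, predicate, object) with plain-text labels.
    Returns a list of (sentence, subject, object, predicate) training examples.
    """
    labeled = []
    for sentence in sentences:
        lowered = sentence.lower()
        for subj, pred, obj in triples:
            # Assumption: co-occurrence of both labels signals that the relation is expressed.
            if subj.lower() in lowered and obj.lower() in lowered:
                labeled.append((sentence, subj, obj, pred))
    return labeled
```

Because co-occurrence does not guarantee that a sentence actually expresses the relation, labels obtained this way are noisy, which is one reason why more careful alignment between text and structured data is valuable.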

Results

Selected Publications

[2] Erik B. Myklebust, Ernesto Jiménez-Ruiz, Jiaoyan Chen, Raoul Wolf, Knut Erik Tollefsen: «Prediction of Adverse Biological Effects of Chemicals Using Knowledge Graph Embeddings.» Semantic Web journal (2021). [http://www.semantic-web-journal.net/system/files/swj2804.pdf]

[3] Ole Magnus Holter and Basil Ell: «Towards Scope Detection in Textual Requirements.» 3rd Conference on Language, Data and Knowledge (LDK 2021). [https://drops.dagstuhl.de/opus/volltexte/2021/14567/pdf/OASIcs-LDK-2021-31.pdf]

[4] David Jaime Tena Cucala, Bernardo Cuenca Grau, Egor V. Kostylev, Boris Motik: «Explainable GNN-Based Models over Knowledge Graphs.» International Conference on Learning Representations (ICLR 2022). [https://openreview.net/pdf?id=CrCvGNHAIrz]

[5] Shuwen Liu, Bernardo Cuenca Grau, Ian Horrocks, Egor V. Kostylev: «INDIGO: GNN-based inductive knowledge graph completion using pair-wise encoding.» Advances in Neural Information Processing Systems 34 (NeurIPS 2021). [https://proceedings.neurips.cc/paper/2021/file/0fd600c953cde8121262e322ef09f70e-Paper.pdf]

[6] Summaya Mumtaz and Martin Giese: «Frequency-Based vs. Knowledge-Based Similarity Measures for Categorical Data.» AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering (1). 2020.

[7] Daniel Bakkelund: «Order preserving hierarchical agglomerative clustering.» Machine Learning (2021). [https://link.springer.com/content/pdf/10.1007/s10994-021-06125-0.pdf]

[8] Jiaoyan Chen, Pan Hu, Ernesto Jiménez-Ruiz, Ole Magnus Holter, Denvar Antonyrajah, Ian Horrocks: «OWL2Vec*: embedding of OWL ontologies.» Machine Learning 110.7 (2021): 1813-1845. [https://link.springer.com/content/pdf/10.1007/s10994-021-05997-6.pdf]

[9] Jiaoyan Chen, Ernesto Jiménez-Ruiz, Ian Horrocks, Denvar Antonyrajah, Ali Hadian, Jaehun Lee: «Augmenting Ontology Alignment by Semantic Embedding and Distant Supervision.» Extended Semantic Web Conference (2021): 392-408.

[10] Peyman Rasouli and Ingrid Chieh Yu: «CARE: Coherent actionable recourse based on sound counterfactual explanations.» International Journal of Data Science and Analytics (2022): 1-26. [https://link.springer.com/content/pdf/10.1007/s41060-022-00365-6.pdf]

[11] Yuxuan Shi, Gong Cheng, Trung-Kien Tran, Evgeny Kharlamov, Yulin Shen: «Efficient Computation of Semantically Cohesive Subgraphs for Keyword-Based Knowledge Graph Exploration.» WWW 2021: 1410-1421.

[12] Yuxuan Shi, Gong Cheng, Trung-Kien Tran, Jie Tang, Evgeny Kharlamov: «Keyword-Based Knowledge Graph Exploration Based on Quadratic Group Steiner Trees.» IJCAI 2021: 1555-1562. [https://www.ijcai.org/proceedings/2021/0215.pdf]

[13] Baifan Zhou, Yulia Svetashova, Andre Gusmao, Ahmet Soylu, Gong Cheng, Ralf Mikut, Arild Waaler, Evgeny Kharlamov: «SemML: Facilitating development of ML models for condition monitoring with semantics.» Journal of Web Semantics 71: 100664 (2021). [https://reader.elsevier.com/reader/sd/pii/S1570826821000391]

[14] Baifan Zhou, Dongzhuoran Zhou, Jieying Chen, Yulia Svetashova, Gong Cheng, Evgeny Kharlamov: «Scaling Usability of ML Analytics with Knowledge Graphs: Exemplified with A Bosch Welding Case.» IJCKG 2021: 54-63. [https://dl.acm.org/doi/pdf/10.1145/3502223.3502230]

[15] Dongzhuoran Zhou, Baifan Zhou, Jieying Chen, Gong Cheng, Egor V. Kostylev, Evgeny Kharlamov: «Towards Ontology Reshaping for KG Generation with User-in-the-Loop: Applied to Bosch Welding.» IJCKG 2021: 145-150.

[16] Basil Ell, Mohammad Fazleh Elahi, and Philipp Cimiano: «Bridging the Gap Between Ontology and Lexicon via Class-Specific Association Rules Mined from a Loosely-Parallel Text-Data Corpus.» 3rd Conference on Language, Data and Knowledge (LDK 2021). [https://drops.dagstuhl.de/opus/volltexte/2021/14569/pdf/OASIcs-LDK-2021-33.pdf]

[17] Moritz Blum, Basil Ell, and Philipp Cimiano: «Exploring the impact of literal transformations within knowledge graphs for link prediction.» In Proceedings of the 11th International Joint Conference on Knowledge Graphs (IJCKG 2022), 2022.

[18] Farhad Nooralahzadeh, Lilja Øvrelid, and Jan Tore Lønning: «SIRIUS-LTG-UiO at SemEval-2018 Task 7: Convolutional Neural Networks with Shortest Dependency Paths for Semantic Relation Extraction and Classification in Scientific Papers.» In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 805–810, New Orleans, Louisiana. Association for Computational Linguistics. [https://aclanthology.org/S18-1128.pdf]

Team

Baifan Zhou, Basil Ell (program leader), Daniel Bakkelund, Egor V. Kostylev, Erik Bryhn Myklebust, Ernesto Jimenez-Ruiz, Evgeny Kharlamov, Evgenij Thorstensen, Farhad Nooralahzadeh, Gong Cheng, Ingrid Chieh Yu, Martin Giese, Maximilian Pflüger, Jiaoyan Chen, Ole Magnus Holter, Peyman Rasouli, Roxana Pop, Shuwen Liu, Summaya Mumtaz

Partners

External Partners
Samsung Research UK
The Alan Turing Institute
University of Lisbon
University of Malaga

Acknowledgements

This work was partially supported by the SIRIUS Centre for Scalable Data Access (Research Council of Norway, project 237889).

The Domain-Adapted Data Science group would like to thank all SIRIUS partners for their collaboration through various projects, including: Ahmet Soylu (Sintef), Anders Gjerver (Aibel), Atle Vesterkjær (numascale), Bobby Lindsey (TechnipFMC), Derek Smith (TechnipFMC), Elisabeth Nøst (TechnipFMC), Frode Myren (IBM), Gad-Elrab Mohamed (Bosch Center for AI), Jens Grimsgård (Equinor), Johan Wilhelm Kluewer (DNV), Kjetil Ellingsen (Equinor), Kjetil Fjalestad (Equinor), Klironomos Antonis (Bosch Center for AI), Knut Erik Tollefsen (NIVA), Knut J. Vidvei (TechnipFMC), Lene Amundsen (Equinor), Marcel Castro (TechnipFMC), Marco Bertani-Økland (Computas), Per Eivind Solum (Schlumberger), Pål Rylandsholm (DNV), Raoul Wolf (NIVA/NGI), Robert Aasheim (Equinor), Rogerio Abreu de Paula (IBM Research), Siddi Wouters (TechnipFMC), Stian F. Nordby (TechnipFMC), Tore Hartvigsen (DNV), Vegard Sangolt (Equinor), Vilde Nyrønning Strøm (DNV), Zhuoxun Zheng (Bosch Center for AI), Peter Elisø Nielsen (Equinor), Knut Sebastian Tungland (Equinor), Jennifer Sampson (Equinor), Lars Hovind (IBM), Tore Notland (IBM).