Subsurface Data Analytics – Structured Data

Hierarchy-based Similarity Measures and Embeddings [PhD Project]

Supporting Machine Learning by Knowledge

Interpreting the subsurface to determine where hydrocarbons are located is a challenging task for explorationists. They need to be creative and come up with innovative ideas when defining and assessing new prospects, especially nowadays, when the large, easy-to-find fields have already been discovered. Prospect assessment is challenging because (1) the geodata is uncertain, intermittent, sparse, multi-resolution, and multi-scale, and (2) explorationists often limit themselves to assessing a few possible scenarios.

Recent advancements in computation, networking, and storage have led to numerous opportunities to improve these subsurface evaluation workflows. Furthermore, the volatility and uncertainty of the oil and gas industry have forced exploration and production companies to find improved and cost-effective solutions by automating these workflows.


Since machine learning predictions mostly rely on supervised learning methods, modeling such ML tasks requires large amounts of labeled data. However, many real-world applications of ML are in low-resource domains. The data in these domains is often sparse and scattered across many dimensions, such as reports, databases, logs, and unstructured files. Considering the cross-product of all these dimensions, it is often difficult to find a sufficient amount of high-quality labeled data to support advanced ML tasks.


Low-resource domains are inherently complex, and attempts to increase the amount of data in such domains are usually laborious, time-consuming, highly expensive, and may involve ethical issues [Zha+18]. For some events it is simply not possible to increase the sample size, for instance natural disasters: tsunamis, tornadoes, floods, volcanic eruptions, earthquakes, etc. Even when it is possible to increase the sample size, there is no guarantee that the dataset's in-sample variation will increase. Most variables in nature follow a power law, where some values occur frequently while others are rare; increasing the sample size would therefore mostly grow the data with the same set of values already present.


From training to model validation, data quality plays an important role and has a significant impact on the predictive power of the derived models. Many factors can lead to low data quality in low-resource domains. One crucial factor is feature sparsity, meaning that the number of active features in the training data is extremely low relative to the feature dimension; this causes a performance loss in many machine learning models [WB17]. Another critical issue is the complex relationships between different features: due to the complex processes behind the data, values occurring in one dimension can appear with all values in other dimensions, making it difficult to extract trends directly with statistical processing.

In addition to the issues mentioned above, another critical factor in low-resource domains is high-cardinality categorical variables: the number of unique categories occurring in these datasets is high. Analyzing high-cardinality categorical data poses its own set of challenges, as most ML models are designed only for numeric data. The inability to determine an order on the categories, together with the high cardinality, makes the problem much more challenging.
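The problem can be illustrated with the standard one-hot encoding of a categorical variable. In this minimal sketch (the category names are hypothetical examples), the encoding grows with the number of categories and places every pair of distinct values at exactly the same distance, so no notion of order or similarity survives:

```python
# Illustrative sketch: one-hot encoding a categorical variable produces
# sparse, metric-free vectors whose dimension equals the cardinality.

def one_hot(value, vocabulary):
    """Map a categorical value to a binary indicator vector."""
    return [1 if v == value else 0 for v in vocabulary]

# A variable with several unique categories (hypothetical rock types).
lithologies = ["sandstone", "shale", "limestone", "dolomite", "chalk"]

vec_a = one_hot("sandstone", lithologies)
vec_b = one_hot("shale", lithologies)
vec_c = one_hot("limestone", lithologies)

# Every pair of distinct categories is equally far apart: the encoding
# carries no information that some categories are more alike than others.
dist_ab = sum((x - y) ** 2 for x, y in zip(vec_a, vec_b))
dist_ac = sum((x - y) ** 2 for x, y in zip(vec_a, vec_c))
print(dist_ab, dist_ac)  # both 2
```

With thousands of unique categories, these vectors become both very high-dimensional and extremely sparse, which is exactly the feature-sparsity problem described above.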

Our Approach

Since it is not possible to increase the dataset size or quality in low-resource domains, we narrow our scope to handling categorical variables in such domains. Various works attempt to handle categorical data in purely data-based ways [DSP11; Esk+02; GB16]. Lacking other information, these methods rely on the occurrence frequency of categorical values. However, the similarity between two discrete concepts (categories) in a domain does not depend on dataset statistics such as frequency, size, or data dimensions. For instance, if we consider chemicals in a toxicology system, the similarity between organic and inorganic chemicals does not depend on their occurrence frequency; the similarity or dissimilarity between various sets of chemicals depends on their structural composition or specific properties. The scope of our work is to use knowledge resources outside the particular dataset to handle categorical data.
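The principle can be sketched on a toy mono-hierarchy of chemicals, echoing the example above. The measure shown is the classic Wu-Palmer score computed purely from the taxonomy; it is an illustration of the hierarchy-based idea, not the measure proposed in this project, and the taxonomy itself is made up:

```python
# A minimal sketch of a purely hierarchy-based similarity, assuming a toy
# mono-hierarchy given as child -> parent links. Similarity comes from the
# taxonomy alone, not from occurrence frequencies in any dataset.

parent = {
    "organic": "chemical",
    "inorganic": "chemical",
    "benzene": "organic",
    "ethanol": "organic",
    "ammonia": "inorganic",
}

def path_to_root(node):
    """Return the chain node, parent, ..., root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def depth(node):
    return len(path_to_root(node))  # the root has depth 1

def lca(a, b):
    """Lowest common ancestor of a and b in the hierarchy."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_a:
            return node

def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(LCA) / (depth(a) + depth(b))."""
    return 2 * depth(lca(a, b)) / (depth(a) + depth(b))

# Two organic chemicals come out closer than an organic/inorganic pair,
# regardless of how often each value occurs in the data.
print(wu_palmer("benzene", "ethanol"))  # 0.666...
print(wu_palmer("benzene", "ammonia"))  # 0.333...
```

Any such structural measure gives the same score for rare and frequent categories alike, which is precisely what frequency-based measures cannot do in a low-resource setting.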


The majority of technical domains have existing knowledge resources that keep domain information in a structured format. There is currently a line of research in various disciplines under the umbrella of AI that tries to incorporate structured domain knowledge into data-based AI approaches; typical examples are natural language processing and semantic technologies. In these systems, domain knowledge is leveraged to add information that is not available in the training dataset, with the end goal of increasing the performance of the ML system. However, in mainstream machine learning with tabular datasets, little work has been done in this area: the core methods for converting categorical data to numeric representations still lack semantics or background knowledge of the domain. In this PhD project, we aim to utilize the existing knowledge resources in complex domains to define semantic similarities between categorical values, and we use these similarities to define vector representations of categorical variables.

Research Questions

At a high level of abstraction, we attempt to answer the following main research questions in this PhD project:

Research Question 1: Is it possible to add existing domain knowledge to the machine learning process for low-resource domains?
Many technical domains now have knowledge sources that keep domain information in a structured format. This has led to an active research area that tries to integrate and utilize this knowledge efficiently, most notably in natural language processing and semantic technologies. Concretely, we attempt to answer the following research questions:

  • How can we use domain knowledge to provide additional information that is not directly available in the training data for low-resource domains?
  • What is the impact of adding domain knowledge in enhancing ML performance?


We augment the model by adding vector representation of categorical variables using domain knowledge and evaluate its impact in various real-life use-cases.

Research Question 2: How can we use existing domain knowledge to define the semantic similarity between different values of a categorical variable?
Numerical variables are naturally equipped with a metric, which is crucial for ML methods' ability to interpolate: by default, similar input leads to similar output. Equipping categorical variables with a notion of similarity or dissimilarity is therefore an important step towards using them in ML. The discrete nature of categorical data makes it difficult to quantify the similarity between two different categories statistically. Moreover, in complex domains the terms occurring in categorical data are often technical and domain-specific, and a simple statistical approach does not capture the actual similarity between them. We aim to map these values back to existing domain knowledge and devise a way to quantify similarity that reflects the domain's semantics.

Research Question 3: How can we define vector embeddings for single-valued and multi-valued categorical variables while preserving semantic similarity?
Most machine learning models require numeric input. Existing statistical models convert categories to numbers without considering domain information. We explore the idea of using semantic similarity based on domain knowledge for embedding categorical data, and we investigate strategies for handling both single-valued and multi-valued data.
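One standard way to turn a pairwise similarity matrix into vectors is a truncated eigendecomposition, so that dot products of the embeddings approximate the given similarities. The sketch below illustrates this generic construction on a made-up similarity matrix; it is not necessarily the embedding scheme developed in this project, and the category names are hypothetical:

```python
import numpy as np

# Sketch: deriving vector embeddings from a pairwise semantic-similarity
# matrix via a truncated eigendecomposition, so that dot products of the
# embeddings approximate the similarities.

categories = ["sandstone", "shale", "limestone"]  # hypothetical values

# Symmetric similarity matrix, e.g. produced by a hierarchy-based measure.
S = np.array([
    [1.0, 0.3, 0.6],
    [0.3, 1.0, 0.2],
    [0.6, 0.2, 1.0],
])

# Eigendecomposition of the symmetric similarity matrix.
eigvals, eigvecs = np.linalg.eigh(S)

# Keep the k largest (non-negative) eigenvalues.
k = 2
order = np.argsort(eigvals)[::-1][:k]
vals = np.clip(eigvals[order], 0.0, None)
embeddings = eigvecs[:, order] * np.sqrt(vals)  # shape (n_categories, k)

# Dot products of the embeddings approximate the original similarities.
approx = embeddings @ embeddings.T
print(np.round(approx, 2))
```

Once each category has such a vector, it can be fed to any standard ML model in place of the raw categorical value.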


The scientific contributions of this PhD project are listed below:


Contribution 1: We have investigated RQ1 for the important special case where the domain knowledge is a taxonomy. We use structured domain knowledge in three different domains: reservoir analogue identification in the oil and gas domain, mortality prediction in the biomedical domain, and word embeddings in natural language processing. The experimental results in these three use-cases show that the addition of domain knowledge in the form of hierarchy improves the performance of downstream machine learning tasks for these low-resource scenarios.


Contribution 2: As one example of a low-resource scenario, we consider a recommendation task in the oil and gas industry, based on a reservoir data set. We use a semantic similarity measure for categorical variables based on existing domain knowledge in hydrocarbon exploration. The domain knowledge is extracted in the form of a mono-hierarchy. The approach considers only the hierarchy to define the semantic similarity between two discrete categories and ignores the data. We compare the approach with existing data-based similarity measures (OF, Lin, etc.). The results verify that for the given low-resource use case, the purely hierarchy-based measure performs better than the existing data-based measures.


Contribution 3: We have performed a number of experiments to calculate semantic similarity by applying a variety of structural measures (using only the hierarchical structure) and information-content-based measures such as Lin and Resnik on different word pair similarity datasets in combination with the WordNet hierarchy. As part of the experiments, we introduce a new similarity measure based only on the poly-hierarchy. We have shown through experiments that given a large corpus and a good estimate of the IC, the IC-based measures perform well. Without sufficient data to estimate the IC well (the low-resource scenario), measures that do not use IC perform much better than the IC-based measures (Resnik and Lin). Our poly-hierarchy-based measure has performance comparable to, and in some cases superior to, existing poly-hierarchy measures.
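The IC-based measures mentioned above can be sketched on a toy hierarchy with made-up corpus counts. Here IC(c) = -log p(c), where p(c) is the probability of observing concept c or any of its descendants; Resnik scores a pair by the IC of the lowest common subsumer, and Lin normalises that by the ICs of the two concepts. The taxonomy and counts below are illustrative only:

```python
import math

# Sketch of the information-content-based measures Resnik and Lin on a toy
# hierarchy (child -> parent links) with hypothetical corpus counts.

parent = {"dog": "animal", "cat": "animal", "animal": "entity",
          "car": "vehicle", "vehicle": "entity"}
counts = {"dog": 40, "cat": 30, "animal": 5, "car": 20, "vehicle": 4, "entity": 1}

def ancestors(node):
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def subtree_count(node):
    """Corpus count of a concept plus all of its descendants."""
    return sum(c for n, c in counts.items() if node in ancestors(n))

TOTAL = subtree_count("entity")  # counts under the root

def ic(node):
    """Information content: -log p(concept or any descendant)."""
    return -math.log(subtree_count(node) / TOTAL)

def lcs(a, b):
    """Lowest common subsumer of a and b."""
    anc_a = ancestors(a)
    return next(n for n in ancestors(b) if n in anc_a)

def resnik(a, b):
    return ic(lcs(a, b))

def lin(a, b):
    return 2 * resnik(a, b) / (ic(a) + ic(b))

print(round(resnik("dog", "cat"), 3))  # IC of "animal"
print(round(lin("dog", "cat"), 3))
print(round(lin("dog", "car"), 3))     # 0.0: the LCS is the root
```

When the counts come from a small or unrepresentative corpus, the IC estimates degrade, which is why the purely structural measures are more robust in low-resource settings.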


Contribution 4: We have proposed two semantic-based encoding techniques for single-valued and multi-valued categorical data. We evaluate the proposed strategies on two ML tasks: clustering and prediction. For clustering, we consider the task of concept categorization in NLP. First, we provide a method for defining word embeddings that uses only the WordNet hierarchy and existing semantic similarity measures, both IC-based and purely hierarchy-based. A comparison with existing corpus-based methods (Word2Vec and GloVe) showed that the semantic-similarity-based embedding strategy achieves better results. The results are also consistent with Contribution 3: structure-based semantic similarity measures perform better when the IC is not accurate, and comparably otherwise. For the prediction task, we consider mortality prediction using patients' diagnosis data and domain knowledge about diagnoses. The results show improved performance compared to standard one-hot encoding.
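For the multi-valued case, one common baseline is to encode a record by averaging the embeddings of its individual values. The sketch below illustrates this on a patient's set of diagnosis codes; it is a standard baseline shown for intuition, not necessarily the encoding proposed in this project, and the two-dimensional vectors are made up:

```python
import numpy as np

# Illustrative sketch: encoding a multi-valued categorical variable
# (a patient's set of diagnosis codes) by averaging per-code embeddings.
# The vectors below are hypothetical; in practice they would come from a
# semantic-similarity-based embedding of the diagnosis hierarchy.

embedding = {
    "I10": np.array([0.9, 0.1]),  # a circulatory-system code
    "I25": np.array([0.8, 0.3]),  # another circulatory-system code
    "J45": np.array([0.1, 0.9]),  # a respiratory-system code
}

def encode_record(codes):
    """Average the embeddings of all codes observed for one record."""
    return np.mean([embedding[c] for c in codes], axis=0)

patient_a = encode_record(["I10", "I25"])  # two circulatory codes
patient_b = encode_record(["J45"])         # a respiratory code

# Records sharing semantically close codes end up close in the vector space.
print(patient_a, patient_b)
```

Unlike multi-hot encoding, two patients with different but semantically related diagnoses still receive nearby vectors, which is what lets the downstream classifier generalize across related codes.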




Frequency-Based vs. Knowledge-Based Similarity Measures for Categorical Data

A peer-reviewed workshop paper published in the Proceedings of the AAAI-MAKE Spring Symposium 2020: Combining Machine Learning and Knowledge Engineering in Practice.

Authors: Summaya Mumtaz & Martin Giese

Summary: This paper defines two new similarity measures that consider semantic information to calculate the similarity between two categorical values. Semantic information is represented in the form of a domain hierarchy or taxonomy. The first measure calculates semantic similarity by combining the data-driven approach with the hierarchy. The second measure ignores the data and uses only the hierarchy to calculate semantic similarity. We apply our methods to a specific complex data mining task in the oil and gas industry: reservoir analogue identification. The two proposed measures are compared to existing data-based measures.


Data-based Support for Petroleum Prospect Evaluation

A peer-reviewed journal article published in Earth Science Informatics, Volume 13, 2020.

Authors: Summaya Mumtaz, Irina Pene, Adnan Latif & Martin Giese 

Summary: This paper addresses the evaluation of the commercial viability of hydrocarbon prospects based on limited information and in limited time, and is an extension of the first paper. Purely data-driven machine learning approaches are first used to predict key reservoir parameters. Their poor prediction performance led us to investigate using the same data to produce a short list of potentially similar, well-explored reservoirs (known as analogues) that can support the prospect evaluation work of human geoscientists. The article takes up the approach of the first paper and demonstrates its applicability in this use case, where direct prediction of parameters is not feasible.


Hierarchy-based Semantic Embeddings for Single-valued & Multi-valued Categorical Variables

Submitted to the Journal of Intelligent Information Systems.

Authors: Summaya Mumtaz & Martin Giese

Summary: We define two embedding schemes for single-valued and multi-valued categorical data using semantic similarity based on poly-hierarchies. As part of our experiments, we propose a similarity measure that extends the previously proposed measure to also support poly-hierarchies and unbalanced hierarchies. First, we compare the performance of various similarity measures (information-content-based and structure-based) on word pair datasets using the WordNet hierarchy. Purely structural measures achieve stable performance, while information-content-based measures depend on the data quality. We then use the hierarchy and the various similarity measures to create word embeddings. Evaluation on a concept categorization task shows better performance compared to existing word embedding models (Word2Vec and GloVe). This evaluation is consistent with the results of the word pair task and shows that using only hierarchy-based measures, we can achieve good performance on the clustering task. We also compare semantic-based and standard binary encoding on the MIMIC mortality prediction use case. The results show a major improvement in downstream classification performance when semantic information is used.



Summaya Mumtaz [PhD Candidate], Martin Giese [Supervisor], Christos Dimitrakakis [Co-supervisor], Adnan Latif [Collaborator], Irina Pene [Collaborator]



This work was supported by the SIRIUS Centre for Scalable Data Access (Research Council of Norway, project 237889).

We would like to thank Jens Grimsgaard, Vegard Sangolt, and their teams at Equinor Norway for helping us access the oil and gas data and for discussions on the reservoir analogue use case, and IHS Markit for providing us the relevant data.

  • [DSP11] Desai, A., Singh, H., and Pudi, V. “DISC: Data-Intensive Similarity Measure for Categorical Data”. In: Advances in Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2011.
  • [Esk+02] Eskin, E. et al. “A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data”. In: Applications of Data Mining in Computer Security vol. 6 (Feb. 2002).
  • [GB16] Guo, C. and Berkhahn, F. “Entity Embeddings of Categorical Variables”. In: ArXiv vol. abs/1604.06737 (2016).
  • [Jia+19] Jia, Z. et al. “Using the distance between sets of hierarchical taxonomic clinical concepts to measure patient similarity”. In: BMC Medical Informatics and Decision Making vol. 19 (May 2019), p. 91.
  • [Myk+19] Myklebust, E. B. et al. “Knowledge Graph Embedding for Ecotoxicological Effect Prediction”. In: The Semantic Web – ISWC 2019. Springer International Publishing, 2019, pp. 490–506.
  • [PAC18] Portugal, I., Alencar, P., and Cowan, D. “The use of machine learning algorithms in recommender systems: A systematic review”. In: Expert Systems with Applications vol. 97 (2018), pp. 205–227.
  • [WW18] Wärnling, O. and Bissmark, J. “The Sparse Data Problem Within Classification Algorithms”. BA thesis. KTH Royal Institute of Technology, 2017.
  • [WB17] Wu, Y. and Wang, G. “Machine Learning Based Toxicity Prediction: From Chemical Structural Description to Transcriptome Analysis”. In: International Journal of Molecular Sciences vol. 19, no. 8 (2018).
  • [Zha+18] Zhang, L. et al. “Applications of Machine Learning Methods in Drug Toxicity Prediction”. In: Current Topics in Medicinal Chemistry vol. 18 (July 2018).