Publications
Frequency-Based vs. Knowledge-Based Similarity Measures for Categorical Data
A peer-reviewed workshop paper published in the Proceedings of the AAAI-MAKE Spring Symposium 2020: Combining Machine Learning and Knowledge Engineering in Practice.
Authors: Summaya Mumtaz & Martin Giese
Summary: This paper defines two new similarity measures that consider semantic information to calculate the similarity between two categorical values. Semantic information is represented in the form of a domain hierarchy or taxonomy. The first measure calculates semantic similarity by combining the data-driven approach with the hierarchy. The second measure ignores the data and uses only the hierarchy to calculate semantic similarity. We apply our methods to a specific complex data mining task in the oil and gas industry: reservoir analogue identification. The two proposed measures are compared to existing data-based measures.
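To illustrate what a purely hierarchy-based similarity can look like, here is a minimal sketch using the well-known Wu-Palmer measure on a toy taxonomy. This is not the measure defined in the paper; the lithology taxonomy, the `PARENT` mapping, and all function names are hypothetical and chosen only for illustration.

```python
# Toy taxonomy (hypothetical): child -> parent; the root has parent None.
PARENT = {
    "rock": None,
    "sedimentary": "rock",
    "igneous": "rock",
    "sandstone": "sedimentary",
    "limestone": "sedimentary",
    "granite": "igneous",
}


def path_to_root(node):
    """Return the list [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def lowest_common_ancestor(a, b):
    """Return the deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward from b, so the first hit is the deepest
        if node in ancestors_a:
            return node
    raise ValueError("values do not share a root")


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lca) / (depth(a) + depth(b))."""
    lca = lowest_common_ancestor(a, b)
    return 2 * depth(lca) / (depth(a) + depth(b))


print(wu_palmer("sandstone", "limestone"))  # siblings under "sedimentary"
print(wu_palmer("sandstone", "granite"))    # related only through the root
```

In this toy example, two values that share a deep common ancestor score higher than two values that meet only at the root, and no data frequencies are involved.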
Data-based Support for Petroleum Prospect Evaluation
A peer-reviewed journal article published in Earth Science Informatics, Volume 13, 2020.
Authors: Summaya Mumtaz, Irina Pene, Adnan Latif & Martin Giese
Summary: This paper addresses the evaluation of the commercial viability of hydrocarbon prospects from limited information and in limited time, and extends the first paper. Purely data-driven machine learning approaches are first used to predict key reservoir parameters. Their poor prediction performance leads us to investigate using the same data to produce a short list of potentially similar, well-explored reservoirs (known as analogues) that can support the prospect evaluation work of human geoscientists. This article takes up the approach of the first paper and demonstrates its applicability in this use case where the direct prediction of parameters
Hierarchy-based Semantic Embeddings for Single-valued & Multi-valued Categorical Variables
Submitted to the Journal of Intelligent Information Systems.
Authors: Summaya Mumtaz & Martin Giese
Summary: We define two embedding schemes for single-valued and multi-valued categorical data using semantic similarity based on poly-hierarchies. As part of our experiments, we extend our previously proposed similarity measure to also support poly-hierarchies and unbalanced hierarchies. First, we compare the performance of various similarity measures (information-content-based and structure-based) on word pair datasets using the WordNet hierarchy. Purely structural measures achieve stable performance, while information-content-based measures depend on the data quality. We then create word embeddings from the hierarchy using the various similarity measures. Evaluation on a concept categorization task shows better performance than existing word embedding models (Google and GloVe). This evaluation is consistent with the results of the word pair task and shows that hierarchy-based measures alone can achieve good performance on the clustering task. We also compare semantic-based encoding with standard binary encoding on the MIMIC mortality prediction use case. The results show a major improvement in the performance of downstream classification tasks when semantic information is used.
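As a rough illustration of the general idea, the sketch below embeds categorical values from a small poly-hierarchy by using each value's similarities to all values as its vector, and embeds a multi-valued entry as the mean of its members' vectors. These choices are assumptions made for illustration and are not the schemes or the similarity measure from the paper; the taxonomy, the Jaccard-over-ancestors similarity, and all names are hypothetical.

```python
# Toy poly-hierarchy (hypothetical): node -> list of parents.
PARENTS = {
    "disease": [],
    "cardiac": ["disease"],
    "infectious": ["disease"],
    "endocarditis": ["cardiac", "infectious"],  # a node with two parents
    "arrhythmia": ["cardiac"],
    "sepsis": ["infectious"],
}
VALUES = sorted(PARENTS)  # fixed order of embedding dimensions


def ancestors(node):
    """Return the node together with all of its ancestors along every parent path."""
    seen = {node}
    stack = [node]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def similarity(a, b):
    """Jaccard overlap of ancestor sets (an assumed, purely structural measure)."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def embed_single(value):
    """Embed one value as its similarity to every value in VALUES."""
    return [similarity(value, v) for v in VALUES]


def embed_multi(values):
    """Embed a non-empty set of values as the element-wise mean of their vectors."""
    vectors = [embed_single(v) for v in values]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(embed_single("endocarditis"))
print(embed_multi(["arrhythmia", "sepsis"]))
```

Unlike binary (one-hot) encoding, where all distinct values are equally far apart, vectors built this way place values that share ancestors closer together.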