Semantic Support for Water Applications in NIVA

Ecotoxicological Effect Prediction using a Tailored Knowledge Graph [PhD Project]

There is great potential in background knowledge to inform prediction tasks. Relevant data sources might be disparate and require extensive work to unify. This work unified several data sources relevant to ecotoxicology and used these as background knowledge to solve a prediction task; namely, biological effect prediction.

Challenges

The large, disparate data sources tackled in this thesis need to be unified, and using Semantic Web Technologies (SWT) and knowledge graphs (KGs) is an increasingly popular way of doing so. Automated tools can partially handle the integration; however, manual curation is still required to increase coverage and exactness.

Artificial intelligence (AI) and its subfield machine learning (ML) are large, fast-moving research fields. However, the majority of research is conducted on benchmark datasets, and applications on real data are limited. This thesis provides several real-world applications within the biological effect prediction domain based on existing real, noisy data. The use of integrated data sources and SWT in ML models has the potential to improve predictions on these noisy data.

Our Approach

Hypothesis

Based on this motivation, we formulate a hypothesis that encapsulates the objectives of the thesis.

The integration of disparate data sources in ecotoxicological research will aid data access and a prediction task, namely biological effect prediction, through the use of knowledge graph embedding models.

The majority of technical domains have existing knowledge resources that keep domain information in a structured format. There is currently a line of research in various disciplines under the umbrella of AI that tries to incorporate structured domain knowledge into data-based AI approaches. Typical examples are natural language processing and semantic technologies. In these systems, domain knowledge is leveraged to add additional information not available in the training dataset. The end goal is to increase the performance of the ML system. However, in mainstream machine learning with tabular datasets, little work is being done in this area. The core methods for converting the categorical data to numeric representations in ML still lack semantics or background knowledge of the domain. In this PhD project, we aim at utilizing the existing knowledge resources in complex domains to define semantic similarities between categorical data. We use the semantic similarity to define vector representations of categorical variables.
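The idea of deriving vector representations from semantic similarity can be illustrated with a minimal sketch. It assumes a toy taxonomy (the `PARENT` mapping and species names are hypothetical) and uses Wu-Palmer similarity as one possible similarity measure; the project itself relies on knowledge graph embeddings, so this is only an illustration of the general principle, not the method used in the thesis.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "organism": None,
    "fish": "organism",
    "crustacean": "organism",
    "salmon": "fish",
    "trout": "fish",
    "daphnia": "crustacean",
}


def ancestors(node):
    """Return the path from node up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(ancestors(node))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(ancestors(b))
    # First ancestor of a that is also an ancestor of b is the deepest shared one.
    lcs = next(n for n in ancestors(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def encode(category, vocabulary):
    """Represent a category by its similarity to every category in the vocabulary."""
    return [wu_palmer(category, other) for other in vocabulary]


SPECIES = ["salmon", "trout", "daphnia"]
for name in SPECIES:
    print(name, [round(v, 2) for v in encode(name, SPECIES)])
```

Unlike a one-hot encoding, where all categories are equally distant, the two fish species receive vectors that are closer to each other than to the crustacean.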

Objectives

Objective 1. Identify relevant sources within the domain. Not all sources used in current ecological risk assessment pipelines may be relevant in the described use case. Moreover, the use of SWT and Linked Open Data (LOD) makes the set of available data sources effectively unbounded, which is not practical in this use case.

Objective 2. Integrate the sources, including creating or gathering mappings between them. The use of existing mappings is preferred; however, these are not always available and mapping tools need to be applied.

Objective 3. Create several validation strategies for the prediction task. This is often an overlooked task and results are frequently skewed by improper validation strategies.

Objective 4. Apply knowledge graph embeddings (KGEs) within the prediction task to identify suitable knowledge graph embedding models (KGEMs). The use of KGEMs has exploded recently; however, the applications tested with these models are fairly limited, and exploring an extensive catalog of models is important for the validation of this work.

Objective 5. Develop a novel prediction model based on the idea of fine-tuning embeddings. In contrast to (most) other work with KGEMs, the prediction task presented in this thesis is an out-of-the-KG task; therefore, novel task-specific tuning methods can be explored.

Objective 6. Use the KG (and embeddings) properties to provide quantitative and qualitative explanations for predictions. Ecotoxicological predictions are usually made using few pieces of data (Chary, Boyer, and Burns 2021); therefore, these methods are necessary to increase confidence in the predictions. In cases where models are both black boxes and uncertain, methods to interpret these models need to be developed. The use of KGs to aid in this process is an emerging research field.

Contributions

The scientific contributions of this PhD project are listed below:

  • The Toxicological Effect and Risk Assessment Knowledge Graph (TERA) integrates the highly valued data sources in the ecotoxicological domain. This integration required a large amount of manual annotation and transformation. Moreover, the use of state-of-the-art alignment tools and external sources was essential. This KG serves as the backbone of the thesis and is essential for the subsequent contributions.
  • Popular KGEMs are evaluated in the ecotoxicological effect prediction ranking task with four sampling strategies representing different unknown aspects of the prediction.
  • A fine-tuning method is developed to work in conjunction with the standard KGEMs to tailor the KGEs to the prediction task.
  • The prediction task is expanded from (binary) ranking to the prediction of effect concentrations, which demonstrates that background knowledge can help extend the application domain of effect prediction models.
  • Methods to gain insight into the predictions are created based on both the symbolic and the latent representation of TERA. These methods can be interpreted by domain experts (during development) or end users of the prediction models.

Results

Publications

Myklebust et al. 2019 introduced TERA within the use case of effect prediction. The sources used in version one of TERA are NCBI (taxonomy; Sayers et al. 2008), ChEBI (chemicals; Hastings, Owen, et al. 2016), PubChem (chemical similarity; Kim, J. Chen, et al. 2018), and ECOTOX (effect data; Olker et al. 2022). In addition, the work used LogMap (Jiménez-Ruiz and Cuenca Grau 2011; Jiménez-Ruiz, Cuenca Grau, Zhou, et al. 2012) and Wikidata (Vrandecic and Krötzsch 2014) to align the disparate sources.

Myklebust et al. 2019 showed that the symbolic baseline model is not suited to the effect prediction problem, lagging 20% behind the second baseline in terms of F1-score (the harmonic mean of precision and recall). The embedding-based models improved slightly over the neural network baseline in terms of F1-score; however, we saw larger improvements in F2-score (where recall is weighted twice as heavily as precision), indicating that the embedding-based models are able to catch more lethal effects.
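For readers unfamiliar with the two metrics, both are instances of the general F-beta score. The sketch below uses made-up precision and recall values purely for illustration; they are not results from the publications.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Illustrative values only: a model with high recall but modest precision.
precision, recall = 0.5, 0.8
print("F1:", round(f_beta(precision, recall, beta=1), 3))
print("F2:", round(f_beta(precision, recall, beta=2), 3))
```

Because F2 rewards recall more strongly, a model that misses fewer true lethal effects gains more under F2 than under F1, even when its precision is unchanged.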

Myklebust et al. 2022a extended TERA with further resources (the Encyclopedia of Life (EOL), MeSH, and ChEMBL) and made further use of ontology matching (OM) tools (adding AML for alignment) and Wikidata to align them more efficiently.

Myklebust et al. 2022a introduced a fine-tuning model architecture that enabled them to initialize a model with pre-trained KGEs and fine-tune them to fit the task at hand.
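The principle of initializing from pre-trained embeddings and then updating them together with the prediction layer can be sketched as follows. This is a deliberately small stand-in, assuming a single logistic layer over concatenated chemical and species vectors trained with plain stochastic gradient descent; the function names, data, and embedding values are hypothetical, and the architecture in the publication is more elaborate.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(w, b, chem, spec, c, s):
    """Probability of an effect for chemical c on species s."""
    x = chem[c] + spec[s]  # concatenate the two embedding vectors
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fine_tune(chem_emb, spec_emb, data, lr=0.1, epochs=200, freeze=False):
    """Train a logistic layer on [chemical; species] embeddings.

    chem_emb / spec_emb: dict name -> list of floats (e.g. pre-trained KGEs).
    data: list of (chemical, species, label) with label in {0, 1}.
    With freeze=False the embeddings are updated as well (fine-tuning).
    The inputs are copied, so the pre-trained vectors are left untouched.
    """
    chem = {k: list(v) for k, v in chem_emb.items()}
    spec = {k: list(v) for k, v in spec_emb.items()}
    dim_c = len(next(iter(chem.values())))
    dim_s = len(next(iter(spec.values())))
    w = [0.0] * (dim_c + dim_s)
    b = 0.0
    for _ in range(epochs):
        for c, s, y in data:
            x = chem[c] + spec[s]
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            grad_x = [g * wi for wi in w]  # input gradient, taken before w changes
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
            if not freeze:
                chem[c] = [v - lr * gx for v, gx in zip(chem[c], grad_x[:dim_c])]
                spec[s] = [v - lr * gx for v, gx in zip(spec[s], grad_x[dim_c:])]
    return w, b, chem, spec


# Hypothetical pre-trained vectors and labels, for illustration only.
pretrained_chem = {"c1": [0.1, 0.2], "c2": [-0.3, 0.4]}
pretrained_spec = {"s1": [0.5, -0.1]}
examples = [("c1", "s1", 1), ("c2", "s1", 0)]

w, b, chem, spec = fine_tune(pretrained_chem, pretrained_spec, examples)
print(predict(w, b, chem, spec, "c1", "s1"), predict(w, b, chem, spec, "c2", "s1"))
```

Setting `freeze=True` gives the non-fine-tuned variant, in which only the prediction layer learns and the embeddings keep their pre-trained values.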

Myklebust et al. 2022a found that, with parameter tuning, the (one-hot neural network) baseline performs on par with KGEs for the simplest validation strategy (unknown pairs). Moreover, the gap between the baseline and KGEs increased with increasing difficulty: first unknown species, then unknown chemicals, and finally both unknown. They found that the fine-tuning architecture improved results in the difficult setting (chemical and species both unknown) by up to 10% over non-fine-tuned KGEs.
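The four validation settings can be made concrete with a small splitting sketch. It assumes each record is a (chemical, species, label) triple with each pair occurring once, and that records mixing a held-out and a seen entity are discarded in the "both" setting; these choices, the function name, and the parameters are assumptions for illustration rather than the exact procedure of the publication.

```python
import random


def split(records, strategy, test_fraction=0.3, seed=0):
    """Split (chemical, species, label) records into train and test lists.

    strategy: "pairs"     - test pairs are unseen, but their entities may be seen
              "species"   - test species never occur in training
              "chemicals" - test chemicals never occur in training
              "both"      - test chemicals and test species are both unseen
    """
    if strategy not in ("pairs", "species", "chemicals", "both"):
        raise ValueError(f"unknown strategy: {strategy}")
    rng = random.Random(seed)

    if strategy == "pairs":
        shuffled = list(records)
        rng.shuffle(shuffled)
        k = max(1, int(len(shuffled) * test_fraction))
        return shuffled[k:], shuffled[:k]

    def hold_out(items):
        k = max(1, int(len(items) * test_fraction))
        return set(rng.sample(items, k))

    chemicals = sorted({c for c, _, _ in records})
    species = sorted({s for _, s, _ in records})
    held_c = hold_out(chemicals) if strategy in ("chemicals", "both") else set()
    held_s = hold_out(species) if strategy in ("species", "both") else set()

    train, test = [], []
    for c, s, y in records:
        if strategy == "both":
            if c in held_c and s in held_s:
                test.append((c, s, y))
            elif c not in held_c and s not in held_s:
                train.append((c, s, y))
            # Records with exactly one held-out entity are dropped (assumption).
        elif c in held_c or s in held_s:
            test.append((c, s, y))
        else:
            train.append((c, s, y))
    return train, test


# Hypothetical records: every combination of four chemicals and four species.
records = [(f"c{i}", f"s{j}", (i + j) % 2) for i in range(4) for j in range(4)]
for name in ("pairs", "species", "chemicals", "both"):
    train, test = split(records, name)
    print(name, len(train), len(test))
```

The harder settings leave the model with no training examples for the held-out entities, which is where background knowledge from the KG is expected to matter most.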

Finally, Myklebust et al. 2022a analyzed which KGEMs perform well across all settings. They found that ComplEx performed well in all scenarios, although it was surpassed in a few cases.
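As background on ComplEx, its standard scoring function embeds entities and relations as complex vectors and scores a triple (h, r, t) as the real part of the trilinear product of h, r, and the complex conjugate of t. The sketch below implements that formula with Python's built-in complex numbers; the vectors are made-up values, not embeddings from TERA.

```python
def complex_score(head, relation, tail):
    """ComplEx score: Re(sum_i head_i * relation_i * conj(tail_i))."""
    return sum(h * r * t.conjugate() for h, r, t in zip(head, relation, tail)).real


# Made-up two-dimensional complex embeddings, for illustration only.
chemical = [1 + 2j, 0.5 - 1j]
organism = [2 - 1j, 1 + 1j]
symmetric_rel = [1 + 0j, 2 + 0j]  # purely real relation vector
antisymmetric_rel = [1j, 2j]      # purely imaginary relation vector

print(complex_score(chemical, symmetric_rel, organism))
print(complex_score(organism, symmetric_rel, chemical))
print(complex_score(chemical, antisymmetric_rel, organism))
print(complex_score(organism, antisymmetric_rel, chemical))
```

A purely real relation vector yields the same score in both directions, while a purely imaginary one flips the sign when head and tail are swapped, which is how the model can represent both symmetric and antisymmetric relations.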

Myklebust et al. 2022b moved from a classification problem to a regression problem, predicting the chemical concentrations that cause 50% mortality. The importance of this is apparent, as the number of quality (and trustworthy) data points is reduced tenfold compared with Myklebust et al. 2022a.

Myklebust et al. 2022b showed that introducing KGEs for regression-based effect prediction improved results by up to 40%, depending on the data splits, over a baseline similar to that in Myklebust et al. 2022a.

To emphasize the usefulness of KGEs in this task, Myklebust et al. 2022b developed qualitative and quantitative methods to gain insight. First, Myklebust et al. 2022b explored the entities, and their relations, that are similar to the prediction entities. This gives a domain expert the possibility to validate whether the model has sufficient knowledge to make an educated prediction. Second, two quantitative methods were developed, both based on the KG density in the neighborhood of the prediction entities.
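One way to read "KG density in the neighborhood of a prediction entity" is the number of triples per entity within a few hops. The sketch below follows that reading; the definition, function names, and toy triples are assumptions made for illustration, and the two measures in the publication may be defined differently.

```python
from collections import deque


def k_hop_neighborhood(triples, entity, k):
    """Entities reachable from `entity` in at most k hops, ignoring edge direction."""
    adjacency = {}
    for head, _, tail in triples:
        adjacency.setdefault(head, set()).add(tail)
        adjacency.setdefault(tail, set()).add(head)
    seen = {entity}
    frontier = deque([(entity, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return seen


def neighborhood_density(triples, entity, k=1):
    """Triples fully inside the k-hop neighborhood, divided by its entity count."""
    nodes = k_hop_neighborhood(triples, entity, k)
    inside = sum(1 for head, _, tail in triples if head in nodes and tail in nodes)
    return inside / len(nodes)


# Hypothetical toy KG: one well-connected chemical and one sparsely described one.
toy_triples = [
    ("chem_a", "similar_to", "chem_b"),
    ("chem_b", "similar_to", "chem_c"),
    ("chem_a", "subclass_of", "chem_c"),
    ("chem_d", "subclass_of", "chem_e"),
]
print(neighborhood_density(toy_triples, "chem_a"))
print(neighborhood_density(toy_triples, "chem_d"))
```

Under this reading, a low density around a prediction entity signals that the model has little background knowledge to draw on, so the corresponding prediction deserves less confidence.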

Team

SIRIUS:

Erik Bryhn Myklebust [PhD Candidate], Ernesto Jimenez-Ruiz [Supervisor], Jiaoyan Chen [Co-supervisor]

Partners

Acknowledgements

This work is supported by the grant 272414 from the Research Council of Norway (RCN), the MixRisk project (Research Council of Norway, project 268294), SIRIUS Centre for Scalable Data Access (Research Council of Norway, project 237889), Samsung Research UK, Siemens AG, and the EPSRC projects AnaLOG (EP/P025943/1), OASIS (EP/S032347/1), UK FIRES (EP/S019111/1) and the AIDA project (Alan Turing Institute).

References

  • Erik Bryhn Myklebust, Ernesto Jimenez-Ruiz, Jiaoyan Chen, Raoul Wolf, Knut Erik Tollefsen. 2019. Knowledge Graph Embedding for Ecotoxicological Effect Prediction. In: Ghidini C. et al. (eds) The Semantic Web – ISWC 2019. ISWC 2019. Lecture Notes in Computer Science, vol 11779. Springer, Cham
  • Erik B Myklebust, Ernesto Jiménez-Ruiz, Jiaoyan Chen, Raoul Wolf, Knut Erik Tollefsen. 2022a. Prediction of adverse biological effects of chemicals using knowledge graph embeddings. Semantic Web, vol. 13, no. 3, pp. 299-338, 2022
  • Erik Bryhn Myklebust, Ernesto Jimenez-Ruiz, Jiaoyan Chen, Raoul Wolf, Knut Erik Tollefsen. 2022b. Understanding Adverse Biological Effect Predictions Using Knowledge Graphs. 10.48550/ARXIV.2210.15985
  • Chary M, Boyer EW, Burns MM. Diagnosis of Acute Poisoning using explainable artificial intelligence. Comput Biol Med. 2021 Jul;134:104469. doi: 10.1016/j.compbiomed.2021.104469. Epub 2021 May 13. PMID: 34022488.
  • Jiménez-Ruiz, E. and Cuenca Grau, B. (2011). “LogMap: Logic-Based and Scalable Ontology Matching”. In: 10th International Semantic Web Conference, pp. 273–288.
  • Jiménez-Ruiz, E., Cuenca Grau, B., Zhou, Y., et al. (2012). “Large-scale Interactive Ontology Matching: Algorithms and Implementation”. In: the 20th European Conference on Artificial Intelligence (ECAI). Montpellier, France: IOS Press, pp. 444–449.
  • Sayers, E. W. et al. (Oct. 2008). “Database resources of the National Center for Biotechnology Information”. In: Nucleic Acids Research vol. 37, no. suppl_1, pp. D5–D15. issn: 0305-1048.
  • Hastings, J., Owen, G., et al. (2016). “ChEBI in 2016: Improved services and an expanding collection of metabolites”. In: Nucleic acids research vol. 44, no. D1, pp. 214–9.
  • Kim, S., Chen, J., et al. (Oct. 2018). “PubChem 2019 update: improved access to chemical data”. In: Nucleic Acids Research vol. 47, no. D1, pp. D1102–D1109.
  • Olker, J. H. et al. (2022). “The ECOTOXicology Knowledgebase: A Curated Database of Ecologically Relevant Toxicity Tests to Support Environmental Research and Risk Assessment”. In: Environmental Toxicology and Chemistry vol. 41, no. 6, pp. 1520–1539. doi: https://doi.org/10.1002/etc.5324. eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/etc.5324.