Myklebust et al. 2019 introduced TERA within the use case of effect prediction. The sources used in the first version of TERA are NCBI (taxonomy; Sayers et al. 2008), ChEBI (chemicals; Hastings, Owen, et al. 2016), PubChem (chemical similarity; Kim, J. Chen, et al. 2018), and ECOTOX (effect data; Olker et al. 2022). In addition, the work used LogMap (Jiménez-Ruiz and Cuenca Grau 2011; Jiménez-Ruiz, Cuenca Grau, Zhou, et al. 2012) and Wikidata (Vrandecic and Krötzsch 2014) to align the disparate sources.
Myklebust et al. 2019 showed that the symbolic baseline model is not suited for the effect prediction problem, lagging 20% behind the second baseline in terms of F1-score (the harmonic mean of precision and recall). The embedding-based models improved slightly over the neural network baseline in terms of F1-score; however, larger improvements were seen for F2-score (where recall is weighted twice as heavily as precision), indicating that the embedding-based models are able to catch more lethal effects.
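Both scores are instances of the general F-beta measure, where beta controls the weight given to recall. As a quick illustration (not tied to the papers' implementation), a minimal computation:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall.
    beta=1 gives F1; beta=2 weights recall twice as heavily as precision (F2)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A recall-favoring model scores noticeably better on F2 than on F1:
p, r = 0.6, 0.9
print(round(f_beta(p, r, beta=1), 3))  # → 0.72
print(round(f_beta(p, r, beta=2), 3))  # → 0.818
```

This shows why a model that catches more lethal effects (higher recall) can look similar to the baseline on F1 yet clearly better on F2.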
Myklebust et al. 2022a extended TERA with further resources, namely the Encyclopedia of Life (EOL), MeSH, and ChEMBL, and made further use of ontology matching (OM) tools (adding AML for alignment) and Wikidata to align them more efficiently.
Myklebust et al. 2022a introduced a fine-tuning model architecture, which enabled them to initialize a model with pre-trained KGEs and fine-tune them to the task at hand.
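A minimal sketch of this kind of architecture: embedding lookup tables (initialized here randomly, but in practice loaded from a pre-trained KGE model) feed a small prediction head, and during fine-tuning gradients would flow into both the head and the embeddings. All names, dimensions, and initializations below are illustrative, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding tables; in practice these would be copied
# from a pre-trained KGE model (e.g. ComplEx) rather than sampled.
dim = 8
chem_emb = rng.normal(size=(100, dim))    # one row per chemical
species_emb = rng.normal(size=(50, dim))  # one row per species

# Small MLP prediction head on the concatenated pair embedding.
W1 = rng.normal(size=(2 * dim, 16)) * 0.1
W2 = rng.normal(size=(16,)) * 0.1

def predict_effect(chem_id, species_id):
    """Forward pass: look up both embeddings, concatenate, apply the head.
    Fine-tuning would update chem_emb/species_emb along with W1/W2."""
    x = np.concatenate([chem_emb[chem_id], species_emb[species_id]])
    h = np.maximum(0.0, x @ W1)         # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2)))  # sigmoid -> effect probability

print(predict_effect(3, 7))  # a probability in (0, 1)
```

The key design point is that the pre-trained embeddings are trainable parameters rather than frozen inputs, which is what allows the model to adapt the KGE space to the effect prediction task.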
Myklebust et al. 2022a found that, with parameter tuning, the (one-hot neural network) baseline performs on par with KGEs under the simplest validation strategy (unknown pairs). Moreover, the gap between the baseline and KGEs widened with increasing difficulty: unknown species, then unknown chemicals, and finally both unknown. They found that the fine-tuning architecture improved results in the most difficult settings (chemical and species unknown) by up to 10% over non-fine-tuned KGEs.
Finally, Myklebust et al. 2022a analyzed which KGEMs perform well across settings. They found that ComplEx performed well in all scenarios, albeit surpassed in a few cases.
Myklebust et al. 2022b moved from a classification problem to a regression one, predicting the chemical concentration causing 50% mortality. The importance of this is apparent as the number of quality (and trustworthy) data points is reduced tenfold compared to Myklebust et al. 2022a.
Myklebust et al. 2022b showed that introducing KGEs for regression effect prediction improved results by up to 40%, depending on the data splits, over a baseline similar to that of Myklebust et al. 2022a.
To emphasize the usefulness of KGEs in this task, Myklebust et al. 2022b developed qualitative and quantitative methods to gain insight into the predictions. First, they explored the entities, and their relations, that are similar to the prediction entities. This allows a domain expert to validate whether the model has sufficient knowledge to make an educated prediction. Second, two quantitative methods were developed, both based on the KG density in the neighborhood of the prediction entities.
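One way such a density-based measure could work, for instance, is to count the entities reachable within a few hops of a prediction entity: a sparse neighborhood suggests the KG gives the model little to go on. The following is a hypothetical sketch on a toy triple set, not the paper's actual method:

```python
from collections import defaultdict

# Toy KG as (subject, predicate, object) triples; all names are illustrative.
triples = [
    ("aldrin", "type", "organochlorine"),
    ("aldrin", "similarTo", "dieldrin"),
    ("dieldrin", "type", "organochlorine"),
    ("salmo_salar", "subClassOf", "salmo"),
]

# Undirected entity adjacency (predicates ignored for this density proxy).
neighbors = defaultdict(set)
for s, _, o in triples:
    neighbors[s].add(o)
    neighbors[o].add(s)

def neighborhood_size(entity, hops=2):
    """Count entities reachable within `hops` steps of `entity`:
    a simple proxy for KG density around a prediction entity."""
    frontier, seen = {entity}, {entity}
    for _ in range(hops):
        frontier = {n for e in frontier for n in neighbors[e]} - seen
        seen |= frontier
    return len(seen) - 1  # exclude the entity itself

print(neighborhood_size("aldrin"))       # → 2 (well connected)
print(neighborhood_size("salmo_salar"))  # → 1 (sparse neighborhood)
```

Under this kind of measure, predictions for entities in dense neighborhoods would be flagged as better supported by the KG than those in sparse ones.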