Subsurface Data Analytics – Unstructured Data (Documents)

Low-Resource Adaptation of Neural NLP Models [PhD Project]

There is a growing interest in real-world applications of natural language processing (NLP) for extracting, summarizing, and analyzing textual data. While NLP methods have led to many breakthroughs in practical applications, most notably perhaps in machine translation, question answering, and natural language inference, it is still challenging to use NLP in many real-world scenarios. Since NLP relies heavily on supervised machine learning, the modeling of most NLP tasks requires large amounts of annotated data. These resources are often based on language data available in large quantities, such as English newswire. However, in NLP’s real-world applications, the textual resources may vary across several dimensions, such as language, dialect, topic, and genre. Considering the cross-product of these dimensions, it is difficult to find annotated data of sufficient quantity and quality that spans all possible combinations and can support current advanced NLP techniques (Plank, 2016).


In general, NLP application scenarios can be classified into three categories according to their data resources (Duong, 2017): (i) high- or rich-resource settings, where a large amount of annotated data is available; (ii) low-resource or resource-poor settings, where there is limited annotated data; and (iii) zero-resource settings, where no annotated data is available in the target context. Off-the-shelf resource-intensive NLP techniques tend to perform poorly where annotated data are not readily available (i.e., low-resource and zero-resource settings). An immediate solution is to create annotated data representative of new target scenarios. However, collecting and annotating corpora for each new variety requires experts and is usually expensive. It is therefore necessary to find techniques that alleviate the cost of creating training sets.

Our primary motivation in this thesis is based on the following argument in Plank (2016):

“If we embrace the variety of this heterogeneous data by combining it with proper algorithms, in addition to including text covariates/latent factors, we will not only produce more robust models, but will also enable adaptive language technology capable of addressing natural language variation.”

Our Approach

NLP for low-resource settings has recently received much attention, with dedicated workshops on the topic (Haffari et al., 2018; Cherry et al., 2019). Most previous work associates the low-resource property with the language dimension (King, 2015; Tsvetkov, 2016; Duong, 2017; Kann et al., 2019). In this work, we follow Plank (2016) and consider the low-resource setting as fundamentally multi-dimensional, spanning all kinds of variability within natural language, e.g., language, dialect, domain, and genre. The scope of this thesis is therefore broader: we explore how to adapt and improve the performance of NLP algorithms in a number of different low-resource settings, spanning different domains, genres, and languages, and dealing with a number of central NLP tasks. We make a distinction between domain, genre, and language. We use the term domain when the source dataset differs in terms of topic, i.e., the general subject of a document, which can range from very broad to more specific, such as oil and gas, biomedical, or e-commerce (Thesis: Chapters 3, 4, and 5). We use the term genre when the source dataset covers non-topical text properties such as function, style, and text type (Thesis: Chapter 6).


A number of approaches have been proposed to address the challenge of low-resource scenarios. They have significantly improved upon the state-of-the-art on a wide range of NLP tasks for various settings. In this thesis, we make use of adaptation techniques that fall into the following main paradigms:


(i) Distant Supervision: A supervised learning paradigm where the training data is not manually annotated, but automatically generated using knowledge bases (KBs) and heuristics (Mintz et al., 2009).

(ii) Transfer Learning: Techniques for leveraging data from additional domains, tasks or languages to train a model with better generalization properties (Ruder et al., 2019).
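The first paradigm can be illustrated with a minimal sketch of Mintz-style distant supervision for relation extraction. The KB triples, entity names, and sentences below are invented for the example; a real system would match against a large knowledge base and a full corpus.

```python
# Distant supervision sketch: a KB triple (head, relation, tail) is used to
# automatically label every sentence that mentions both of its entities --
# a heuristic that yields noisy but cheap training data.

KB = [  # hypothetical knowledge-base triples
    ("Statfjord", "located_in", "North Sea"),
    ("Equinor", "operator_of", "Statfjord"),
]

def distant_label(sentence, kb):
    """Return the relation labels the KB licenses for this sentence."""
    return [(h, r, t) for (h, r, t) in kb if h in sentence and t in sentence]

sentences = [
    "The Statfjord field lies in the North Sea.",
    "Equinor announced new drilling targets.",
]
auto_labeled = [(s, distant_label(s, KB)) for s in sentences]
```

Note that the second sentence receives no label: the heuristic only fires when both entities of a triple co-occur, which is exactly why the resulting data is incomplete as well as noisy.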


Real-world applications of NLP typically incorporate a number of more specialized, task-specific systems, e.g., pre-processing, various types of syntactic or semantic analysis, inference, etc. Here we focus mainly on NLP tasks from the areas of Information Extraction (specifically Named Entity Recognition and Relation Extraction) and Natural Language Understanding (more specifically Natural Language Inference and Question-Answering).

Research Questions

At a high level of abstraction, we attempt to answer the following main research questions in this PhD project:

Research Question I. What is the impact of different input representations in neural low-resource NLP?

The vector representations of tokens instantiate the distributional hypothesis by learning representations of the meaning of words, called embeddings, directly from text corpora. These representations are crucial elements in the performance of downstream NLP systems and underlie the more powerful and more recent contextualized word representations. We here study input representations trained on data from specific domains using sequential transfer learning of word embeddings. Concretely, we attempt to answer the following research questions:

(i) Can word embedding models capture domain-specific semantic relations even when trained with a considerably smaller corpus size?

(ii) Are domain-specific input representations beneficial in downstream NLP tasks?

In order to address these questions, we study input representations trained on data from a low-resource domain (Oil and Gas). We conduct intrinsic and extrinsic evaluations of both general and domain-specific embeddings. Further, we investigate the effect of domain-specific word embeddings in the input layer of a downstream sentence classification task in the same domain. Domain-specific embeddings are further studied in the context of the relation extraction task on data from an unrelated genre and domain: scientific literature from the NLP domain.
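The contrast probed by such an intrinsic evaluation can be sketched with cosine similarity over toy vectors. The words and 4-dimensional vectors below are invented stand-ins for real trained embeddings, not output of the thesis models.

```python
import math

# Toy intrinsic evaluation: does a (hypothetical) domain-specific embedding
# space rank an in-domain neighbour of "reservoir" above an unrelated word?
domain_vecs = {
    "reservoir": [0.9, 0.1, 0.0, 0.2],
    "wellbore":  [0.8, 0.2, 0.1, 0.3],
    "election":  [0.0, 0.9, 0.8, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

in_domain = cosine(domain_vecs["reservoir"], domain_vecs["wellbore"])
out_domain = cosine(domain_vecs["reservoir"], domain_vecs["election"])
```

A domain-specific model trained even on a modest corpus should place technical terms like "reservoir" and "wellbore" close together; a general-purpose model often does not, which is what the word-analogy and similarity test sets in the intrinsic evaluation measure at scale.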

In many NLP tasks, syntactic information is viewed as useful, and a variety of new approaches incorporate syntactic information in their underlying models. Within the context of this thesis, we hypothesize that syntax may provide a level of abstraction that can be beneficial when there is little available labeled data. We pursue this line of research particularly for low-resource relation extraction, and we look at the following question:

(iii) What is the impact of syntactic dependency representations in low-resource neural relation extraction?

We design a neural architecture over dependency paths combined with domain-specific word embeddings to extract and classify semantic relations in a low-resource domain. We explore the use of different syntactic dependency representations in a neural model and compare various dependency schemes. We further compare with a syntax-agnostic approach and perform an error analysis to gain a better understanding of the results.
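The shortest dependency path (SDP) that such an architecture operates over can be sketched as a breadth-first search over the undirected dependency tree. The sentence and head indices below are a made-up parse for illustration, not output of a real parser.

```python
from collections import deque

# Hypothetical parse: heads[i] is the head token of token i (0 = root).
tokens = ["ROOT", "pressure", "measured", "in", "the", "wellbore"]
heads = [0, 2, 0, 2, 5, 3]

def shortest_dep_path(heads, a, b):
    """BFS from token a to token b over the undirected dependency tree."""
    adj = {i: set() for i in range(len(heads))}
    for i, h in enumerate(heads):
        if i != h:          # skip the root's self-loop
            adj[i].add(h)
            adj[h].add(i)
    prev = {a: None}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            break
        for nxt in adj[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    path, node = [], b      # reconstruct the path back to a
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

sdp = [tokens[i] for i in shortest_dep_path(heads, 1, 5)]
```

The resulting path ("pressure", "measured", "in", "wellbore") drops the determiner and keeps only the tokens linking the two entity mentions, which is the level of abstraction the hypothesis above appeals to.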


Research Question II. How can we incorporate domain knowledge in low-resource NLP?

Technical domains often have knowledge resources that encode domain knowledge in a structured format. There is currently a line of research that tries to incorporate this knowledge encoded in domain resources in NLP systems. The domain knowledge can be leveraged either to provide weak supervision or to include additional information not available in text corpora to improve the model’s performance. Here, we explore this line of research in low-resource scenarios by addressing the following questions:

  • How can we take advantage of existing domain-specific knowledge resources to enhance our models?

We investigate the impact of domain knowledge resources in enhancing embedding models. We augment the domain-specific model by providing vector representations for infrequent and unseen technical terms using a domain knowledge resource and evaluate its impact by intrinsic and extrinsic evaluations.
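One simple form of this augmentation can be sketched as a back-off: an unseen technical term receives the average of the vectors of its related terms listed in a domain glossary. The embeddings and the glossary entry below are invented for the example.

```python
# Sketch: back off to a knowledge resource for out-of-vocabulary terms.
embeddings = {  # hypothetical trained vectors
    "drilling": [0.6, 0.2, 0.1],
    "borehole": [0.5, 0.3, 0.2],
}
glossary = {"spudding": ["drilling", "borehole"]}  # invented glossary entry

def backoff_vector(term, embeddings, glossary):
    """Return the term's vector, averaging related glossary terms if unseen."""
    if term in embeddings:
        return embeddings[term]
    related = [embeddings[t] for t in glossary.get(term, []) if t in embeddings]
    if not related:
        return None  # no evidence at all: stay silent rather than guess
    return [sum(dim) / len(related) for dim in zip(*related)]

vec = backoff_vector("spudding", embeddings, glossary)
```

The averaged vector places the unseen term near its glossary neighbours, which is the intended effect: infrequent technical vocabulary inherits a plausible position in the embedding space from the knowledge resource.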

Given the availability of domain-specific knowledge resources, distant supervision can be applied to generate automatically labeled training data in low-resource domains. In this thesis, we explore the use of distant supervision for low-resource Named Entity Recognition (NER) in various domains and languages. We here address the following question:

  • How can we address the problem of low-resource NER using distantly supervised data?

The outcome of distant supervision, however, is often noisy. To address this issue, we explore the following research question:

  • How can we exploit a reinforcement learning approach to improve NER in low-resource scenarios?

We present a system which addresses the problem of noisy, distantly supervised data using reinforcement learning and partial annotation learning.
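The partial-annotation idea can be illustrated with a minimal gazetteer-matching labeler. The entries are invented; the point is that tokens the gazetteer does not cover are left unlabeled ("?") rather than forced to the non-entity tag "O", since an incomplete gazetteer's silence is not evidence of a non-entity. (The reinforcement-learning instance selector that then filters the noisy matches is beyond this sketch.)

```python
# Distantly supervised NER with partial annotation: gazetteer hits get
# entity tags, everything else is explicitly marked "unknown" ("?").
gazetteer = {"Ekofisk": "FIELD", "Norway": "LOC"}  # invented entries

def distant_ner(tokens, gazetteer):
    """Tag each token from the gazetteer, leaving misses unlabeled."""
    return [(tok, gazetteer.get(tok, "?")) for tok in tokens]

tagged = distant_ner(["Ekofisk", "lies", "offshore", "Norway"], gazetteer)
```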


Research Question III. How can we address the challenges of low-resource scenarios using transfer learning techniques?

Transfer learning has yielded significant improvements in various NLP tasks. The most dominant practice of transfer learning is to pre-train embedding representations on a large unlabeled text corpus and then to transfer these representations to a supervised target task using labeled data. We explore this idea, namely sequential transfer learning of word embeddings, in the first research question (i.e., RQ I above).

Further, we consider the transfer of models between two linguistic variants such as genre and language, when little (i.e., low-resource) or no data (i.e., zero-resource) is available for a target genre or language. We study this challenging setup in two natural language understanding tasks using meta-learning. Accordingly, we investigate the following research questions:

  • Can meta-learning assist us in coping with low-resource settings in natural language understanding (NLU) tasks?
  • What is the impact of meta-learning on the performance of pre-trained language models in cross-lingual NLU tasks?
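The inner/outer optimization loop at the heart of such meta-learning can be sketched with a first-order MAML variant on invented one-parameter regression tasks (y = a·x, one point per task). This is a toy simplification for illustration, not the models or algorithm variant used in the thesis experiments.

```python
# First-order MAML sketch: each "task" is a slope a with one training
# point (x=1, y=a). Inner loop: one gradient step per task from the shared
# initialisation w. Outer loop: move w toward an initialisation that
# adapts quickly to any of the tasks.

def grad(w, a):
    # d/dw of the squared error (w * 1 - a) ** 2 on the point (x=1, y=a)
    return 2.0 * (w - a)

def fomaml(w0, task_slopes, inner_lr=0.1, meta_lr=0.05, steps=200):
    w = w0
    for _ in range(steps):
        meta_grads = []
        for a in task_slopes:
            w_adapted = w - inner_lr * grad(w, a)   # inner (task) adaptation
            meta_grads.append(grad(w_adapted, a))   # first-order outer gradient
        w = w - meta_lr * sum(meta_grads) / len(meta_grads)
    return w

w_meta = fomaml(0.0, [1.0, 2.0, 3.0])  # converges near the task mean, 2.0
```

From the meta-learned initialisation, a single inner step on a new task already moves the parameter most of the way to that task's optimum, which is the behaviour we hope transfers to low-resource target languages and genres.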




Publications

  • Nooralahzadeh, Farhad; Øvrelid, Lilja and Lønning, Jan Tore (2018). “Evaluation of Domain-specific Word Embeddings using Knowledge Resources.” In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA).
  • Nooralahzadeh, Farhad; Øvrelid, Lilja and Lønning, Jan Tore (2018). “SIRIUS-LTG-UiO at SemEval-2018 Task 7: Convolutional Neural Networks with Shortest Dependency Paths for Semantic Relation Extraction and Classification in Scientific Papers.” In: Proceedings of the 12th International Workshop on Semantic Evaluation. Association for Computational Linguistics.
  • Nooralahzadeh, Farhad and Øvrelid, Lilja (2018). “Syntactic Dependency Representations in Neural Relation Classification.” In: Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP. Association for Computational Linguistics.
  • Nooralahzadeh, Farhad; Lønning, Jan Tore and Øvrelid, Lilja (2019). “Reinforcement-based denoising of distantly supervised NER with partial annotation.” In: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019). Association for Computational Linguistics.
  • Nooralahzadeh, Farhad; Bekoulis, Giannis; Bjerva, Johannes; and Augenstein, Isabelle (2020). “Zero-shot cross-lingual transfer with meta learning.” In: CoRR vol. abs/2003.02739.


Farhad Nooralahzadeh [PhD Candidate], Lilja Øvrelid [Supervisor], Jan Tore Lønning [Supervisor]



This work was supported by the SIRIUS Centre for Scalable Data Access (Research Council of Norway, project 237889).

We would like to thank Jens Grimsgaard (Equinor), Peter Elisø Nielsen (Equinor), Knut Sebastian Tungland (Equinor), Jennifer Sampson (Equinor), Frode Myren (IBM), Lars Hovind (IBM), Tore Notland (IBM), and Per Eivind Solum (Schlumberger) for the discussions and feedback on this work.
References

  • Plank, B. (2016). “What to do about non-standard (or non-canonical) language in NLP”. In: CoRR vol. abs/1608.07836.
  • Duong, L. (2017). “Natural language processing for resource-poor languages”. PhD thesis. University of Melbourne.
  • Haffari, R., Cherry, C., Foster, G., Khadivi, S., and Salehi, B. (2018). Proceedings of the Workshop on Deep Learning Approaches for Low-Resource NLP.
  • Cherry, C., Durrett, G., Foster, G., Haffari, R., Khadivi, S., Peng, N., Ren, X., and Swayamdipta, S. (2019). Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019). Association for Computational Linguistics, Hong Kong, China.
  • King, B. P. (2015). “Practical Natural Language Processing for Low-Resource Languages”. PhD thesis. University of Michigan.
  • Tsvetkov, Y. (2016). “Linguistic Knowledge in Data-Driven Natural Language Processing”. PhD thesis. Carnegie Mellon University.
  • Kann, K., Cho, K., and Bowman, S. R. (2019). “Towards Realistic Practices In Low-Resource Natural Language Processing: The Development Set”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, pp. 3342–3349.
  • Mintz, M., Bills, S., Snow, R., and Jurafsky, D. (2009). “Distant Supervision for Relation Extraction Without Labeled Data”. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 – Volume 2. Association for Computational Linguistics, pp. 1003–1011.
  • Ruder, S., Peters, M. E., Swayamdipta, S., and Wolf, T. (2019). “Transfer Learning in Natural Language Processing”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials. Association for Computational Linguistics, pp. 15–18.