At a high level of abstraction, we attempt to answer the following main research questions in this PhD project:
Research Question I. What is the impact of different input representations in neural low-resource NLP?
Vector representations of tokens instantiate the distributional hypothesis: representations of word meaning, called embeddings, are learned directly from text corpora. These representations are crucial to the performance of downstream NLP systems and underlie the more recent, more powerful contextualized word representations. Here, we study input representations trained on data from specific domains using sequential transfer learning of word embeddings. Concretely, we attempt to answer the following research questions:
(i) Can word embedding models capture domain-specific semantic relations even when trained with a considerably smaller corpus size?
(ii) Are domain-specific input representations beneficial in downstream NLP tasks?
To address these questions, we study input representations trained on data from a low-resource domain (Oil and Gas). We conduct intrinsic and extrinsic evaluations of both general and domain-specific embeddings, and we investigate the effect of domain-specific word embeddings in the input layer of a downstream sentence classification task in the same domain. We also study domain-specific embeddings in the context of relation extraction on data from an unrelated genre and domain: scientific literature in the NLP domain.
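As a minimal illustration of placing domain-specific embeddings in the input layer of a downstream model, the sketch below averages pre-trained domain vectors into a fixed-size sentence representation. The tiny vocabulary and 3-dimensional vectors are invented stand-ins for embeddings trained on an Oil and Gas corpus:

```python
# Hypothetical domain-specific word vectors (in practice, pre-trained on a
# domain corpus and typically 100-300 dimensional).
domain_vectors = {
    "reservoir": [0.9, 0.1, 0.0],
    "porosity":  [0.8, 0.2, 0.1],
    "drilling":  [0.7, 0.0, 0.3],
}
DIM = 3

def embed_sentence(tokens):
    """Average the domain embeddings of the tokens (zero vector for OOV)."""
    vecs = [domain_vectors.get(t, [0.0] * DIM) for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

sent = embed_sentence(["reservoir", "porosity"])
print([round(x, 2) for x in sent])  # → [0.85, 0.15, 0.05]
```

The averaged vector would then feed a classifier; swapping `domain_vectors` for general-domain embeddings is what the intrinsic/extrinsic comparison above varies.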
In many NLP tasks, syntactic information is viewed as useful, and a variety of new approaches incorporate syntactic information in their underlying models. Within the context of this thesis, we hypothesize that syntax may provide a level of abstraction that can be beneficial when there is little available labeled data. We pursue this line of research particularly for low-resource relation extraction, and we look at the following question:
(iii) What is the impact of syntactic dependency representations in low-resource neural relation extraction?
We design a neural architecture over dependency paths, combined with domain-specific word embeddings, to extract and classify semantic relations in a low-resource domain. We explore different syntactic dependency representations in a neural model and compare various dependency schemes. We further compare against a syntax-agnostic approach and perform an error analysis to better understand the results.
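The dependency-path input to such a model can be sketched as a shortest-path search over the parse tree between the two entity tokens. The toy sentence and head indices below are assumed for illustration, not produced by a real parser:

```python
from collections import deque

# Toy parse of "drilling increases reservoir pressure":
# token index -> head index (0-based; -1 marks the root).
tokens = ["drilling", "increases", "reservoir", "pressure"]
heads  = [1, -1, 3, 1]

def dependency_path(heads, src, dst):
    """BFS over the undirected dependency graph; returns token indices on the path."""
    adj = {i: set() for i in range(len(heads))}
    for i, h in enumerate(heads):
        if h >= 0:
            adj[i].add(h)
            adj[h].add(i)
    queue, parent = deque([src]), {src: None}
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nb in adj[node]:
            if nb not in parent:
                parent[nb] = node
                queue.append(nb)
    return []

path = dependency_path(heads, 0, 2)
print([tokens[i] for i in path])  # → ['drilling', 'increases', 'pressure', 'reservoir']
```

The tokens (and dependency labels) along such a path are what a path-based relation classifier consumes, in contrast to a syntax-agnostic model over the raw token sequence.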
Research Question II. How can we incorporate domain knowledge in low-resource NLP?
Technical domains often have knowledge resources that encode domain knowledge in a structured format, and a growing line of research seeks to incorporate this knowledge into NLP systems. Domain knowledge can be leveraged either to provide weak supervision or to add information not available in text corpora, improving the model's performance. Here, we explore this line of research in low-resource scenarios by addressing the following questions:
- How can we take advantage of existing domain-specific knowledge resources to enhance our models?
We investigate the impact of domain knowledge resources on enhancing embedding models. We augment the domain-specific model by providing vector representations for infrequent and unseen technical terms using a domain knowledge resource, and we evaluate its impact through intrinsic and extrinsic evaluation.
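One simple way to realize this, sketched below with invented vectors and a hypothetical glossary entry, is to back off to the domain resource and average the embeddings of the in-vocabulary words in an unseen term's definition:

```python
# Invented 2-d embeddings and a hypothetical glossary entry for illustration.
embeddings = {
    "rock":    [1.0, 0.0],
    "fluid":   [0.0, 1.0],
    "measure": [0.5, 0.5],
}
glossary = {
    "permeability": ["measure", "rock", "fluid"],  # definition words for an OOV term
}

def vector_for(term):
    """Return the term's embedding, backing off to its glossary definition."""
    if term in embeddings:
        return embeddings[term]
    defn = [w for w in glossary.get(term, []) if w in embeddings]
    if not defn:
        return None  # no information available for this term
    dim = len(next(iter(embeddings.values())))
    return [sum(embeddings[w][d] for w in defn) / len(defn) for d in range(dim)]

print(vector_for("permeability"))  # → [0.5, 0.5]
```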
Given the availability of domain-specific knowledge resources, distant supervision can be applied to automatically generate labeled training data in low-resource domains. In this thesis, we explore the use of distant supervision for low-resource Named Entity Recognition (NER) in various domains and languages. Here, we address the following question:
- How can we address the problem of low-resource NER using distantly supervised data?
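At its simplest, distant supervision for NER amounts to projecting a knowledge resource onto raw text by string matching; the gazetteer entries and labels below are invented for illustration:

```python
# Hypothetical domain gazetteer mapping surface forms to entity types.
gazetteer = {"equinor": "ORG", "sverdrup": "FIELD"}

def distant_labels(tokens):
    """Label tokens by gazetteer lookup; everything unmatched gets 'O'."""
    return [gazetteer.get(t.lower(), "O") for t in tokens]

tokens = ["Equinor", "operates", "Sverdrup"]
print(distant_labels(tokens))  # → ['ORG', 'O', 'FIELD']
```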
The outcome of distant supervision, however, is often noisy. To address this issue, we explore the following research question:
- How can we exploit a reinforcement learning approach to improve NER in low-resource scenarios?
We present a system that addresses the problem of noisy, distantly supervised data using reinforcement learning and partial annotation learning.
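A bandit-style sketch of the reinforcement-learning idea: a REINFORCE-trained selection policy learns to keep instances whose distant labels look reliable and to discard the rest. The feature, reward, and data below are all invented; in practice the reward would come from the downstream tagger's performance rather than from oracle cleanliness flags as here:

```python
import math
import random

random.seed(1)
theta = 0.0  # single policy parameter for this toy selector
lr = 0.5

# Each instance: (feature correlated with label quality, is_actually_clean).
instances = [(1.0, True), (0.9, True), (-1.0, False), (-0.8, False)]

for _ in range(100):  # training epochs
    for x, clean in instances:
        p_keep = 1 / (1 + math.exp(-theta * x))
        if random.random() < p_keep:          # sample the "keep" action
            reward = 1.0 if clean else -1.0   # oracle reward for illustration only
            # REINFORCE update: theta += lr * reward * d/dtheta log p(keep)
            theta += lr * reward * (1 - p_keep) * x

p_clean = 1 / (1 + math.exp(-theta * 1.0))  # keep-probability, clean-looking instance
print(p_clean > 0.8)  # → True: the policy learns to keep reliable instances
```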
Research Question III. How can we address the challenges of low-resource scenarios using transfer learning techniques?
Transfer learning has yielded significant improvements in various NLP tasks. The dominant practice is to pre-train embedding representations on a large unlabeled text corpus and then transfer these representations to a supervised target task with labeled data. We explore this idea, namely sequential transfer learning of word embeddings, in the first research question (RQ I above).
Further, we consider the transfer of models across linguistic variants, such as genre and language, when little (low-resource) or no (zero-resource) data is available for the target genre or language. We study this challenging setup in two natural language understanding tasks using meta-learning. Accordingly, we investigate the following research questions:
- Can meta-learning assist us in coping with low-resource settings in natural language understanding (NLU) tasks?
- What is the impact of meta-learning on the performance of pre-trained language models in cross-lingual NLU tasks?
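To make the meta-learning idea concrete, the sketch below runs a first-order, Reptile-style update on invented 1-d regression tasks: an inner loop adapts to one task, and the meta step moves the shared initialization toward the adapted weights. All tasks and hyperparameters are assumptions made for illustration:

```python
tasks = [1.0, 2.0, 3.0]               # each "task": pull a scalar w toward a target
w = 0.0                               # shared meta-initialization
inner_lr, meta_lr, inner_steps = 0.1, 0.1, 5

for i in range(200):                  # meta-training loop over tasks
    target = tasks[i % len(tasks)]
    w_task = w
    for _ in range(inner_steps):      # inner loop: adapt to the current task
        grad = 2 * (w_task - target)  # gradient of the loss (w - target)**2
        w_task -= inner_lr * grad
    w += meta_lr * (w_task - w)       # meta step: move init toward adapted weights

print(round(w, 1))                    # → 2.0, near the mean of the task targets
```

The learned initialization sits where a few inner steps reach any of the tasks quickly, which is the property exploited when adapting pre-trained models to low-resource target languages or genres.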