Low-Resource Natural Language Processing

Developing computational approaches that (1) reduce the need for large-scale annotated data to train NLP models from scratch and (2) adapt to new domains and languages with fewer labeled examples.
Today, some languages have billions of speakers, while many others have only a few thousand. As a result, most natural language processing (NLP) research focuses on resource-rich languages that offer large-scale data. Similarly, in several domains, such as the medical domain, collecting a sufficient amount of annotated data is not always feasible because it requires expert annotators. My primary research effort strives to design effective computational approaches that (1) reduce the need for large-scale annotated data to train NLP models from scratch and (2) adapt to new domains and languages with fewer labeled examples. The fundamental objectives of my research are learning universal language representations from data drawn from multiple sources, designing new learning objectives to bridge the gap between different learning signals, and developing flexible model architectures that enable cross-domain and cross-language transfer. Through my work, I address salient research questions such as: (1) how can domain- or language-agnostic representations be learned by leveraging resource-rich corpora? and (2) how can such representations be fine-tuned for low-resource NLP applications?
My prior work has yielded the following findings.
- A family of neural networks that is flexible in modeling word order (a language property) is more effective for cross-language transfer.
- Representations learned by leveraging multiple text classification corpora are advantageous for out-of-domain text classification tasks with fewer labeled examples.
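The second finding can be illustrated with a toy sketch. This is purely illustrative and not the actual method from my work: it uses synthetic linearly separable data and a linear encoder in place of real text corpora and a neural text encoder. A shared encoder is pretrained jointly on several source classification tasks, then frozen, and only a small task-specific head is fit on a handful of labeled target examples.

```python
# Toy sketch (illustrative only, synthetic data): pretrain a shared
# encoder on several source classification tasks, then fine-tune only a
# small head on a few labeled examples from a new task.
import numpy as np

rng = np.random.default_rng(0)

def make_task(n, dim, w_true):
    """Synthetic binary classification task: labels from a linear rule."""
    X = rng.normal(size=(n, dim))
    y = (X @ w_true > 0).astype(float)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

dim, hidden = 8, 3
# All tasks depend on the same low-dimensional subspace, so an encoder
# learned on the source tasks can transfer to a new task.
basis = rng.normal(size=(dim, hidden))
tasks = [make_task(200, dim, basis @ rng.normal(size=hidden))
         for _ in range(6)]

# --- Pretraining: learn encoder W and per-task heads jointly ---
W = rng.normal(scale=0.1, size=(dim, hidden))
heads = [rng.normal(scale=0.1, size=hidden) for _ in tasks]
lr = 0.1
for _ in range(300):
    for i, (X, y) in enumerate(tasks):
        h = X @ W                              # shared representation
        err = sigmoid(h @ heads[i]) - y        # logistic-loss gradient
        heads[i] = heads[i] - lr * h.T @ err / len(y)
        W = W - lr * X.T @ np.outer(err, heads[i]) / len(y)

# --- Low-resource fine-tuning: freeze W, fit a new head on 10 examples ---
w_target = basis @ rng.normal(size=hidden)     # unseen target task
X_few, y_few = make_task(10, dim, w_target)
head = np.zeros(hidden)
for _ in range(500):
    err = sigmoid(X_few @ W @ head) - y_few
    head = head - lr * (X_few @ W).T @ err / len(y_few)

X_test, y_test = make_task(500, dim, w_target)
acc = float(((sigmoid(X_test @ W @ head) > 0.5) == y_test).mean())
print(f"few-shot target accuracy: {acc:.2f}")
```

Because the encoder already captures the subspace shared across the source tasks, fitting only a small head on ten target examples is far easier than training a full model from scratch on that little data.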
I conduct interdisciplinary research and have developed models for biomedical NLP and information retrieval (IR) applications. For example, I developed models that represent users' latent search intents by modeling their search activities and exploiting supervised signals from companion retrieval tasks, requiring less data than state-of-the-art methods. In future work, I plan to investigate further how generic language representations can be learned by aggregating data across tasks and languages to facilitate low-resource applications.