Cross-lingual Representation Learning for Natural Language Processing
Published in UCLA Electronic Theses and Dissertations, 2021
In the modern era of deep learning, developing natural language processing (NLP) systems requires large-scale annotated data. Unfortunately, most large-scale labeled datasets are available only in a handful of languages; for the vast majority of languages, few or no annotations are available to power automated NLP applications. Hence, one focus of cross-lingual NLP research is to develop computational approaches that leverage resource-rich language corpora and apply them to low-resource language applications via transferable representation learning. Cross-lingual representation learning has emerged as an indispensable ingredient for cross-lingual natural language understanding: it learns to embed notions, such as the meanings of words and how words combine to form concepts, in a shared representation space. In recent years, cross-lingual representation learning and transfer learning together have redefined low-resource NLP and enabled us to build models for a broad spectrum of languages.
This dissertation discusses the fundamental challenges of cross-lingual representation learning and proposes several approaches that (1) utilize universal syntactic dependencies to bridge the typological differences across languages and (2) effectively use unlabeled resources to learn robust and generalizable representations. The proposed approaches transfer effectively across a wide range of languages and NLP applications, including dependency parsing, named entity recognition, text classification, question answering, and more.