Natural Language Processing
Natural language processing (NLP) is a subfield of artificial intelligence, closely tied to linguistics, that deals with processing and understanding human speech and text. Written and spoken language often contains grammar and spelling mistakes, and words typically appear in different syntactic forms. Depending on the task we are trying to solve, we may choose to keep punctuation or ignore it. Preprocessing is commonly used to clean the text and extract known tokens, and words are often normalized to their root form via stemming.
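As an illustration, here is a minimal preprocessing sketch using NLTK (an assumption; any tokenizer and stemmer would do) that lowercases the text, tokenizes it, strips punctuation, and stems each word:

```python
# Minimal preprocessing sketch using NLTK (assumes `pip install nltk`
# and that the "punkt" tokenizer data can be downloaded).
import string

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models, downloaded once

def preprocess(text: str) -> list[str]:
    stemmer = PorterStemmer()
    tokens = word_tokenize(text.lower())                          # split into word tokens
    words = [t for t in tokens if t not in string.punctuation]    # drop punctuation tokens
    return [stemmer.stem(w) for w in words]                       # reduce words to root form

print(preprocess("The runners were running quickly."))
# e.g. ['the', 'runner', 'were', 'run', 'quickli']
```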
A key consideration is how words are represented in models. One option is one-hot encoding, but its weaknesses are many: the vector size keeps growing as the vocabulary grows, and the representation captures no meaning at all. Modern encoding typically uses word embeddings: fixed-size dense vectors trained over large volumes of text. After training on large corpora, these dense representations capture features that reveal interesting relationships between words, which can be leveraged for various NLP tasks.
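For concreteness, a small one-hot encoding sketch (plain NumPy, with a made-up toy vocabulary) shows both problems: the vectors are as wide as the vocabulary, and every pair of distinct words is equally dissimilar:

```python
# One-hot encoding sketch (toy vocabulary, plain NumPy).
import numpy as np

vocab = ["king", "queen", "man", "woman", "apple"]   # grows with every new word
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    vec = np.zeros(len(vocab))        # vector length == vocabulary size
    vec[index[word]] = 1.0
    return vec

# Every distinct pair has zero dot product, so no notion of similarity is captured.
print(one_hot("king") @ one_hot("queen"))  # 0.0
print(one_hot("king") @ one_hot("king"))   # 1.0
```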
A classic example: you can apply arithmetic operations on word embeddings (sketched below):
- King - Man + Woman ≈ Queen
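A sketch of this analogy using gensim with a pretrained word2vec model (the model name and download step are assumptions; any pretrained embeddings work the same way):

```python
# Word-vector arithmetic sketch using gensim (assumes `pip install gensim`;
# the first call downloads the pretrained word2vec-google-news-300 model).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

# king - man + woman: the nearest remaining vector is expected to be "queen".
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71...)]
```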
NLP Tasks
An interesting task in NLP is question answering, where a model is trained on passages and questions to predict the correct answer. Two of the most widely known benchmark datasets for reading-comprehension tasks are SQuAD and RACE, each containing thousands of passages and questions. SQuAD is built from Wikipedia articles, with answers given as spans of the passage, while RACE is a multiple-choice dataset whose passages span a wide variety of styles and topics. The RACE dataset is a comprehensive collection of challenging reading-comprehension questions taken from English exams in China, written by human experts to measure understanding and reasoning. At the time the RACE paper was published in 2017, there was a significant gap between state-of-the-art model performance (43%) and ceiling human performance (95%). On the RACE leaderboard, the ALBERT single-model entry recently achieved 91.4% accuracy, showing how fast and dramatic the improvement in model accuracy has been over the last two to three years.
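To get a feel for the task format, here is a small sketch that loads RACE through the Hugging Face datasets library (the dataset id, configuration name, and field names are assumptions based on the public hub version and may differ):

```python
# Sketch: inspecting a RACE example via Hugging Face datasets
# (assumes `pip install datasets`; dataset id, config, and field names
# follow the public hub version and may differ).
from datasets import load_dataset

race = load_dataset("race", "high")      # "high" = high-school exam subset
sample = race["train"][0]

print(sample["article"][:200], "...")    # the reading passage
print(sample["question"])                # the question about the passage
print(sample["options"])                 # the candidate answers
print(sample["answer"])                  # gold label, e.g. "A"/"B"/"C"/"D"
```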
Transformer Architecture
The transformer architecture consists of input word embeddings, an encoder, a decoder, and multi-head attention mechanisms. The attention mechanism helps address long-range dependencies in long text sequences.
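A minimal sketch of the scaled dot-product attention at the core of each attention head, written in PyTorch (the tensor shapes are illustrative assumptions):

```python
# Scaled dot-product attention sketch in PyTorch (illustrative shapes).
import math

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) projections of the input embeddings
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                  # attention over all positions
    return weights @ v                                   # weighted sum of the values

# Toy example: batch of 2 sequences, 5 tokens each, 8-dimensional heads.
q = torch.randn(2, 5, 8)
k = torch.randn(2, 5, 8)
v = torch.randn(2, 5, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 5, 8])
```

Because every position attends directly to every other position, information does not have to flow step by step through a recurrence, which is what helps with long-range dependencies.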