Applying a data-driven approach means we use training data instead of defining rules, which distinguishes us from traditional chatbots.
How language modelling has changed NLP
This is the first of a series of four posts about the technology behind our products and what the future has in store for AI and natural language processing. Learn more in parts two, three, and four.
raffle.ai makes AI tools to realize our vision of giving employees or end-users seamless access to company information. We use machine learning for natural language processing (NLP) so that users can search with natural text, in the same way they would pose a question to another person.
Machine learning needs training data to work, and the more the better. Creating that data, for example by labeling historical queries (associating each one with its correct answer), is time-consuming and therefore expensive, and it delays the point at which your AI solution performs well enough to be deployed.
So it sounds like natural text search is out of reach for companies that don’t have the resources to make sufficient training data. But this is actually no longer the case.
Pre-train to make gains
NLP has made a lot of progress in recent years because of what we call pre-training. This has been a real game-changer for achieving good performance with a smaller investment. To explain pre-training we need to be a bit more specific about what we mean by training data when we’re discussing NLP:
- Unlabeled data. This can be text data that we collect from the internet or text available in companies. Practically unlimited unlabeled data is available, but we have to be careful with what we use because our model will learn from it.
- Labeled data. This is expensive data. At raffle.ai our labeled data consists of question-answer pairs. We therefore need a number of questions for each answer. The questions can come from historical query logs, be collected live and labeled in raffle Analytics, or even be constructed by our in-house AI trainers to start the model off at a reasonable level of performance.
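As a rough illustration of what labeled question-answer pairs might look like in code, here is a minimal sketch. The `LabeledQuestion` class, its field names, the `source` tags, and the example questions are all hypothetical and only meant to show the shape of the data, not how raffle.ai stores it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledQuestion:
    question: str   # the text a user typed
    answer_id: str  # identifier of the answer a human marked as correct
    source: str     # hypothetical tag: "query_log", "live", or "ai_trainer"


def questions_per_answer(examples):
    """Group question texts by the answer they are labeled with."""
    grouped = defaultdict(list)
    for example in examples:
        grouped[example.answer_id].append(example.question)
    return dict(grouped)


# Made-up examples: several differently worded questions can share one answer.
examples = [
    LabeledQuestion("How do I reset my password?", "reset-password", "query_log"),
    LabeledQuestion("I forgot my login, what now?", "reset-password", "live"),
    LabeledQuestion("Where can I see my invoice?", "view-invoice", "ai_trainer"),
]

grouped = questions_per_answer(examples)
for answer_id, questions in grouped.items():
    print(answer_id, len(questions))
```

Grouping by answer makes it easy to spot answers that have too few example questions to learn from.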
How to train your language model
Once trained, a language model can “understand” the meaning of sentences. Or, more precisely, if we take two sentences that carry the same meaning, then their representations will be similar.
This is a very good foundation for building other NLP applications such as a question-answering system, because we can now represent questions by what they mean rather than by the exact words used to ask them.
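One common way to make "similar representations" concrete is cosine similarity between vectors. The sketch below assumes each sentence has already been turned into a vector by some language model. The three-dimensional vectors here are invented by hand for illustration; a real model would produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made stand-ins for the vectors a language model would output.
representations = {
    "How do I change my password?": [0.9, 0.1, 0.3],
    "I need to reset my login code.": [0.8, 0.2, 0.4],
    "What are your opening hours?": [0.1, 0.9, 0.2],
}

query = representations["How do I change my password?"]
same_meaning = cosine_similarity(query, representations["I need to reset my login code."])
other_meaning = cosine_similarity(query, representations["What are your opening hours?"])

print(f"same meaning:  {same_meaning:.2f}")
print(f"other meaning: {other_meaning:.2f}")
```

With vectors like these, a question-answering system can match a new question to the closest known question even when the wording differs.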
So the NLP application recipe à la 2020 is to:
- Pre-train a language model with unlabeled data or — even better — get someone else to supply one for us
- Fine-tune on a small labeled dataset
But how do we leverage large unlabeled datasets to learn representations that capture the meaning of sentences? The key here is the context: a single word in a sentence gets part of its meaning from the surrounding text.
So if we train a model to predict a word given a context, such as the preceding words (“Josef walks his <fill in>”) or the surrounding words (“the cat <fill in> the mouse”), then the model is forced to learn a representation which is context-aware.
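The point that unlabeled text supplies its own training targets can be sketched with a tiny count-based predictor. This is a deliberately simple stand-in for a neural language model, not how BERT or any production model works; the corpus and the function names are made up for illustration.

```python
from collections import Counter, defaultdict

# A toy "unlabeled" corpus. No human labeling is needed: the next word
# in each sentence serves as the prediction target.
corpus = [
    "josef walks his dog",
    "anna walks her dog",
    "josef feeds his dog",
    "josef reads his book",
]


def train_next_word_counts(sentences, context_size=2):
    """Count which word follows each context of `context_size` words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(context_size, len(words)):
            context = tuple(words[i - context_size:i])
            counts[context][words[i]] += 1
    return counts


def predict_next(counts, context_words):
    """Return the most frequent continuation, or None for an unseen context."""
    context = tuple(context_words)
    if context not in counts:
        return None
    return counts[context].most_common(1)[0][0]


counts = train_next_word_counts(corpus)
prediction = predict_next(counts, ["walks", "his"])
print("josef walks his", prediction)
```

A lookup table like this cannot generalize to contexts it has never seen, which is exactly the limitation that learned, context-aware representations are meant to overcome.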
BERT and beyond
There are many language models on the market. An early famous one is word2vec. A fascinating property of its representations is that you can do approximate arithmetic with them, such as: “king” - “man” + “woman” ≈ “queen”.
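The arithmetic can be sketched with a few toy vectors. The vectors below are hand-crafted so that the dimensions loosely mean "royalty", "male", and "female". They are not real word2vec output: real vectors are learned from text and have hundreds of dimensions, and there the relation only holds approximately.

```python
import math

# Hand-crafted toy vectors: [royalty, male, female]. Purely illustrative.
vectors = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.1],
    "apple":  [0.0, 0.2, 0.2],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


result = analogy("king", "man", "woman", vectors)
print("king - man + woman is closest to:", result)
```

Excluding the three input words from the candidates is a common convention in analogy tests, because the result often lands close to one of the inputs.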
Today, the most popular one is BERT, which is short for Bidirectional Encoder Representations from Transformers. BERT is a masked language model, which means that the model’s task is to predict one or more words that have been masked out of the input. Given “the cat [MASK] the mouse,” for example, the model has to recover the hidden word from the words on both sides of it.
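How training examples for a masked language model can be produced from plain text is sketched below. This is a simplification: the function `mask_tokens` and its return format are made up for illustration, tokens are whole words rather than subword pieces, and every selected word is replaced by `[MASK]`. The default rate of 15% follows what was reported for BERT.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace a random subset of tokens with [MASK].

    Returns the masked token list and a dict mapping each masked
    position to the original token the model should predict.
    """
    if rng is None:
        rng = random.Random()
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


tokens = "the cat chased the mouse".split()
# A higher rate than the default so that a five-word example is likely
# to contain at least one mask.
masked, targets = mask_tokens(tokens, mask_prob=0.5, rng=random.Random(0))
print(" ".join(masked))
print(targets)
```

Because the targets are taken from the text itself, any amount of unlabeled text can be turned into training examples this way.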
As is often the case in deep learning, more data and larger models help performance. The original pre-trained BERT models are transformers with roughly 110 million parameters (BERT-base) and 340 million parameters (BERT-large), trained on English Wikipedia and a large corpus of books.
It sounds gigantic but it is actually possible to put it into production and run it without noticeable lag for the user. You can try it out with raffle today.
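To get a feel for where a parameter count of this size comes from, here is a back-of-the-envelope sketch. The architecture numbers (vocabulary size, hidden size, number of layers, feed-forward size) are assumptions taken from the publicly released English BERT configurations, not from this post, and the count leaves out the word-prediction head used during pre-training.

```python
def dense(n_in, n_out):
    """Parameters of a fully connected layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def bert_parameter_count(vocab_size, hidden, layers, intermediate,
                         max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder."""
    layer_norm = 2 * hidden  # scale and shift vectors
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, followed by a layer norm.
    attention = 4 * dense(hidden, hidden) + layer_norm
    # Two-layer feed-forward block, followed by a layer norm.
    feed_forward = dense(hidden, intermediate) + dense(intermediate, hidden) + layer_norm
    pooler = dense(hidden, hidden)
    return embeddings + layers * (attention + feed_forward) + pooler


# Assumed configurations of the released base and large models.
base = bert_parameter_count(vocab_size=30522, hidden=768, layers=12, intermediate=3072)
large = bert_parameter_count(vocab_size=30522, hidden=1024, layers=24, intermediate=4096)

print(f"base:  about {base / 1e6:.0f} million parameters")
print(f"large: about {large / 1e6:.0f} million parameters")
```

Most of the parameters sit in the stacked transformer layers, so doubling the number of layers and widening them is what takes the model from the base size to the large size.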
In the next post in this series, we’ll look more closely at how we fine-tune to solve question-answering tasks. We will also look into a larger trend of how search is changing.