When we think about lightning-fast search today, we think of the master: Google. With its search engine, you can search the web at the click of a button.
How did they achieve this? Search engines represent the text in any document (or web page) with an index, which in its simplest form is a list of all the words that occur there. A collection of such documents is represented with a document-term matrix, with one row per document and one column per term.
This matrix is usually sparse (it contains mostly zeros) because each document contains only a small subset of all possible words. Only the non-zero entries need to be stored and scanned, which makes lookup extremely fast.
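To make the idea concrete, here is a minimal sketch of a sparse document-term representation and the inverted index built from it. The three toy documents, the whitespace tokenizer, and the `search` helper are all assumptions made for illustration; a production search engine is far more elaborate.

```python
from collections import Counter, defaultdict

# Toy corpus (invented for this example).
docs = {
    "d1": "new york is a big city",
    "d2": "york is a city in england",
    "d3": "the new album is out",
}


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


# Sparse document-term matrix: each row stores only its non-zero counts.
doc_term = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

# Inverted index: term -> set of ids of the documents that contain it.
inverted = defaultdict(set)
for doc_id, counts in doc_term.items():
    for term in counts:
        inverted[term].add(doc_id)


def search(query):
    """Return the ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [inverted.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Fraction of matrix cells that are non-zero.
vocabulary = sorted(inverted)
density = sum(len(c) for c in doc_term.values()) / (len(docs) * len(vocabulary))
```

A query only touches the posting sets of its own terms, never the full matrix. Note that `search("new york")` and `search("york new")` return the same documents, because word order is not stored.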
However, a traditional search index is like putting all the words of a document in a bag and then shaking it. The detailed meaning is gone.
We can partially make up for that by storing not only single words in the bag but also terms that are made up of two words or more such as “New York.”
But this isn’t really a solution, because we end up having to store many different terms. Google’s search index stores a lot of terms. Exactly how many is unknown, but you can get an impression of the scope with Google Trends.
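The trade-off can be sketched in a few lines. The example below is an illustration only: the toy corpus and the `ngrams` helper are assumptions, not part of any real search engine. It adds two-word terms to the bag and compares the vocabulary sizes.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy corpus (invented for this example).
corpus = [
    "new york is a big city",
    "york is a city in england",
    "the new album is out",
]

unigram_vocab = set()
bigram_vocab = set()
for text in corpus:
    tokens = text.lower().split()
    unigram_vocab.update(tokens)
    bigram_vocab.update(ngrams(tokens, 2))

# Terms the index must store once two-word terms are included.
combined_vocab = unigram_vocab | bigram_vocab
```

Here "new york" becomes a term of its own, so it can be told apart from "york" alone, but the index now has to store the bigrams on top of the single words. On a real corpus the number of distinct multi-word terms grows much faster than the number of distinct words.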
As we discussed in our last blog post, raffle.ai’s machine learning solutions for natural text search are built upon context-aware text representations.
So where the traditional search approach is fundamentally limited, the machine learning approach can, with sufficient training data, learn to pick up the subtle contextual differences that can be decisive for finding the correct answers.
Recently, Google made a major improvement to its search results by using BERT for natural text searches. We see this as part of an overall trend in search, where we move from traditional sparse search indices to learned dense representations of documents and queries.
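Retrieval with dense representations can be sketched as a nearest-neighbor search: documents and queries are mapped to vectors, and the documents closest to the query are returned. In the example below the three-dimensional vectors, the document names, and the `retrieve` helper are invented for illustration. In a real system the vectors would come from a learned encoder such as BERT and have hundreds of dimensions, and this sketch does not describe Google’s or raffle’s actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical dense document embeddings (hand-written, not model output).
doc_vectors = {
    "reset_password": [0.9, 0.1, 0.0],
    "billing": [0.1, 0.9, 0.1],
    "opening_hours": [0.0, 0.2, 0.9],
}


def retrieve(query_vector, k=1):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vectors,
        key=lambda doc_id: cosine(query_vector, doc_vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embedding of a query such as "I forgot my login".
query_vector = [0.8, 0.2, 0.1]
top = retrieve(query_vector)
```

Unlike the sparse index, the match does not depend on the query and the document sharing any words. It depends only on how close their vectors are, which is what the encoder learns from training data.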
raffle’s AutoPilot and CoPilot use a similar document retrieval model. Each retrieved document is an answer by itself so there is no need for an answer generation module. Our “secret sauce” is our methodology to fine-tune from very small labeled datasets.
In the next blog post we will discuss how open source NLP frameworks can accelerate the development of valuable natural text products.
The last blog post in this series will be about the mid- to long-term perspectives for NLP AI. Specifically, we will ask whether we can expect truly intelligent conversational AIs that are context-aware, factually accurate, and not prone to picking up unwanted biases in data.