Photo by Marten Newhall on Unsplash

This is the second in our “The science behind the raffle-lution” series. Read the first here.

When we think about lightning-fast search capabilities today, we think about the masters: Google. With their search engine, you can search the entire internet at the click of a button.

How did they achieve this? Search engines represent the text of a document (or web page) with an index, which in its simplest form is a list of all the words that occur in it. A collection of such documents is represented with a document-term matrix.

This matrix is usually sparse (it contains a lot of zeros) because each document only contains a small subset of all possible words, and this is what makes lookup extremely fast.
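To make this concrete, here is a minimal sketch of a document-term matrix built with scikit-learn's CountVectorizer. The three toy documents are made up purely for illustration; this is not how Google builds its index.

```python
# A minimal sketch of a sparse document-term matrix.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "search engines index web pages",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())  # the vocabulary (one column per word)
print(matrix.toarray())                    # mostly zeros: each doc uses few words
```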

Google: undisputed masters of internet search.


Traditional search has limitations

However, a traditional search index is like putting all the words of a document in a bag and then shaking it. The detailed meaning is gone.

We can partially make up for that by storing not only single words in the bag but also terms made up of two or more words, such as “New York.”

But this isn’t really a solution, because we end up having to store many different terms. Google’s search index stores a vast number of terms; exactly how many is unknown, but Google Trends gives an impression of the scale.
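Here is a small, hypothetical illustration of that blow-up, again using scikit-learn's CountVectorizer: even on a three-document toy corpus, adding two-word terms (bigrams) to the bag noticeably grows the vocabulary.

```python
# Adding bigrams ("new york") to the bag quickly inflates the vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "hotels in new york",
    "new york weather today",
    "weather in san francisco today",
]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
uni_and_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(len(unigrams.vocabulary_))         # number of distinct single words
print(len(uni_and_bigrams.vocabulary_))  # noticeably larger once bigrams are added
```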

From sparse to dense search indices

As we discussed in our last blog post, raffle.ai’s machine learning solutions for natural text search are built upon context-aware text representations.

So where the traditional search approach is fundamentally limited, the machine learning approach can, given sufficient training data, learn to pick up subtle contextual differences that can be decisive for finding the correct answer.

Recently, Google made a major improvement to its search results by using BERT for natural text searches. We see this as part of an overall trend in search: moving from traditional sparse search indices to learned dense representations of documents and queries.

A new way to answer questions

In recent work from both Google Research and Facebook AI Research, we see this approach applied to question answering at scale using fine-tuned BERT models. These systems follow a two-stage process:

  1. Document retrieval. The document retriever first encodes the whole knowledge base (for example, the entirety of Wikipedia!) into a couple of million dense representations. The question is also encoded, and based on a similarity measure (the inner product), the top 5 to 10 excerpts of the knowledge base are passed on to the answer generation stage (a minimal sketch of this stage follows the list).

  2. Answer generation. The answer generator uses the excerpts from the knowledge base together with the question to generate an answer, either by extracting spans of text from the excerpts or by using a generative language model to compose the answer.
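To give a feel for the retrieval stage, here is a hedged sketch using the open-source sentence-transformers library. The encoder model, the tiny knowledge base, and the question are stand-ins for illustration, not the models or data used by Google, Facebook, or raffle.ai.

```python
# Sketch of dense retrieval: encode documents and query, score by inner product.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example pretrained encoder

knowledge_base = [
    "You can reset your password from the account settings page.",
    "Our office is open Monday to Friday, 9am to 5pm.",
    "Refunds are processed within 5 business days.",
]

# Encode the knowledge base once, up front, into dense vectors.
doc_vectors = encoder.encode(knowledge_base)

# Encode the incoming question and score every document by inner product.
question = "How do I change my password?"
query_vector = encoder.encode([question])[0]
scores = doc_vectors @ query_vector

# Pass the top-k excerpts on to the answer generation stage (here: just print them).
top_k = np.argsort(-scores)[:2]
for i in top_k:
    print(scores[i], knowledge_base[i])
```

In a real system the document vectors would be stored in an approximate nearest-neighbour index rather than scored with a full matrix product, but the idea is the same.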

raffle’s AutoPilot and CoPilot use a similar document retrieval model. Each retrieved document is an answer in itself, so there is no need for an answer generation module. Our “secret sauce” is our methodology for fine-tuning from very small labeled datasets.


In the next blog post we will discuss how open source NLP frameworks can accelerate the development of valuable natural text products. 

The last blog post in this series will be about the mid- to long-term perspectives for NLP AI: specifically, whether we can expect genuinely intelligent conversational AIs that are context-aware, factually accurate, and not prone to picking up unwanted biases from data.

Find out more about raffle.ai’s mission on our about page.