The following is the first in a series of blog posts on Search at Yipit.
The ability to search for information is something our generation takes for granted; almost any piece of knowledge on the internet is only a few keystrokes away via Google. Even though most users only see the end product (a user-facing query input box, and the resulting list of hits), the technology behind search is fascinating, bringing together applied academic theory, computer systems architecture, and application development.
Recently, I have been working on optimizing the performance of our own search engine at work, and I now have a much better appreciation for this technology. This post is a non-tech-heavy overview of how our implementation of search works, to give a sense of how cool the concepts and challenges are even to readers who may not have a technical background.
This description is only for one possible implementation of search - specifically, one using ElasticSearch, a search “server” (aka a library, or more generally a type of search engine you can customize, build and use for your own set of data) launched in 2010, based on another search library called Apache Lucene. This is not necessarily how other companies (ie Google) do search, but a lot of the concepts here are generalizable.
Basic building blocks
Any search technology has to be based on a search index, which is a database filled with the information that the search engine can look through and bring up in response to a query. The pieces of information in an index are called documents, which can be anything from individual recipes, to TV shows, to web pages. A document is the smallest coherent unit of information that you have. For example, for Netflix a document might be a particular video file (either an episode or a clip), and this document will have various fields of information such as what show it’s from, when it was created, user ratings, etc.
Documents need to follow certain rules about what fields it can possibly have; these rules are defined by the creator of the index in what is called a mapping, which helps define various characteristics of the index. The mapping can tell the index, for example, that video clip documents have a title, which comes in the form of text, an upload date, which comes as a ‘datetime’, a rating, which is a number, and so on.
These documents have to be inputted into the search index, and this data might come from a separate database, which provides what the coder considers a reliable, definitive source of “what’s right”. However, one challenge of maintaining a search index is that the underlying data can change - videos are constantly being rated on Netflix, Twitter posts keep popping up, and a webpage already indexed by Google might change its content. Thus, the documents in an index might need to be updated often.
One caveat here is that a website might use multiple indexes to handle different kinds of documents. For example, Twitter might have one index for users and a separate one for tweets. A creator of a search index can play around with the structure of the indexes to improve the index’s efficiency.
The case of storing text: Analyzers and tokens
A document going into an index needs to be processed before it can be stored. A lot of this processing corresponds to features of search that users normally expect. For example, we expect Google to realize that “restaurant”, “Restaurant” and “restaurants” all generally mean the same thing to us. We also expect it to roughly understand that “restaurant” is close in meaning to “dinner” or “food”.
When a document comes into the index, each field gets processed based on the kind of content. A text field, for example, might first be broken up into individual words by a tokenizer. These tokens are then passed into analyzers which provide some interpretation around the tokens. A simple analyzer might ignore certain common words like “a”, “the”, or “at” (although this poses a challenge for phrases with lots of common words, like “to be or not to be”). Another analyzer might stem words into a standardized form, for example turning “running” into “run” (this gets tricky with words that have similar stems but fairly distinct meanings, such as “computer” and “compute”). A more advanced analyzer might link a word with synonyms, which would help with the example of “restaurant”, “dinner”, and “food” above. These processed tokens of information are better suited to be stored in the index, to allow for all the different kinds of features that users expect.
After processing, some fields will become indexed, which means that user queries will be compared to these fields to try and find the relevant hits. Some fields of the document may not be indexed, but will still be associated with the document. An example of a non-indexed field might be the date that a page was published: a user is probably not going to query for a date, so you don’t want a query for “Red October” to return all documents that were published in October. But, you might want to show the user that a page on “Red October” was published in January 2013, so it’s still useful to have that date published field associated with that particular document.
Finding the hits
So now we have many documents that have been cleaned up and standardized - how does the search engine find documents that are appropriate for a query? The answer is a bit academic.
One way to do it is with an inverted index, which is what Lucene uses. To construct an inverted index, the search engine will take the processed fields of an incoming document and note which words (really, tokens) are present in that document. The index will then create a map between individual tokens and which documents have that token. This essentially “inverts” the relationship between the document and its contents; the index knows, in aggregate, all the tokens that are present across all its documents, and which documents to go to to find any particular token.
For example, an index of recipes might know that the word “rhubarb” pops up in documents #3, 105, and 230, but not in any other documents. One advantage of this kind of indexing is that it is relatively fast for an index to take a word in a query and immediately go to the documents that contain that word, since it already has the map. The more complicated and resource-intensive part of this system is at the point when documents get put into the index: the index has to process the document, and add the document as a place to look for all of its constituent tokens.
There is some processing done on the user’s query as well before the index tries to find results. In fact, the processing done on the query is generally similar to the analyzing done on a new document getting put into the index. If the word “running” in a document is stemmed down to “run” to be stored in the index, we also want a user’s search for “running” to be translated into a search for “run”.
Things get more interesting with multi-word queries, like “running outdoors”. After the query is analyzed, what should the index do with the different tokens that result? One option is to say that any hit must have all the query’s tokens, “run” AND “outdoor” (if that is how the analyzers work) - the index knows which documents have “run” and which have “outdoor”, so the hits would be the documents in both of those lists. An even more restrictive way of searching would be to say that the hits must have the phrase “run outdoor” together, while a less restrictive option would be to look for documents that have “run” OR “outdoor”. We can also certainly combine all three, and have the index sort “run outdoor” hits first, then “run” AND “outdoor”, then “run” OR “outdoor”.
Organizing the hits
Given a query, the index now knows, by virtue of its inverted index structure, which documents have the words of that query in the document. Of course, not all documents that contain a token are equally good results for the query, nor are all words equally valuable in finding correct results. Words that appear in too many documents might just be common words, like “the”, if it wasn’t stripped out by a tokenizer earlier. Along another line of reasoning, a document that has the word “Galapagos” 10 times is probably a better result for the query “Galapagos” than a document that only has that word once.
Compensating for common words across all documents while boosting documents that contain a word multiple times is an algorithmic question; one popular formula to use is the term frequency-inverse document frequency (tf-idf) calculation, which is more advanced than an introduction to search needs. The hits for a query might be presented to the user in order of tf-idf score (highest to lowest) by default.
Or, the creator of the index might want to impose a different way to sort the hits. If the index contains news articles, we might want to always see the most recent article first in the results list. Sorting can be based on a complicated formula that considers a number of factors, including a hit’s tf-idf score. Figuring out how to sort a list of results by default is an interesting product design question.
The user may also want to filter out certain hits completely. Any field of the index’s documents (as defined by the mapping) can be used to exclude hits. A filter of restaurants on Yelp, for example, might be set to include only ones with ratings higher than four stars.
And that’s search in a nutshell! There are lots of cool, advanced topics in this realm, including interesting types of fields you can have, like geographic points, as well as features like query autocomplete that are based on the same search technology. Search is a cool area of computer science that is already ubiquitous, and which will have an increasingly important role as users of the internet generate more and more data every day.
Allen Yang is a Product Analyst at Yipit. The above was originally posted on his personal blog.