A Non-Tech Introduction: Search

The following is the first in a series of blog posts on Search at Yipit.

The ability to search for information is something our generation takes for granted; almost any piece of knowledge on the internet is only a few keystrokes away via Google. Even though most users only see the end product (a user-facing query input box, and the resulting list of hits), the technology behind search is fascinating, bringing together applied academic theory, computer systems architecture, and application development.

Recently, I have been working on optimizing the performance of our own search engine at work, and I now have a much better appreciation for this technology. This post is a non-tech-heavy overview of how our implementation of search works, to give a sense of how cool the concepts and challenges are even to readers who may not have a technical background.

This description is only for one possible implementation of search - specifically, one using ElasticSearch, a search “server” (aka a library, or more generally a type of search engine you can customize, build and use for your own set of data) launched in 2010, based on another search library called Apache Lucene. This is not necessarily how other companies (e.g. Google) do search, but a lot of the concepts here are generalizable.

Basic building blocks

Any search technology has to be based on a search index, which is a database filled with the information that the search engine can look through and bring up in response to a query. The pieces of information in an index are called documents, which can be anything from individual recipes, to TV shows, to web pages. A document is the smallest coherent unit of information that you have. For example, for Netflix a document might be a particular video file (either an episode or a clip), and this document will have various fields of information such as what show it’s from, when it was created, user ratings, etc.

Documents need to follow certain rules about what fields they can have; these rules are defined by the creator of the index in what is called a mapping, which helps define various characteristics of the index. The mapping can tell the index, for example, that video clip documents have a title, which comes in the form of text, an upload date, which comes as a ‘datetime’, a rating, which is a number, and so on.
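To make this concrete, here is what a mapping for the video-clip example might look like, written as the JSON body you would send to an Elasticsearch-style index. The field names and types here are illustrative, not taken from any real index.

```python
# A sketch of a mapping for video-clip documents. Field names are made up;
# the types loosely follow Elasticsearch's mapping syntax.
mapping = {
    "properties": {
        "title":       {"type": "string"},  # free text
        "upload_date": {"type": "date"},    # a datetime
        "rating":      {"type": "float"},   # a number
    }
}

print(mapping["properties"]["upload_date"]["type"])  # date
```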

These documents have to be inputted into the search index, and this data might come from a separate database, which provides what the coder considers a reliable, definitive source of “what’s right”. However, one challenge of maintaining a search index is that the underlying data can change - videos are constantly being rated on Netflix, Twitter posts keep popping up, and a webpage already indexed by Google might change its content. Thus, the documents in an index might need to be updated often.

One caveat here is that a website might use multiple indexes to handle different kinds of documents. For example, Twitter might have one index for users and a separate one for tweets. A creator of a search index can play around with the structure of the indexes to improve the index’s efficiency.

The case of storing text: Analyzers and tokens

A document going into an index needs to be processed before it can be stored. A lot of this processing corresponds to features of search that users normally expect. For example, we expect Google to realize that “restaurant”, “Restaurant” and “restaurants” all generally mean the same thing to us. We also expect it to roughly understand that “restaurant” is close in meaning to “dinner” or “food”.

When a document comes into the index, each field gets processed based on the kind of content. A text field, for example, might first be broken up into individual words by a tokenizer. These tokens are then passed into analyzers which provide some interpretation around the tokens. A simple analyzer might ignore certain common words like “a”, “the”, or “at” (although this poses a challenge for phrases with lots of common words, like “to be or not to be”). Another analyzer might stem words into a standardized form, for example turning “running” into “run” (this gets tricky with words that have similar stems but fairly distinct meanings, such as “computer” and “compute”). A more advanced analyzer might link a word with synonyms, which would help with the example of “restaurant”, “dinner”, and “food” above. These processed tokens of information are better suited to be stored in the index, to allow for all the different kinds of features that users expect.
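The tokenize/stop-word/stem pipeline above can be sketched in a few lines of Python. The stop-word list and suffix rules here are toy examples; real analyzers (e.g. a Porter stemmer) are far smarter.

```python
# A toy analysis pipeline: tokenize, drop stop words, then stem.
STOP_WORDS = {"a", "the", "at", "to", "or", "not"}

def tokenize(text):
    return text.lower().split()

def stem(token):
    # Crude suffix stripping: turns "restaurants" into "restaurant", but
    # "running" only into "runn" (a real stemmer would give "run").
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(analyze("Running at the restaurants"))  # ['runn', 'restaurant']
```

Note how "to be or not to be" would be filtered away entirely by this stop-word list, which is exactly the challenge mentioned above.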

After processing, some fields will become indexed, which means that user queries will be compared to these fields to try and find the relevant hits. Some fields of the document may not be indexed, but will still be associated with the document. An example of a non-indexed field might be the date that a page was published: a user is probably not going to query for a date, so you don’t want a query for “Red October” to return all documents that were published in October. But, you might want to show the user that a page on “Red October” was published in January 2013, so it’s still useful to have that date published field associated with that particular document.

Finding the hits

So now we have many documents that have been cleaned up and standardized - how does the search engine find documents that are appropriate for a query? The answer is a bit academic.

One way to do it is with an inverted index, which is what Lucene uses. To construct an inverted index, the search engine will take the processed fields of an incoming document and note which words (really, tokens) are present in that document. The index will then create a map between individual tokens and which documents have that token. This essentially “inverts” the relationship between the document and its contents; the index knows, in aggregate, all the tokens that are present across all its documents, and which documents to go to to find any particular token.

For example, an index of recipes might know that the word “rhubarb” pops up in documents #3, 105, and 230, but not in any other documents. One advantage of this kind of indexing is that it is relatively fast for an index to take a word in a query and immediately go to the documents that contain that word, since it already has the map. The more complicated and resource-intensive part of this system is at the point when documents get put into the index: the index has to process the document, and add the document as a place to look for all of its constituent tokens.
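The rhubarb example can be built directly: the expensive work happens at indexing time, and lookups become a single dictionary access. This is a sketch with whitespace tokenization standing in for the full analysis step.

```python
from collections import defaultdict

# A toy inverted index: token -> set of document ids.
docs = {
    3: "rhubarb pie with strawberry",
    105: "rhubarb crumble",
    230: "roasted rhubarb",
    7: "chocolate cake",
}

index = defaultdict(set)
for doc_id, text in docs.items():          # the expensive part: indexing
    for token in text.lower().split():
        index[token].add(doc_id)

# The cheap part: a query is just a map lookup, not a scan of every document.
print(sorted(index["rhubarb"]))  # [3, 105, 230]
```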

There is some processing done on the user’s query as well before the index tries to find results. In fact, the processing done on the query is generally similar to the analyzing done on a new document getting put into the index. If the word “running” in a document is stemmed down to “run” to be stored in the index, we also want a user’s search for “running” to be translated into a search for “run”.

Things get more interesting with multi-word queries, like “running outdoors”. After the query is analyzed, what should the index do with the different tokens that result? One option is to say that any hit must have all the query’s tokens, “run” AND “outdoor” (if that is how the analyzers work) - the index knows which documents have “run” and which have “outdoor”, so the hits would be the documents in both of those lists. An even more restrictive way of searching would be to say that the hits must have the phrase “run outdoor” together, while a less restrictive option would be to look for documents that have “run” OR “outdoor”. We can also certainly combine all three, and have the index sort “run outdoor” hits first, then “run” AND “outdoor”, then “run” OR “outdoor”.
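The AND/OR options above are just set operations on posting lists. In this sketch, `postings` is a hypothetical token-to-documents map of the kind an inverted index provides.

```python
# Hypothetical posting lists: token -> set of document ids containing it.
postings = {
    "run": {1, 2, 5, 9},
    "outdoor": {2, 7, 9},
}

def search_and(tokens):
    # Hits must contain every token: intersect the posting lists.
    sets = [postings.get(t, set()) for t in tokens]
    return set.intersection(*sets) if sets else set()

def search_or(tokens):
    # Hits may contain any token: union the posting lists.
    return set().union(*(postings.get(t, set()) for t in tokens))

print(sorted(search_and(["run", "outdoor"])))  # [2, 9]
print(sorted(search_or(["run", "outdoor"])))   # [1, 2, 5, 7, 9]
```

Phrase search ("run outdoor" together) needs more than this: the index must also record token positions within each document.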

Organizing the hits

Given a query, the index now knows, by virtue of its inverted index structure, which documents have the words of that query in the document. Of course, not all documents that contain a token are equally good results for the query, nor are all words equally valuable in finding correct results. Words that appear in too many documents might just be common words, like “the”, if it wasn’t stripped out by a tokenizer earlier. Along another line of reasoning, a document that has the word “Galapagos” 10 times is probably a better result for the query “Galapagos” than a document that only has that word once.

Compensating for common words across all documents while boosting documents that contain a word multiple times is an algorithmic question; one popular formula to use is the term frequency-inverse document frequency (tf-idf) calculation, which is more advanced than an introduction to search needs. The hits for a query might be presented to the user in order of tf-idf score (highest to lowest) by default.
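Even though the details are beyond an introduction, the shape of tf-idf fits in a few lines. This is one common variant (raw term frequency times log inverse document frequency); Lucene's actual scoring formula differs in detail.

```python
import math

docs = [
    "galapagos tortoise galapagos islands",
    "islands of the pacific",
    "galapagos finches",
]

def tf_idf(term, doc, corpus):
    tf = doc.split().count(term)                          # term frequency
    df = sum(1 for d in corpus if term in d.split())      # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0       # rarer term = higher idf
    return tf * idf

scores = [tf_idf("galapagos", d, docs) for d in docs]
# The document mentioning "galapagos" twice scores highest.
print(scores.index(max(scores)))  # 0
```

A word appearing in every document gets idf = log(1) = 0, which is how common words are discounted automatically.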

Or, the creator of the index might want to impose a different way to sort the hits. If the index contains news articles, we might want to always see the most recent article first in the results list. Sorting can be based on a complicated formula that considers a number of factors, including a hit’s tf-idf score. Figuring out how to sort a list of results by default is an interesting product design question.

The user may also want to filter out certain hits completely. Any field of the index’s documents (as defined by the mapping) can be used to exclude hits. A filter of restaurants on Yelp, for example, might be set to include only ones with ratings higher than four stars.


And that’s search in a nutshell! There are lots of cool, advanced topics in this realm, including interesting types of fields you can have, like geographic points, as well as features like query autocomplete that are based on the same search technology. Search is a cool area of computer science that is already ubiquitous, and which will have an increasingly important role as users of the internet generate more and more data every day.

Allen Yang is a Product Analyst at Yipit. The above was originally posted on his personal blog.

Advanced Chef: Writing Heavy Weight Resource Providers (HWRPs)

The last in a short series of Chef related blog posts

Heavy Weight Resource Providers

This is a bit of a backwards term. When Chef first came out, there was no Light Weight Resource Provider (LWRP) syntax and any hardcore extension to Chef had to be written in Ruby. However, Opscode saw a need to be filled and created LWRPs, making it easier to create your own Resources. The problem comes when LWRP cannot fulfill all of your needs. This means you need to fall back to writing pure Ruby code. For lack of a better term, I’ll call this method a HWRP, or Heavy Weight Resource Provider.

While writing a LWRP is meant to be simple and elegant, writing a HWRP is meant to be flexible. It gives you the full power of Ruby in exchange for elegance.

Let’s go over some interactions between LWRP and HWRP.

A few things you need to know

HWRPs and LWRPs are interchangeable

With LWRP you are taught to create a Resource and a Provider together. This is the simplest way. However, just because you need to convert a resource definition or a provider into a HWRP, you do not need to convert both.

The LWRP syntax ‘compiles’ into real Ruby code, so Chef will not know the difference in how they were defined. Here is a valid cookbook directory structure:

libraries/
    provider_default.rb
providers/
resources/
    default.rb
recipes/
    default.rb
metadata.rb

HWRPs live in library files

Chef will attempt to parse anything you put in resources/ or providers/ at runtime. We don’t want Chef trying to read our HWRP as the Chef DSL; we want it to interpret it as code. Luckily, Chef will instead try to import anything stored in the libraries/ folder at runtime. A good example of this can be seen in the runit cookbook.

How to write a HWRP:

Let’s go through an example. We are going to create a HWRP that is very simple; it could easily be written as a LWRP. In fact, it will be. While we write the HWRP, I will post examples of the analogous LWRP code when applicable.

  • Cookbook Name: cloud
  • Resource Name: magic

An example of calling this in a recipe:

cloud_magic "My Cloud Magic" do
  action :create
  cloud "My Cloud Magic"
  magic true
end

Resources

Class Structure

We need to inherit from the appropriate Chef classes in our HWRP. Note the class hierarchy as well as the inheritance:

HWRP

require 'chef/resource'

class Chef
  class Resource
    class CloudMagic < Chef::Resource

    # Some Magic Happens

    end
  end
end

LWRP

This has no counterpart in a LWRP as this part is done automagically by Chef when it reads your files.

Initialization Method

We need to override the initialize method to make sure we have some defaults. We aren’t defining all of our resource attributes here, just the ones that need defaults.

HWRP

require 'chef/resource'

class Chef
  class Resource
    class CloudMagic < Chef::Resource

      def initialize(name, run_context=nil)
        super
        @resource_name = :cloud_magic          # Bind ourselves to the name with an underscore
        @provider = Chef::Provider::CloudMagic # We need to tie to our provider
        @action = :create                      # Default Action Goes here
        @allowed_actions = [:create, :remove]

        # Now we need to set up any resource defaults
        @magic = true
        @cloud = name  # This is equivalent to setting :name_attribute => true
      end

    # Some Magic Happens

    end
  end
end

Here is a similar LWRP, although it actually defines a bit more than the HWRP due to the terseness of the syntax.

LWRP

actions :create, :remove
default_action :create

attribute :magic, :kind_of => [TrueClass, FalseClass], :default => true
attribute :cloud, :kind_of => String, :name_attribute => true

Attribute Methods

Now let’s set up some attribute methods in our HWRP. Make sure to read the code comments for an explanation of what is going on.

HWRP

require 'chef/resource'

class Chef
  class Resource
    class CloudMagic < Chef::Resource

      def initialize(name, run_context=nil)
        super
        @resource_name = :cloud_magic          # Bind ourselves to the name with an underscore
        @provider = Chef::Provider::CloudMagic # We need to tie to our provider
        @action = :create                      # Default Action Goes here
        @allowed_actions = [:create, :remove]

        # Now we need to set up any resource defaults
        @magic = true
        @cloud = name  # This is equivalent to setting :name_attribute => true
      end

      # Define the attributes we set defaults for
      def magic(arg=nil)
        set_or_return(:magic, arg, :kind_of => [TrueClass, FalseClass])
      end

      def cloud(arg=nil)
        # set_or_return is a magic function from Chef that does most of the heavy
        # lifting for attribute access.
        set_or_return(:cloud, arg, :kind_of => String)
        # For now all you need to know is that the method name should be given as
        # a symbol (like :cloud) and that you declare the type with :kind_of
      end

    end
  end
end

There are no changes to our LWRP; everything we just added with these two methods is already included in the LWRP syntax.

LWRP

actions :create, :remove
default_action :create

attribute :magic, :kind_of => [TrueClass, FalseClass], :default => true
attribute :cloud, :kind_of => String, :name_attribute => true

Awesome, now we have defined our resource. Let’s move on to a basic provider.

Providers

Class Structure

Very similar to resources, here is the basic class structure for a provider.

HWRP

class Chef
  class Provider
    class CloudMagic < Chef::Provider

    # Magic Happens

    end
  end
end

LWRP

Once again, this is pure boilerplate with no analogy in a LWRP.

Initialization Method

While we don’t need to write an initialize method (we can), we do need to override load_current_resource.

HWRP

class Chef
  class Provider
    class CloudMagic < Chef::Provider

      # We MUST override this method in our custom provider
      def load_current_resource
        # Here we keep the existing version of the resource
        # if none exists we create a new one from the resource we defined earlier
        @current_resource ||= Chef::Resource::CloudMagic.new(new_resource.name)

        # New resource represents the chef DSL block that is being run (from a recipe for example)
        @current_resource.magic(new_resource.magic)
        # Although you can reference @new_resource throughout the provider it is best to
        # only make modifications to the current version
        @current_resource.cloud(new_resource.cloud)
        @current_resource
      end

      # Magic Happens

    end
  end
end

LWRP

No LWRP changes yet, all we have done so far is set up some boilerplate for the HWRP.

Action Methods

Now it is time to define what we do in our actions. With our HWRP, we need to define methods like action_create to implement a :create action. Chef will do some introspection to find these methods and hook them up.

HWRP

class Chef
  class Provider
    class CloudMagic < Chef::Provider

      # We MUST override this method in our custom provider
      def load_current_resource
        # Here we keep the existing version of the resource
        # if none exists we create a new one from the resource we defined earlier
        @current_resource ||= Chef::Resource::CloudMagic.new(new_resource.name)

        # New resource represents the chef DSL block that is being run (from a recipe for example)
        @current_resource.magic(new_resource.magic)
        # Although you can reference @new_resource throughout the provider it is best to
        # only make modifications to the current version
        @current_resource.cloud(new_resource.cloud)
        @current_resource
      end

      def action_create
        # Some ruby code
      end

      def action_remove
        # More ruby code
      end

    end
  end
end

For our LWRP, it is pretty simple:

LWRP

action :create do
  # Some Chef Code
end

action :remove do
  # Some more Chef code
end

Off into the wild blue yonder

At this point you should have a functioning HWRP. Sure, it doesn’t do anything, but it is best to start small.

Also take a look at the swap cookbook, a hybrid cookbook with a provider written as a HWRP.

Now that you can read these, you should be able to start picking apart definitions inside Chef core as they are very similar.

Andrew Gross is a Developer at Yipit, you can find him as @awgross on Twitter

Semantic Chef: Writing Readable Recipes

The second in a short series of Chef related blog posts

Why Chef is awesome

  • Programmable in Ruby
  • Repeatable recipes define your infrastructure

Why Chef is painful

  • Programmable in Ruby
  • Slow ‘REPL’ to test changes

Ruby lets you do just about anything. It’s a programming language; that’s what it is supposed to do. Chef’s DSL is concise, readable and extensible. The power of Chef comes from allowing you to program in Ruby. When mixing Chef’s DSL with the power of Ruby, things can get out of hand if you are not careful.

Cookbook Examples

Simple

The Opscode examples show nice, simple recipes. As a systems person, you can easily understand what is going on.

package "ntp" do
    action [:install]
end

template "/etc/ntp.conf" do
    source "ntp.conf.erb"
    variables( :ntp_server => "time.nist.gov" )
    notifies :restart, "service[ntpd]"
end

service "ntpd" do
    action [:enable,:start]
end

Messy

Add some more complex logic in and we get a bit messier.

if platform_family?("rhel") and node['python']['install_method'] == 'package'
  pip_binary = "/usr/bin/pip"
elsif platform_family?("smartos")
  pip_binary = "/opt/local/bin/pip"
else
  pip_binary = "/usr/local/bin/pip"
end

# Ubuntu's python-setuptools, python-pip and python-virtualenv packages
# are broken...this feels like Rubygems!
# http://stackoverflow.com/questions/4324558/whats-the-proper-way-to-install-pip-virtualenv-and-distribute-for-python
# https://bitbucket.org/ianb/pip/issue/104/pip-uninstall-on-ubuntu-linux
remote_file "#{Chef::Config[:file_cache_path]}/distribute_setup.py" do
  source node['python']['distribute_script_url']
  mode "0644"
  not_if { ::File.exists?(pip_binary) }
end

execute "install-pip" do
  cwd Chef::Config[:file_cache_path]
  command <<-EOF
  #{node['python']['binary']} distribute_setup.py --download-base=#{node['python']['distribute_option']['download_base']}
  #{::File.dirname(pip_binary)}/easy_install pip
  EOF
  not_if { ::File.exists?(pip_binary) }
end

Inscrutable

Sure you can figure out what it does… eventually.

if node["platform"] == "windows"
    existence_check = :exists?
# Where will also return files that have extensions matching PATHEXT (e.g.
# *.bat). We don't want the batch file wrapper, but the actual script.
    which = 'set PATHEXT=.exe & where'
    Chef::Log.debug "Using exists? and 'where', since we're on Windows"
else
    existence_check = :executable?
    which = 'which'
    Chef::Log.debug "Using executable? and 'which' since we're on Linux"
end

# COOK-635 account for alternate gem paths
# try to use the bin provided by the node attribute
if ::File.send(existence_check, node["chef_client"]["bin"])
  client_bin = node["chef_client"]["bin"]
  Chef::Log.debug "Using chef-client bin from node attributes: #{client_bin}"
# search for the bin in some sane paths
elsif Chef::Client.const_defined?('SANE_PATHS') && (chef_in_sane_path=Chef::Client::SANE_PATHS.map{|p| p="#{p}/chef-client";p if ::File.send(existence_check, p)}.compact.first) && chef_in_sane_path
  client_bin = chef_in_sane_path
  Chef::Log.debug "Using chef-client bin from sane path: #{client_bin}"
# last ditch search for a bin in PATH
elsif (chef_in_path=%x{#{which} chef-client}.chomp) && ::File.send(existence_check, chef_in_path)
  client_bin = chef_in_path
  Chef::Log.debug "Using chef-client bin from system path: #{client_bin}"
else
  raise "Could not locate the chef-client bin in any known path. Please set the proper path by overriding node['chef_client']['bin'] in a role."
end

node.set["chef_client"]["bin"] = client_bin

# libraries/helpers.rb method to DRY directory creation resources
create_directories

case node["chef_client"]["init_style"]
when "init"

  #argh?
  dist_dir, conf_dir = value_for_platform_family(
    ["debian"] => ["debian", "default"],
    ["fedora"] => ["redhat", "sysconfig"],
    ["rhel"] => ["redhat", "sysconfig"],
    ["suse"] => ["suse", "sysconfig"]
  )
# .. snip 200 lines of case statement

What’s Wrong

I don’t like this. When I look at a recipe I want to be able to easily understand what it is doing. It is a recipe, not an annotated thesis. It should be a simple, browsable description of your intentions. Real cookbooks don’t describe the minutiae of each step; they tell you the general steps and leave you to figure out the details.

How?

  • Get your logic out of your recipe; put it in a library.
  • Abstract connected DSL blocks into LWRPs; control flow will be simpler.
  • Use name attributes carefully.
  • Give friendly names to referenced node attributes.

Why?

  • Readability
  • Testability
  • Development Speed

Readability

Developers who aren’t that familiar with Ruby or Chef can still easily read recipe code. It is designed to be readable and declarative. The recipes should define the pieces of your infrastructure, not the edge cases.

Testability

It is hard to test Chef directly. Usually it just results in you running all of the code and testing the system’s final state. When you separate your logic from your Chef code, you separate your concerns. You now have logic to which you can easily apply standard Ruby testing methods. The rest of the recipe is now just Chef code that only needs to be checked for regressions.

Development Speed

Now that you have your logic separate, you can write it just like regular Ruby. You don’t need to waste all of your time waiting for VMs to spin up and compile packages. The only time you need to run Chef is to verify your DSL blocks are set up correctly.

Downsides

  • You have to learn about library files.
  • You have to write tests.
  • You have to name more things.
  • You’ll have to learn about when Chef loads and runs libraries.
  • You’ll have to learn more Ruby.
  • Calling into library files makes the code a bit less elegant.

Next Step

Chef manipulates Infrastructure as Code; it’s about time you started treating your infrastructure code like your application code.

Note: Discussions of this could easily devolve into the MVC “where do I put my logic” debate.

Andrew Gross is a Developer at Yipit, you can find him as @awgross on Twitter

Pragmatic Chef: Verifying Remote Files

The first in a short series of Chef related blog posts

Imagine this everyday occurrence:

  • You are writing a Chef cookbook
  • You need to download the source code of a program as a tarball.
  • Being the good programmer you are, you use Chef’s remote_file resource.
  • You grab the checksum of the file from the website and add it to the resource.

Everything is good, right? Unfortunately not.

The Chef remote_file resource does have a checksum property, but it is not meant as a security measure; it is a bandwidth-saving feature. If the local file’s checksum does not match, Chef will download the file again, but it performs no further validation of the downloaded file against that checksum.
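Conceptually, the check we actually want is simple: hash the file on disk and compare against a known-good digest before using it. Here it is sketched in plain Python (the Chef equivalent is the ruby_block shown further down).

```python
import hashlib

def verify_sha256(path, expected_hexdigest):
    """Return True only if the file at `path` hashes to the expected digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large tarballs don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hexdigest
```

The key point is that this check runs after the download, on the bytes that actually landed on disk.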

Here is an example of using a standard remote_file callback pattern to download and install from source.

remote_file "Download Redis Source" do
  path "#{download_dir}/#{tarball}"
  source download_url
  backup false
  checksum sha256_checksum
  notifies :run, "execute[Unpack Redis Tarball]", :immediately
end

execute "Unpack Redis Tarball" do
  command "tar -xvzf #{tarball}"
  action :nothing
  cwd download_dir
  notifies :run, "execute[Make Redis]", :immediately
end

We are using the checksum field, but it is only checking for already downloaded versions. Here is a better pattern that forces a correct checksum on the downloaded file before you unpack it any further.

remote_file "Download Redis Source" do
  path "#{download_dir}/#{tarball}"
  source download_url
  backup false
  checksum sha256_checksum
  notifies :create, "ruby_block[Validate Tarball Checksum]", :immediately
  not_if { redis_exists? && node['redisio']['safe_install'] }
end

ruby_block "Validate Tarball Checksum" do
  action :nothing
  block do
    require 'digest'
    checksum = Digest::SHA1.file("#{download_dir}/#{tarball}").hexdigest
    if checksum != node['redisio']['sha1_checksum']
      raise "Downloaded Tarball Checksum #{checksum} does not match known checksum #{node['redisio']['sha1_checksum']}"
    end
  end
  notifies :run, "execute[Unpack Redis Tarball]", :immediately
end

execute "Unpack Redis Tarball" do
  command "tar -xvzf #{tarball}"
  action :nothing
  cwd download_dir
  notifies :run, "execute[Make Redis]", :immediately
end

In this case, Redis provides SHA1 checksums while remote_file uses SHA256, so we must add an additional node attribute that stores the SHA1 checksum. These examples were taken from my fork of the Redisio cookbook.

Andrew Gross is a Developer at Yipit, you can find him as @awgross on Twitter

Making Django ORM Queries 30% Faster by Changing 8 Characters

There are dozens of ways to optimize your database queries with the Django ORM. As usual, the Django documentation is great and has a good list of them.

In the middle of that page, it states “Understand select_related() and prefetch_related() thoroughly, and use them”, with links to their respective method documentation. The prefetch_related() documentation goes a bit more in depth:

select_related works by creating a SQL join and including the fields of the
related object in the SELECT statement. For this reason, select_related gets
the related objects in the same database query. However, to avoid the much
larger result set that would result from joining across a 'many'
relationship, select_related is limited to single-valued relationships -
foreign key and one-to-one.

prefetch_related, on the other hand, does a separate lookup for each
relationship, and does the 'joining' in Python. This allows it to prefetch
many-to-many and many-to-one objects, which cannot be done using
select_related, in addition to the foreign key and one-to-one relationships
that are supported by select_related.

tl;dr: select_related does a join; prefetch_related does two separate queries.

We’ve been using prefetch_related extensively for m2m fields ever since we upgraded to Django 1.4, but hadn’t spent much time looking into using prefetch_related for foreign key fields since we were already using select_related for those. We spent some time benchmarking prefetch_related for foreign keys in different scenarios and were surprised by the size of the improvement in some of the scenarios.

Necessary Warning: There are a ton of different variables at play here including network speed, data size, database load, etc. You absolutely need to be profiling your queries to determine whether or not this (or any) optimization will make sense in a given scenario.

Hypothetical Scenario

Let’s imagine you have some models like the following:

models.py

class Publisher(models.Model):
    name = models.CharField(max_length=100)

class Book(models.Model):
    name = models.CharField(max_length=100)
    publisher = models.ForeignKey(Publisher)

and in our views.py:

def books(request, template="books.html"):
    context = {
        'books': Book.objects.all().select_related('publisher'),
    }
    return render(request, template, context)

and books.html:

<h1>Awesome Books</h1>
{% for book in books %}
    {% if 'Awesome' in book.name %}
        <h2>{{ book.name }}</h2>
        <h3>{{ book.publisher.name }}</h3>
    {% endif %}
{% endfor %}

<h1>All Books</h1>
{% for book in books %}
    <h2>{{ book.name }}</h2>
    <h3>{{ book.publisher.name }}</h3>
{% endfor %}

The select_related does its job. A join happens between the book and publisher tables and all the data is pulled in one query. If we didn’t have the select_related, a query of the publisher table would be done for every book iteration in the template.

Now let’s imagine that we live in a world with lots of books, but there are only a few publishers. Say we have 500 books in our system, but only four publishers. The above code will join the book and publisher tables and create 500 book objects and 500 associated publisher objects.

What happens if we used prefetch_related in our view instead?

Book.objects.all().prefetch_related('publisher')

Now two database queries will be done. The first will grab all the books and the second will grab all the publishers. The ORM will create 500 book objects and 4 publisher objects. Then it will associate them. There are two potential reasons why this way might be better.

First, the db may perform better with the two simple queries than the one complicated query. In the above example, it may not seem like that big of a deal, but imagine if you had multiple foreign keys.

Second, we only have to instantiate 4 publisher objects in the latter, whereas we have 500 in the former. This takes less time and uses less memory.
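The “joining in Python” that the documentation mentions can be sketched like this, with plain dicts standing in for model instances. This is a rough illustration, not the ORM’s actual code.

```python
# Pretend result of the first query: SELECT ... FROM book (500 rows).
books = [
    {"name": "Book %d" % i, "publisher_id": i % 4} for i in range(500)
]
# Pretend result of the second query, keyed by primary key (4 rows).
publishers = {
    pk: {"id": pk, "name": "Publisher %d" % pk} for pk in range(4)
}

# The "join" is just a dictionary lookup per book.
for book in books:
    book["publisher"] = publishers[book["publisher_id"]]

# 500 books now share 4 publisher objects instead of carrying 500 copies.
distinct = {id(b["publisher"]) for b in books}
print(len(distinct))  # 4
```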

As I warned before, whether or not either of those two will be true in your particular circumstance depends on a large number of factors and you really need to do proper profiling before making a decision. In some scenarios, we have found up to a 30% speed improvement.

Note that in this scenario, we are accessing the queryset multiple times. If you are only accessing the queryset once, there is a good chance that using iterator() and select_related is the better option.

Steve Pulec is a developer at Yipit. You can follow him on twitter @spulec and follow @YipitDjango for more django tips from all the Yipit engineers.

Oh, by the way, we’re hiring.
