Making Django ORM Queries 30% Faster by Changing 8 Characters

There are dozens of ways to optimize your database queries with the Django ORM. As usual, the Django documentation is great and has a good list of them.

In the middle of that page, it states Understand select_related() and prefetch_related() thoroughly, and use them, with links to their respective method documentation. The prefetch_related() documentation goes a bit more in depth:

select_related works by creating a SQL join and including the fields of the
related object in the SELECT statement. For this reason, select_related gets
the related objects in the same database query. However, to avoid the much
larger result set that would result from joining across a 'many'
relationship, select_related is limited to single-valued relationships -
foreign key and one-to-one.

prefetch_related, on the other hand, does a separate lookup for each
relationship, and does the 'joining' in Python. This allows it to prefetch
many-to-many and many-to-one objects, which cannot be done using
select_related, in addition to the foreign key and one-to-one relationships
that are supported by select_related.

tl; dr: select_related does a join, prefetch_related does two separate queries.

We’ve been using prefetch_related extensively for m2m fields ever since we upgraded to Django 1.4, but hadn’t spent much time looking into using prefetch_related for foreign key fields since we were already using select_related for those. We spent some time benchmarking prefetch_related for foreign keys in different scenarios and were surprised by the size of the improvement in some of the scenarios.

Necessary Warning: There are a ton of different variables at play here including network speed, data size, database load, etc. You absolutely need to be profiling your queries to determine whether or not this (or any) optimization will make sense in a given scenario.

Hypothetical Scenario

Let’s imagine you have some models like the following:

models.py

1
2
3
4
5
6
class Publisher(models.Model):
    name = models.CharField(max_length=100)

class Book(models.Model):
    name = models.CharField(max_length=100)
    publisher = models.ForeighKey(Publisher)

and in our views.py:

1
2
3
4
5
def books(request, template="books.html"):
    context = {
        'books': Book.objects.all().select_related('publisher'),
    }
    return render(request, template, context)

and books.html:

1
2
3
4
5
6
7
8
9
10
11
12
13
<h1>Awesome Books</h1>
{% for book in books %}
    {% if 'Awesome' in book.name %}
        <h2>{{ book.name }}</h2>
        <h3>{{ book.publisher.name }}</h3>
    {% endif %}
{% endfor %}

<h1>All Books</h1>
{% for book in books %}
    <h2>{{ book.name }}</h2>
    <h3>{{ book.publisher.name }}</h3>
{% endfor %}

The select_related does it’s job. A join happens between the book and publisher tables and all the data is pulled in one query. If we didn’t have the select_related, a query of the publisher table would be done for every book iteration in the template.

Now let’s imagine that we live in a world with lots of books, but there are only a few publishers. Say we have 500 books in our system, but only four publishers. The above code will join the book and publisher tables and create 500 book objects and 500 associated publisher objects.

What happens if we used prefetch_related in our view instead?

1
Book.objects.all().prefetch_related('publisher')

Now two database queries will be done. The first will grab all the books and the second will grab all the publishers. The ORM will create 500 book objects and 4 publisher objects. Then it will associate them. There are two potential reasons why this way might be better.

First, the db may perform better with the two simple queries than the one complicated query. In the above example, it may not seem like that big of a deal, but imagine if you had multiple foreign keys.

Second, we only have to instantiate 4 publisher objects in the latter, whereas we have 500 in the former. This takes less time and uses less memory.

As I warned before, whether or not either of those two will be true in your particular circumstance depends on a large number of factors and you really need to do proper profiling before making a decision. In some scenarios, we have found up to a 30% speed improvement.

Note that in this scenario, we are accessing the queryset multiple times. If you are only accessing the queryset once, there is a good chance that using iterator() and select_related is the better option.

Steve Pulec is a developer at Yipit. You can follow him on twitter @spulec and follow @YipitDjango for more django tips from all the Yipit engineers.

Oh, by the way, we’re hiring.

orm

Comments