How Yipit Scales Thumbnailing With Thumbor and Cloudfront

Yipit, like many other sites, displays collected images, and like many other sites, Yipit needs to display these images in different sizes in different places.

Up until recently, we were using django-imagekit, which works pretty well but has presented some issues as we’ve grown.

Dynamic Generation Issues

Imagekit supports dynamic thumbnail generation, but it checks for and creates the image while rendering the final URL where the image will be accessed. This means that, to take advantage of this feature, rendering a page with 10 thumbnails requires, in the best case, 10 network calls to check for the images’ existence and, in the worst case, retrieving, processing, and uploading 10 images.

Pre-Generation Issues

The other option, which is how Yipit used Imagekit, is to pre-generate all thumbnails and assume they’ve been created properly when rendering a page. Of course, they haven’t always been generated properly, and a system to find and re-generate those images is a pain to maintain. Also, adding a new image size for a new design requires going back and creating new thumbnails for all of the old images.

Solution Requirements

We wanted a thumbnailing solution that would:

Dynamically create images when they’re needed
Serve those images quickly
Not slow down server response times

We were able to achieve these goals using Thumbor behind AWS Cloudfront.

Thumbor is a service written in Python that lets you pass the URL of an image, along with thumbnailing options, in a URI; it then creates the image dynamically. There are libraries for URI generation in Python, Node.js, Ruby, and Java.

Installing Thumbor

Thumbor is installable via pip. We run it in a virtualenv at /var/www/thumbor-env so our entire installation is essentially this:

$ cd /var/www
$ virtualenv thumbor-env
$ source thumbor-env/bin/activate
$ pip install thumbor

Configuring the thumbor server is very well documented, and the available options are covered in a sample configuration file.

We then run it with supervisor behind nginx with these configurations:

Supervisor:

[program:thumbor]
command=/var/www/thumbor-env/bin/python /var/www/thumbor-env/bin/thumbor --port=900%(process_num)s --conf=/var/www/thumbor-env/thumbor/thumbor.conf
process_name=thumbor900%(process_num)s
numprocs=4
user=ubuntu
autostart=true
autorestart=true
stdout_logfile=/var/log/supervisor/thumbor900%(process_num)s.stdout.log
stdout_logfile_backups=3
stderr_logfile=/var/log/supervisor/thumbor900%(process_num)s.stderr.log
stderr_logfile_backups=3

nginx:

upstream thumbor {
    server 127.0.0.1:9000;
    server 127.0.0.1:9001;
    server 127.0.0.1:9002;
    server 127.0.0.1:9003;
}

server {
    listen 8000;
    server_name thumbor.yipit.com;
    # merge_slashes needs to be off if the image src comes in with a protocol
    merge_slashes off;
    location ^~ /thumbor/ {
        rewrite /thumbor(/.*) $1 break;
        proxy_pass http://thumbor;
    }

    location / {
        proxy_pass http://thumbor;
    }


}

This setup runs 4 tornado processes load balanced behind nginx.

Setting Up Cloudfront

Setting up Cloudfront is also very easy. If you want to set up a dedicated Cloudfront distribution for Thumbor, just create a new distribution through the AWS Web Console and set your thumbor URL (in this case thumbor.yipit.com:8000) as your origin domain.

If you want to set up a namespace for thumbor on an existing distribution, first you’ll need to create a new origin pointing to your thumbor server and then a new behavior with the pattern thumbor/* that points at that origin.

Using it in your application

Now your application server doesn’t need to worry about image generation or existence. All you need to do is render thumbor URIs. Here’s the function we use at Yipit:

import urlparse  # Python 2 module; on Python 3 use urllib.parse instead

from django.conf import settings
from libthumbor import CryptoURL

def thumb(url, **kwargs):
    '''
        returns a thumbor url for 'url' with **kwargs as thumbor options.
 
        Positional arguments:
        url -- the location of the original image
 
        Keyword arguments:
        For the complete list of thumbor options
        https://github.com/globocom/thumbor/wiki/Usage
        and the actual implementation for the url generation
        https://github.com/heynemann/libthumbor/blob/master/libthumbor/url.py
    '''
    if settings.THUMBOR_BASE_URL:
        # If THUMBOR_BASE_URL is explicitly set, use that
        base = settings.THUMBOR_BASE_URL
    else:
        # otherwise assume that thumbor is setup behind the same
        # CDN behind the `thumbor` namespace.
        scheme, netloc = urlparse.urlsplit(url)[:2]
        base = '{}://{}/thumbor'.format(scheme, netloc)
    crypto = CryptoURL(key=settings.THUMBOR_KEY)

    # just for code clarity
    thumbor_kwargs = kwargs
    if 'fit_in' not in thumbor_kwargs:
        thumbor_kwargs['fit_in'] = True

    thumbor_kwargs['image_url'] = url
    path = crypto.generate(**thumbor_kwargs)
    return u'{}{}'.format(base, path)

This can also be easily wrapped in a template tag:

from django import template
register = template.Library()

@register.simple_tag
def thumbor_url(image_url, **kwargs):
    return thumb(image_url, **kwargs)

which makes adding thumbnails to your pages as easy as:

<img height="192" width="192" src="{% thumbor_url img_url width=192 height=192 %}" />
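
For a rough idea of what ends up in that src attribute, here is a minimal sketch of the HMAC-SHA1 signing scheme that newer Thumbor releases use, as best we understand it. The libthumbor version in use when this was written may have produced a different path, and the helper name, key, and image URL below are made up for illustration. In practice, let libthumbor build the path.

```python
import base64
import hashlib
import hmac


def sign_thumbor_path(key, image_url, width, height, fit_in=True):
    """Build a path of the form /<signature>/<options>/<image_url>.

    Hypothetical helper: it only covers the fit-in and size options.
    """
    parts = []
    if fit_in:
        parts.append("fit-in")
    parts.append("{}x{}".format(width, height))
    parts.append(image_url)
    unsigned = "/".join(parts)

    # HMAC-SHA1 over the unsigned path, then URL-safe base64.
    digest = hmac.new(
        key.encode("utf-8"), unsigned.encode("utf-8"), hashlib.sha1
    ).digest()
    signature = base64.urlsafe_b64encode(digest).decode("ascii")
    return "/{}/{}".format(signature, unsigned)


# Placeholder key and image location, not real values.
path = sign_thumbor_path("MY_SECURE_KEY", "example.com/photo.jpg", 192, 192)
print(path)
```

Because the signature depends only on the key and the options, the same thumbnail always maps to the same URL, which is what lets Cloudfront cache it.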

Zach Smith is the VP of Engineering at Yipit. You can follow him on twitter @zmsmith and follow @YipitDjango for more django tips from all the yipit engineers.

Oh, by the way, we’re hiring.

Jan 3rd, 2013

aws, thumbor

Lessons Learned From a Redis Outage at Yipit

About two weeks ago we had an outage on our main redis server. In dealing with that issue, we came up with some guidelines to ensure easier administration and lessen the chance of a future outage. Keep in mind that here at yipit, we use redis mostly as a datastore for denormalized or precalculated data.

Redis Outage

Sometime at the beginning of the week, we noticed our memory usage climbing. We use an m1.large EC2 instance with 7.5 GB of memory, so when our memory usage shot up to 6 GB, it was time to start analyzing our keys and cleaning up.

We found keys related to functionality we no longer support and decided to delete them. We ended up with a script along the lines of:

import redis
conn = redis.StrictRedis()

def clear_keys(pattern):
    keys = conn.keys(pattern)
    pipe = conn.pipeline()
    for key in keys:
        pipe.delete(key)
    pipe.execute()

clear_keys('foo:bar:*')

Using this method, we were able to quickly clear out a few hundred thousand keys.

The next morning the site started going in and out intermittently. We were able to narrow this down to our redis server, which we realized was out of memory. This was surprising since we had cleared out a large chunk of data the night before.

It turns out the real culprit was that our redis data persistence directory had run out of disk space. We use AOF persistence and were using BGREWRITEAOF via a cron job to compress the file. After issuing hundreds of thousands of deletes the night before, our AOF file had grown quite large. When the background rewrite was issued and a temporary file was created, the disk ran out of space. With the disk full, our redis server had nowhere to flush the commands that needed to go into the AOF and ran out of memory.

To fix this, we had to delete the temporary file and stop redis, which freed enough room for the pending AOF commands and forced them to flush to disk. We then moved the file to a larger volume, remounted the directory and started redis back up. This whole procedure, including the time it took for redis to load the data from the AOF file into memory, took around 45 minutes.

Looking back on the whole ordeal, we came up with these four guidelines going forward.

Lessons Learned

Redis Keys are not Forever

Whether it’s a timed expiry set directly in redis, or a logical expiry that explicitly deletes keys based on application conditions, all keys should eventually be deleted. Keys with a logical expiry should have a cleanup job to ensure they don’t get left behind. This helps keep our memory footprint lean. At yipit, if we need data forever, redis is not the correct data store.

Namespace All The Keys

Aside from helping to ensure uniqueness, namespacing keys is a huge help for administration. Want to run analysis against our deal click tracking data? Just analyze the deal:clicks:* keys.

Use Separate Character Sets for the Static and Dynamic Components of Key Names

At yipit, most of our redis data maps to either a time or a row from our MySQL database. To keep administrative analysis easy, we use lowercase letters for the static parts of our keys and numeric characters for the dynamic parts of our keys. This makes it easy to do things like find the unique types of keys we have stored in redis.
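
As a rough illustration of why this convention pays off, the sketch below collapses every run of digits to a wildcard so that keys group by their static shape. The deal:clicks namespace comes from the example above. The user:...:deals keys and both function names are made up for the example.

```python
import re
from collections import Counter


def key_type(key):
    """Replace each run of digits with '*' so a key collapses to its static shape."""
    return re.sub(r"\d+", "*", key)


def count_key_types(keys):
    """Count how many keys share each static shape."""
    return Counter(key_type(key) for key in keys)


# Hypothetical sample; in practice the names would come from redis itself.
sample_keys = ["deal:clicks:1042", "deal:clicks:77", "user:9:deals:20120927"]
print(count_key_types(sample_keys))
```

This only works because the static parts never contain digits. A namespace like top10:deals would be mangled by the same substitution.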

Watch Your Disk Space

This should go without saying, but it was really the root cause.
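
A check along these lines, run before kicking off a rewrite, would have caught our problem. This is a sketch rather than what we run: it assumes Python 3, the function names are ours, and it takes the conservative view that the temporary file could grow as large as the current AOF.

```python
import shutil


def free_fraction(path):
    """Fraction of the volume holding `path` that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def has_room_for_rewrite(path, aof_size_bytes):
    """True if the volume could hold a temporary file as large as the current AOF.

    Assumption: the rewritten file is never larger than the existing one.
    """
    return shutil.disk_usage(path).free > aof_size_bytes


# "." stands in for the redis persistence directory.
print("free: {:.0%}".format(free_fraction(".")))
```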

The End

Hopefully you do a better job watching disk space and don’t need to do emergency maintenance. Either way, when you have to do some sort of administration, emergency or not, hopefully you can learn from our mistakes.

Zach Smith is the Technical Product Manager at Yipit. You can follow him on twitter @zmsmith and follow @YipitDjango for more django tips from all the yipit engineers.


Sep 27th, 2012

redis

How We Split-test Using One Line of Code

We were convinced of the many, many benefits of split-testing. But in true lean startup fashion, we wanted to implement an MVP first to better understand what a more feature-rich system might look like.

While Eric Ries has also suggested simple ways to split test, we were able to do it using one line of code and a SQL query.

How we did it

Add one line of code

if not mod(user.id, 2):
        # [do something new]

Let the experiment run
Analyze the test and control groups on a key metric you track (e.g. % clicker) since the split test was started. This is easily done in MySQL*:

SELECT MOD(user_id, 2), count(*)
FROM event_event
WHERE action = 'click' AND date > '2012-07-10'
GROUP BY MOD(user_id, 2)

Compare the numbers for both the control and test group to see how you did!

Why does this work?

Random: if you’re like most companies, there’s no fundamental difference between users whose ids are odd, even or divisible by 10, so using user_id controls for any potential biases
Consistent: the user_id isn’t going to change once a user is registered so we can ensure a consistent experience for tests that might require a couple days to show results.
Deterministic: When you have a lot of users, you don’t want to have to store which users are in which split test. Splitting users by modulus makes it easy to analyze the results through querying MySQL.
Controllable: By changing the modulus, you can set the percentage of users you want to be in the test group. Want to test a risky idea? Use not mod(user.id, 10) to only experiment on 10% of your users.
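
The properties above can be checked with a few lines of Python. This is an illustrative sketch. The function name and the range of user ids are made up, and it uses the % operator in place of mod.

```python
def in_test_group(user_id, modulus=2):
    """True when the id is divisible by `modulus`, i.e. for 1/modulus of users."""
    return user_id % modulus == 0


# Pretend we have users with ids 1 through 1000.
user_ids = range(1, 1001)
half = sum(in_test_group(uid) for uid in user_ids)
tenth = sum(in_test_group(uid, modulus=10) for uid in user_ids)
print(half, tenth)  # 500 100
```

The assignment needs no storage: the same id always lands in the same group, and the SQL query can recompute the split with MOD(user_id, 2).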

What about significance?

We use the normal approximation to Fisher’s exact test to test for equality of two percentages. It’s easy!
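
For readers who want to try this, here is a minimal sketch of a pooled two-proportion z-test, which is one common way to carry out that normal approximation. It assumes Python 3, it is not the exact code we ran, and the click counts at the bottom are invented.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for the hypothesis that both rates are equal."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Pooled rate under the null hypothesis that the groups do not differ.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented numbers: 120 of 1000 control users clicked, 150 of 1000 test users did.
z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print("z = {:.2f}, p = {:.3f}".format(z, p_value))
```

The approximation is reasonable when both groups have plenty of clickers and non-clickers. With very small counts, Fisher’s exact test itself is the safer choice.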

Conclusion

It’s been 5 months and we’ve run over 50 split tests, killed some very expensive potential features, and seen a simple subject line optimization bump retention by 15%. Since then, we’ve built a more robust system, but that’s another post.

We’ve also stopped arguing for hours about features and now argue about which keyboard layout is better.

If you follow the steps above, you should be split-testing in less than a week. We were.

Sep 11th, 2012

How Yipit Deploys From Github With Multiple Private Repos

Here at Yipit we love using Github. It is a great way to manage our public and private repos and hand off the grunt work of git management. Even better is that we get to use it with Chef to deploy code to our servers on Amazon EC2.

It is a pretty straightforward process for us to start a server and get repository access:

1. Start a new server with the Knife command.

2. Generate a new SSH key

3. Reach out to the Github API and register that SSH key as a Deploy Key with our repository.

TIP: When registering your key, give it a name that can easily be read by both humans and machines, like yipit_prod_web1 (i-1234abcd). This will make it easier to manage keys in code or from the web interface.

Great! Now everything works perfectly. You can easily deploy to your machines with fabric, Chef, or on the command line. It doesn’t even have to be kicked off by a developer so you can do it from a central deploy server.

Cool, we have a few other apps under active development that we want to pull from Github, so let’s add our key to a few more private repos and we can just …errr…ummm

{
  "errors": [
    {
      "resource": "PublicKey",
      "message": "key is already in use",
      "code": "custom",
      "field": "key"
    }
  ],
  "message": "Validation Failed"
}

Well, that’s not good. How do we get access to multiple private repos then? Let’s ask Github Help.

The Github Way(s)

1. SSH Agent Forwarding

Do deploys from your local machine by forwarding your SSH credentials when logging in to each server. This works for code rollouts via fabric from a developer’s machine, but not if you want to automate your deployments.

2. Deploy Keys

This is the method discussed above, but as we now know, a Deploy Key can only be registered with one repository. Github makes sure to note the downsides of deploy keys as well: any machine with a deploy key has full access to your source control, and the keys will not have a passphrase.

3. Machine Users

Give each machine a user account on Github and authenticate as if they are a person. Not a very automatable or scalable solution.

The Better Way

The options Github provides don’t seem to work very well for our requirements:

Automated (no human interaction)
Each machine should be able to access multiple private repositories.

We get close with Deploy Keys but we are limited to a single repository per SSH key. The solution? Give each machine multiple SSH keys.

Building a better Deploy Key

Using multiple SSH keys can get messy, fast. This means we will want to build an abstraction around it so we don’t directly interface with the complexity (enter Chef or your own homegrown solution). But first we should explain what we are going to do inside our magical abstraction.

1. Creating a new SSH Key

Now we need to figure out how to automatically create new SSH keys and use them when interacting with git.

We can easily create new SSH keys and add them to ~/.ssh/ with the proper permissions.

ssh-keygen -b 4096 -t rsa -f /home/${USER}/.ssh/${REPO}_rsa -P ""

This will create a new 4096 bit RSA public/private keypair with a custom name so that we don’t overwrite our default keys.

NOTE: We are creating this key without a passphrase since our automation will not have the ability to ask for human input.

2. API Access

To automate this process we will need to interact with the Github API instead of using the web dashboard. There are some great Github API libraries out there that we could use, but for now we want to keep it simple. Simple as in Bash:

SSH_KEY_FILE="/home/${USER}/${KEY_NAME}.pub"

# Strip newlines so the key fits on a single line of JSON
SSH_KEY=$(tr -d '\n' < "${SSH_KEY_FILE}")

curl -d "{\"title\": \"${KEY_NAME}\", \"key\": \"${SSH_KEY}\"}" -H "Authorization: token ${GITHUB_API_KEY}" "https://api.github.com/repos/${ORG_NAME}/${REPO}/keys"

Chef Note: We use a Ruby version of this script inside a custom LWRP as an interface that can be used across recipes. We swallow errors from trying to add the same key to the same repo although the Right Way(TM) is to remember if you have added the key already and just skip the step on subsequent runs.

Now we can easily add new keys to any repo we control. This is a good start, but it doesn’t solve the issues with using multiple SSH keys.
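
If Bash string escaping makes you nervous, the same request is easy to build with the Python 3 standard library. This is a sketch, not our LWRP: the function name, the truncated placeholder key, and the token are made up, and the request is only constructed, never sent.

```python
import json
import urllib.request


def build_deploy_key_request(org, repo, title, public_key, token):
    """Build (but do not send) the POST that registers a deploy key on a repo."""
    url = "https://api.github.com/repos/{}/{}/keys".format(org, repo)
    # json.dumps takes care of quoting, so the key needs no manual escaping.
    body = json.dumps({"title": title, "key": public_key.strip()}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": "token " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_deploy_key_request(
    "Yipit",
    "my_repo",
    "yipit_prod_web1 (i-1234abcd)",
    "ssh-rsa AAAA... deploy\n",  # placeholder, not a usable key
    "HYPOTHETICAL_TOKEN",
)
# urllib.request.urlopen(request) would send it; omitted to avoid side effects.
print(request.get_method(), request.full_url)
```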

3. Making Git Behave

We need Git to use these new keys. If you are using SSH keys with Git, it will default to your id_rsa keypair (or DSA if you prefer). If you know the server name you will be connecting to you can specify a key in your ssh_config file, but when contacting Github all the servers look the same, regardless of repository. We need a different solution.

Git allows you to specify a custom script to run when contacting a remote system (git-fetch or git-push). From the git documentation:

GIT_SSH

If this environment variable is set then git fetch and git push will use this command instead of ssh when they need to connect to a remote system. The $GIT_SSH command will be given exactly two arguments: the username@host (or just host) from the URL and the shell command to execute on that remote system.

To pass options to the program that you want to list in GIT_SSH you will need to wrap the program and options into a shell script, then set GIT_SSH to refer to the shell script.

Usually it is easier to configure any desired options through your personal .ssh/config file. Please consult your ssh documentation for further details.

Unfortunately this isn’t very explicit about what you really need to do, but it is pretty straightforward once you have an example.

Our script will look like this:

#!/bin/bash
/usr/bin/env ssh -q -2 -o "StrictHostKeyChecking=yes" -i "/home/my_user/.ssh/my_repo_rsa" "$1" "$2"

We are specifying four custom options:

-q: Quiet. We prefer that our SSH connections aren’t extremely verbose when we run them from Chef.
-2: Force SSHv2. Version 1 has issues and is only included for backwards compatibility. Github is on top of their updates (and they were founded after SSHv2 was already standard) so we disable the ability to degrade to SSHv1.
-o "StrictHostKeyChecking=yes": We want to ensure SSH is forcing the Remote Host Key validation since it will help prevent MITM attacks. This can be tricky since we need to make sure we have a solid ~/.ssh/known_hosts file. Read more on this option here. To retrieve Github’s server fingerprint we can run:

# Get the key and output it with the server address hashed
ssh-keyscan -H github.com

# Get the key and output it with the server address in plaintext
ssh-keyscan github.com

Be aware that this command could also be affected by MITM attacks so it is best to validate this out of band (multiple locations, different ISPs etc) before copying it into your known_hosts file. We will use Chef templates to manage placing this on our servers as we migrate to strict key checking. You should have a good way to update this if Github switches keys since it will break your rollouts.

-i /path/to/key: This option allows you to specify a private key file to use when connecting with SSH. This is what allows us to map keys to repositories on Github, sidestepping the uniqueness constraint on RSA keys across repositories.

We place this script somewhere safe (alongside our ssh keys works for us) and make sure it is executable.

$ ll ~/.ssh/
-rw-r--r-- user user authorized_keys
-rw-rw-r-- user user known_hosts
-rw------- user root id_rsa
-rw-r--r-- user root id_rsa.pub
-rw------- user root my_repo_rsa
-rw-r--r-- user root my_repo_rsa.pub
-rwx------ user user my_repo_ssh_wrapper.sh
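
Since we need one wrapper per repository, it helps to generate them. The sketch below writes a wrapper next to the key and makes it executable by its owner only. It assumes Python 3 on a POSIX system and the file naming shown in the listing above. The function name is ours, and the demo writes to a temporary directory rather than a real ~/.ssh.

```python
import os
import stat
import tempfile

WRAPPER_TEMPLATE = """#!/bin/bash
/usr/bin/env ssh -q -2 -o "StrictHostKeyChecking=yes" -i "{key_path}" "$1" "$2"
"""


def write_ssh_wrapper(ssh_dir, repo):
    """Write <repo>_ssh_wrapper.sh into ssh_dir, pointing at <repo>_rsa."""
    key_path = os.path.join(ssh_dir, repo + "_rsa")
    wrapper_path = os.path.join(ssh_dir, repo + "_ssh_wrapper.sh")
    with open(wrapper_path, "w") as handle:
        handle.write(WRAPPER_TEMPLATE.format(key_path=key_path))
    # rwx for the owner only, matching the -rwx------ entry in the listing.
    os.chmod(wrapper_path, stat.S_IRWXU)
    return wrapper_path


demo_dir = tempfile.mkdtemp()  # stand-in for ~/.ssh
wrapper = write_ssh_wrapper(demo_dir, "my_repo")
print(wrapper)
```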

Once we have this script set up, whenever we are using git to interact with my_repo we need to set GIT_SSH.

A simple script to update a repository from the command line:

#!/bin/bash

cd my_repo

# Set GIT_SSH for this terminal session
export GIT_SSH=~/.ssh/my_repo_ssh_wrapper.sh

# Run our git commands
git pull

4. Automation

Now we know how to add keys to Github, create new keys, and force git to use these new keys. To automate this we just need to stitch these pieces together. In our case we use Chef and this is done very simply:

Create a new SSH key
Use our LWRP to add this key to Github via the API
Template our shell script for use with GIT_SSH
Use the Git deploy provider to download the repo and sync changes during Chef runs. Thankfully the provider has support for using GIT_SSH commands. A sample git block:

git "update #{my_repo}" do
  user username
  group groupname
  repository "git@github.com:Yipit/#{my_repo}.git"
  reference branch
  destination "/var/www/#{my_repo}"
  ssh_wrapper "/home/#{username}/.ssh/#{my_repo}_ssh_wrapper.sh"
  action :sync
end

An additional LWRP could be built around the first 3 steps to make it very simple to use.

TIP: We should switch to using git clone with the depth parameter specified so that we waste less time on our initial checkout when we have no need for detailed history on the machines.

5. Cleaning up old keys

Well, we are all set now aren’t we? Not quite, we still have the issue of cleaning up old keys from dead machines. We can do this through the web console with some painful window swapping to check which servers still exist, but this sucks. Don’t worry, there is an easy fix. We just need a little more automation:

Contact the AWS EC2 API and get a list of all of our staging and production instance-ids
Contact the Github API and get a list of all of the deploy key names for our repository
Get a list of every instance ID that is in a deploy key name that is not in the AWS instance-id list (Remember when we said to make a server name that can be easily read by both man and machine?).
Delete the deploy keys that contain these extraneous instance-ids

This script is repeatable so you can have it run regularly off a cron job and/or at the tail end of your instance shutdown code so you don’t have to worry about these keys floating about.
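
The comparison at the heart of that script fits in a few lines. The sketch below assumes key titles follow the name (instance-id) convention from the tip earlier and that each deploy key is a dict with id and title fields. The function name and sample data are made up, and fetching the lists and deleting the keys are left out.

```python
import re

# Matches the "(i-1234abcd)" suffix from the naming convention.
INSTANCE_ID = re.compile(r"\((i-[0-9a-f]+)\)")


def stale_deploy_keys(deploy_keys, live_instance_ids):
    """Return the deploy keys whose embedded instance id is no longer running."""
    live = set(live_instance_ids)
    stale = []
    for key in deploy_keys:
        match = INSTANCE_ID.search(key["title"])
        # Keys without an instance id in the title are left alone.
        if match and match.group(1) not in live:
            stale.append(key)
    return stale


keys = [
    {"id": 1, "title": "yipit_prod_web1 (i-1234abcd)"},
    {"id": 2, "title": "yipit_prod_web2 (i-deadbeef)"},
    {"id": 3, "title": "laptop key"},
]
print([key["id"] for key in stale_deploy_keys(keys, ["i-1234abcd"])])  # [2]
```

Skipping titles that do not match the convention is deliberate: it keeps the script from deleting keys that a human added by hand.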

6. The Future

Deploying from Github for our large applications will not always be a good solution as we grow. We are investigating alternative methods for dissociating source control distribution from our release process.

Andrew Gross is a Developer at Yipit.

Sep 5th, 2012

aws, chef, git, systems

The Future of Yipit Is Built During Hackathons

The second Yipit Hackathon was a success in every way. Besides being a blast, it brought our team closer together, generated impactful ideas, and featured cool prizes like Romo.

Teams consisting of both developers and business people were given 36 hours to complete their objective: build something useful for Yipit that is awesome enough to win the votes of the other teams. Most of the teams did so well that their projects are either being rolled out, split tested or used internally. Check them out below.

Fortune Teller | Vinicius Vacanti
At Yipit, we use in-house technology to track and analyze key metrics in real time. Vinicius Vacanti, one of our founders, built a tool to give a daily forecast and track progress for each metric.

Deal Machine | Steve Pulec and Dave Tomback
“The No Whammies” built a deal machine that packages deals for a fun day based on where you are and what you want to do. The team integrated with Stripe to handle payments and even offered a package discount.

Deal Voting | Fabio Costa and Gabriel Falcao
Yipit curates deals by providing Yelp reviews, a feature that grew out of the first hackathon. Team “Os-tripaseca” expanded on this concept by allowing users to upvote deals they like and downvote deals they dislike for collaborative filtering.

Democratic Lunch | Jim Moran and Nistha Tripathi
“Yipit’s Winning Ways” built an email-based voting system for electing the two restaurants to order from on an aggregation of Seamless, GrubHub and Delivery.com.

Remote Collaboration | Adam Nelson and Lincoln de Sousa
Up until recently, some of our developers were working remotely. In an effort to facilitate collaboration, Team “World do Mondo” built a technology to share screens and terminals to code simultaneously.

Getting to Know You | Mingwei Gu and David Sinsky
The Yipit family is always growing, and keeping startup culture alive is both important and challenging. Team “Step by Step” built a game that increases and tests your knowledge of your coworkers’ interests and tastes.

3rd Place | Web UI Revisited | Nitya Oberoi and Ben Plesser
Team “Bet-ya” re-envisioned our web interface with better navigation and grouping deals by popularity. Many of the concepts they came up with are now being used across our different products.

2nd Place | Automated Split Testing | Alice Li and Zach Smith
We are scientists. We constantly split test our ideas. Team “A to Z” built a system to automatically launch, track and rollout our experiments. All of our features, including many of these hackathon projects, are being split tested using this system!

1st Place | Dynamic Deal Recommendation | Suneel Chakravorty and Henry Xie
“The Sick Bros” won the hackathon by building a machine learning algorithm that further personalizes our deal recommendation system by adapting to user behavior.

With no strings attached, developers are empowered to take on more ambitious challenges. Many other companies, such as 37signals, are experimenting with making Hackathons a part of their regular process. So are we.

Aug 28th, 2012

Culture, Hackathon, Python