Lessons Learned from a Redis Outage at Yipit

About two weeks ago, we had an outage on our main redis server. In dealing with that issue, we came up with some guidelines to make administration easier and lessen the chance of a future outage. Keep in mind that here at Yipit, we use redis mostly as a datastore for denormalized or precalculated data.

Redis Outage

Sometime at the beginning of the week, we noticed our memory usage climbing. We use an m1.large EC2 instance with 7.5 GB of memory, so when our memory usage shot up to 6 GB, it was time to start analyzing our keys and cleaning up.

We found keys related to functionality we no longer support and decided to delete them. We ended up with a script along the lines of:

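import redis

# Connects to localhost:6379 by default.
conn = redis.StrictRedis()

def clear_keys(pattern):
    # KEYS scans the whole keyspace and blocks the server while it runs.
    keys = conn.keys(pattern)
    # Queue one DEL per key, then send the whole batch on execute().
    pipe = conn.pipeline()
    for key in keys:
        pipe.delete(key)
    pipe.execute()

clear_keys('foo:bar:*')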

Using this method, we were able to quickly clear out a few hundred thousand keys.

The next morning the site started going in and out intermittently. We were able to narrow this down to our redis server, which we realized was out of memory. This was surprising since we had cleared out a large chunk of data the night before.

It turns out the real culprit was that our redis data persistence directory had run out of disk space. We use AOF persistence and were using BGREWRITEAOF via a cron job to compress the file. Every delete is appended to the AOF, so after issuing hundreds of thousands of deletes the night before, our AOF file had grown quite large. When the background rewrite was issued and a temporary file was created, the disk ran out of space. With the disk full, our redis server had nowhere to flush the commands that needed to go into the AOF and ran out of memory.

To fix this, we had to delete the temporary file and stop redis, which freed enough room for the pending AOF commands and forced them to flush to disk. We then moved the file to a larger volume, remounted the directory and started redis back up. This whole procedure, including the time it took for redis to load the data from the AOF file into memory, took around 45 minutes.

Looking back on the whole ordeal, we came up with the following guidelines going forward.

Lessons Learned

Redis Keys are not Forever

Whether it’s a timed expiry set directly in redis, or a logical expiry that explicitly deletes keys based on application conditions, all keys should eventually be deleted. Keys with a logical expiry should have a cleanup job to ensure they don’t get left behind. This helps keep our memory footprint lean. At Yipit, if we need data forever, redis is not the correct data store.
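As a rough sketch of both kinds of expiry, assuming conn is a redis-py client like the one in the script above. The function names and the is_still_needed callback are hypothetical, chosen only for illustration:

```python
def cache_with_ttl(conn, key, value, ttl_seconds):
    """Timed expiry: redis removes the key itself once the TTL elapses."""
    conn.set(key, value, ex=ttl_seconds)


def cleanup_logical_expiry(conn, candidate_keys, is_still_needed):
    """Logical expiry: delete every key the application no longer needs.

    is_still_needed is a hypothetical callback that takes a key and
    returns True while the application still depends on it.
    """
    stale = [key for key in candidate_keys if not is_still_needed(key)]
    if stale:
        conn.delete(*stale)
    return len(stale)
```

A cleanup job along these lines could run on a schedule so that keys with a logical expiry are removed even if the application path that normally deletes them never fires.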

Namespace All The Keys

Aside from helping to ensure uniqueness, namespacing keys is a huge help for administration. Want to run analysis against our deal click tracking data? Just analyze the deal:clicks:* keys.
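One way to keep namespaces consistent is to build every key and pattern through a small helper. This is only a sketch: make_key, namespace_pattern and count_namespace are hypothetical names, and conn is assumed to be a redis-py client.

```python
def make_key(*parts):
    """Join key components with ':' so every key carries its namespace."""
    return ':'.join(str(part) for part in parts)


def namespace_pattern(*static_parts):
    """Build a glob pattern matching every key under a namespace."""
    return make_key(*static_parts) + ':*'


def count_namespace(conn, *static_parts):
    """Count the keys under a namespace.

    KEYS walks the entire keyspace and blocks redis while it runs,
    so this is better suited to a replica or a quiet period.
    """
    return len(conn.keys(namespace_pattern(*static_parts)))
```

With these helpers, make_key('deal', 'clicks', 123) gives 'deal:clicks:123' and namespace_pattern('deal', 'clicks') gives 'deal:clicks:*', the same pattern mentioned above.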

Use Separate Character Sets for the Static and Dynamic Components of Key Names

At Yipit, most of our redis data maps to either a time or a row from our MySQL database. To keep administrative analysis easy, we use lowercase letters for the static parts of our keys and numeric characters for the dynamic parts of our keys. This makes it easy to do things like find the unique types of keys we have stored in redis.
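Because the dynamic parts are purely numeric, finding the unique key types comes down to collapsing every run of digits into a placeholder. A minimal sketch, assuming the keys arrive as str (for example from a redis-py client created with decode_responses=True); key_type and count_key_types are hypothetical names, and 'user:42:deals' is a made-up key used only for the example:

```python
import re
from collections import Counter


def key_type(key):
    """Replace each run of digits with '#' so only the static parts remain."""
    return re.sub(r'\d+', '#', key)


def count_key_types(keys):
    """Tally how many keys share each static key shape."""
    return Counter(key_type(key) for key in keys)


sample = ['deal:clicks:123', 'deal:clicks:456', 'user:42:deals']
print(count_key_types(sample))
# Counter({'deal:clicks:#': 2, 'user:#:deals': 1})
```

This only works because the static parts never contain digits. A key scheme that mixes digits into its static components would need a stricter pattern.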

Watch Your Disk Space

This should go without saying, but a full disk was the real root cause of our outage.
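A simple guard is to check free space before a scheduled rewrite. The sketch below makes the conservative assumption that the rewrite's temporary file could grow as large as the current AOF; has_room_for_rewrite and safety_factor are hypothetical names, and the AOF path would come from your own redis configuration.

```python
import os
import shutil


def has_room_for_rewrite(aof_path, safety_factor=1.0):
    """Return True if the AOF's volume has room for a rewrite.

    Assumes, conservatively, that the temporary file written during
    the rewrite may reach the size of the existing AOF.
    """
    aof_size = os.path.getsize(aof_path)
    volume = os.path.dirname(os.path.abspath(aof_path))
    free_bytes = shutil.disk_usage(volume).free
    return free_bytes >= aof_size * safety_factor
```

A cron job could call this first and only issue BGREWRITEAOF (bgrewriteaof() in redis-py) when it returns True, alerting someone otherwise.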

The End

Hopefully you do a better job watching disk space and don’t need to do emergency maintenance. Either way, when you have to do some sort of administration, emergency or not, hopefully you can learn from our mistakes.

Zach Smith is the Technical Product Manager at Yipit. You can follow him on Twitter @zmsmith and follow @YipitDjango for more Django tips from all the Yipit engineers.

Oh, by the way, we’re hiring.

Yipit Django Blog
