Lessons Learned From a Redis Outage at Yipit
About two weeks we had an outage on our main redis server. In dealing with that issue, we came up with some guidelines to ensure easier administration and lessen the chance of a future outage. Keep in mind that here at yipit, we use redis mostly as a datastore for denormalized or precalculated data.
Redis Outage
Sometime at the beginning of the week, we noticed our memory usage climbing. We use an m1.large EC2 instance with 7.5 GB of memory, so when our memory usage shot up to 6 GB, it was time to start analyzing our keys and cleaning up.
We found keys related to functionality we longer support and decided to delete them. We ended up with script along the lines of:
1 2 3 4 5 6 7 8 9 10 11 |
|
Using this method, we were able to quickly clear out a few hundred thousand keys.
The next morning the site started going in and out intermittently. We were able to narrow this down to our redis server, which we realized was out of memory. This was surprising since we had cleared out a large chunk of data the night before.
It turns out the real culprit was our redis data persitence directory had run out of disk space. We use AOF persistence and were using BGREWRITEAOF
via a cron job to compress the file. After issuing hundreds of thousands of deletes the night before, our AOF file had grown quite large. When the backround rewrite was issued and a temporary file was created, the disk ran out of space. With the disk full, our redis server had nowhere to flush the commands that needed to go into the AOF and ran out of memory.
To fix this, we had to delete the temporary file and stop redis, creating enough room for and forcing the AOF commands to flush to disk. We then moved the file to a larger volume, remounted the directory and started redis back up. This whole precedure, including the time it took for redis to load the data from the AOF file into memory, took around 45 minutes.
Looking back on the whole ordeal, we came up with these three guidelines going forward.
Lessons Learned
Redis Keys are not Forever
Whether it’s a timed expiry set directly in redis, or a logical expiry that explicitly deletes keys based on application conditions, all keys should eventually be deleted. Keys with a logical expiry should have a cleanup job to ensure they don’t get left behind. This helps keeps our memory footprint lean. At yipit, if we need data forever, redis is not the correct data store.
Namespace All The Keys
Aside from helping to ensure uniqueness, namespacing keys is a huge help for administration. Want to run analysis against our deal click tracking data? Just analyze the deal:clicks:*
keys.
Use Separate Character Sets for the Static and Dynamic Components of Key Names
At yipit, most of our redis data maps to either a time or a row from our MySQL database. To keep administrative analysis easy, we use lowercase letters for the static parts of our keys and numeric characters for the dynamic parts of our keys. This makes it easy to do things like find the unique types of keys we have stored in redis.
Watch Your Disk Space
This should go without saying, but it was really the root cause.
The End
Hopefully you do a better job watching disk space and don’t need to do emergency maintenance. Either way, when you have to do some sort of administration, emergency or not, hopefully you can learn from our mistakes.
Zach Smith is the Technical Product Manager at Yipit. You can follow him on twitter @zmsmith and follow @YipitDjango for more django tips from all the yipit engineers.
Oh, by the way, we’re hiring.