About two weeks ago we had an outage on our main redis server. In dealing with that issue, we came up with some guidelines to ensure easier administration and lessen the chance of a future outage. Keep in mind that here at yipit, we use redis mostly as a datastore for denormalized or precalculated data.
Redis Outage
Sometime at the beginning of the week, we noticed our memory usage climbing. We use an m1.large EC2 instance with 7.5 GB of memory, so when our memory usage shot up to 6 GB, it was time to start analyzing our keys and cleaning up.
We found keys related to functionality we no longer support and decided to delete them. We ended up with a script along the lines of:
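A minimal sketch of that kind of cleanup, assuming the redis-py client (its `scan_iter` and `delete` calls) and a made-up key pattern. The helper name, pattern and batch size are illustrative, not the script we actually ran:

```python
def delete_matching(client, pattern, batch_size=1000):
    """Delete every key matching `pattern`, one batch at a time.

    `client` is assumed to expose redis-py's `scan_iter` and `delete`.
    Returns the number of keys redis reports as deleted.
    """
    deleted = 0
    batch = []
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            # One DEL command per batch instead of one per key.
            deleted += client.delete(*batch)
            batch = []
    if batch:
        deleted += client.delete(*batch)
    return deleted
```

With a real connection this would be called as something like `delete_matching(redis.Redis(), "old:feature:*")`, where `old:feature:*` is a hypothetical namespace. With AOF persistence enabled, every one of those deletes is also appended to the AOF file.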
Using this method, we were able to quickly clear out a few hundred thousand keys.
The next morning the site started going in and out intermittently. We were able to narrow this down to our redis server, which we realized was out of memory. This was surprising since we had cleared out a large chunk of data the night before.
It turns out the real culprit was that our redis data persistence directory had run out of disk space. We use AOF persistence and were using BGREWRITEAOF via a cron job to compress the file. After issuing hundreds of thousands of deletes the night before, our AOF file had grown quite large. When the background rewrite was issued and a temporary file was created, the disk ran out of space. With the disk full, our redis server had nowhere to flush the commands that needed to go into the AOF and ran out of memory.
To fix this, we deleted the temporary file and stopped redis, which freed enough room for the pending AOF commands and forced them to flush to disk. We then moved the file to a larger volume, remounted the directory and started redis back up. This whole procedure, including the time it took for redis to load the data from the AOF file into memory, took around 45 minutes.
Looking back on the whole ordeal, we came up with these three guidelines going forward.
Lessons Learned
Redis Keys are not Forever
Whether it’s a timed expiry set directly in redis, or a logical expiry that explicitly deletes keys based on application conditions, all keys should eventually be deleted. Keys with a logical expiry should have a cleanup job to ensure they don’t get left behind. This helps keep our memory footprint lean. At yipit, if we need data forever, redis is not the correct data store.
Namespace All The Keys
Aside from helping to ensure uniqueness, namespacing keys is a huge help for administration. Want to run analysis against our deal click tracking data? Just analyze the deal:clicks:* keys.
Use Separate Character Sets for the Static and Dynamic Components of Key Names
At yipit, most of our redis data maps to either a time or a row from our MySQL database. To keep administrative analysis easy, we use lowercase letters for the static parts of our keys and numeric characters for the dynamic parts of our keys. This makes it easy to do things like find the unique types of keys we have stored in redis.
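As a rough illustration of why this convention helps, here is a sketch that collapses the numeric parts of key names to find the distinct key types. It assumes the keys have already been fetched as plain strings; the helper names and the `user:42:deals` key are hypothetical.

```python
import re
from collections import Counter

_DIGITS = re.compile(r"\d+")

def key_type(key):
    """Replace each run of digits (the dynamic part) with a '#' placeholder."""
    return _DIGITS.sub("#", key)

def count_key_types(keys):
    """Count how many keys share each static skeleton."""
    return Counter(key_type(k) for k in keys)

sample = ["deal:clicks:123", "deal:clicks:456", "user:42:deals"]
print(count_key_types(sample))
# Counter({'deal:clicks:#': 2, 'user:#:deals': 1})
```

Because the static parts never contain digits, a single regular expression is enough to separate the two; no per-namespace parsing rules are needed.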
Watch Your Disk Space
This should go without saying, but it was really the root cause.
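One simple guard, sketched here with the Python standard library, is to check free space before kicking off a rewrite from cron. The assumption that the temporary rewrite file may need as much room as the current AOF is a deliberately conservative guess, and the `headroom` multiplier and function name are arbitrary.

```python
import os
import shutil

def safe_to_rewrite(aof_path, headroom=1.5):
    """Return True if the volume holding the AOF has room for a rewrite.

    Assumes the temporary rewrite file can be as large as the current AOF;
    `headroom` is an arbitrary safety multiplier on top of that.
    """
    aof_size = os.path.getsize(aof_path)
    directory = os.path.dirname(os.path.abspath(aof_path))
    free = shutil.disk_usage(directory).free
    return free >= aof_size * headroom
```

A cron wrapper could call this first and only issue BGREWRITEAOF when it returns True, alerting someone otherwise.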
The End
Hopefully you do a better job watching disk space and don’t need to do emergency maintenance. Either way, when you have to do some sort of administration, emergency or not, hopefully you can learn from our mistakes.
Zach Smith is the Technical Product Manager at Yipit. You can follow him on twitter @zmsmith and follow @YipitDjango for more django tips from all the yipit engineers.
We were convinced of the many, many benefits of split-testing. But in true lean startup fashion, we wanted to implement an MVP first to better understand what a more feature-rich system might look like.
While Eric Ries has also suggested simple ways to split test, we were able to do it using one line of code and a SQL query.
How we did it
Add one line of code
if not mod(user.id, 2):
    # [do something new]
Let the experiment run
Analyze the test and control groups on a key metric you track (e.g. % clicker) since the split test was started. This is easily done in MySQL*:
Compare the numbers for both the control and test group to see how you did!
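The same comparison can be sketched outside the database. Assuming you have already pulled `(user_id, clicked)` pairs for users active since the test started (the row shape and the function name are hypothetical, not our actual query), the share of clickers per group works out as:

```python
def click_rates(rows, modulus=2):
    """Return (test_rate, control_rate) as fractions of users who clicked.

    `rows` is an iterable of (user_id, clicked) pairs. A user is in the
    test group when user_id % modulus == 0, mirroring `not mod(user.id, 2)`.
    """
    totals = {True: [0, 0], False: [0, 0]}  # in_test -> [clickers, users]
    for user_id, clicked in rows:
        group = totals[user_id % modulus == 0]
        group[0] += 1 if clicked else 0
        group[1] += 1

    def rate(pair):
        clickers, users = pair
        return clickers / users if users else 0.0

    return rate(totals[True]), rate(totals[False])
```

The grouping rule has to match the one in the application code exactly, otherwise the two groups being compared are not the ones that saw different experiences.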
Why does this work?
Random: if you’re like most companies, there’s no fundamental difference between users whose ids are odd, even or divisible by 10, so using user_id controls for any potential biases.
Consistent: the user_id isn’t going to change once a user is registered so we can ensure a consistent experience for tests that might require a couple days to show results.
Deterministic: When you have a lot of users, you don’t want to have to store which users are in which split test. Splitting users by modulus makes it easy to analyze the results through querying MySQL.
Controllable: By changing the modulus, you can set the percentage of users you want to be in the test group. Want to test a risky idea? Use not mod(user.id, 10) to only experiment on 10% of your users.
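To make the "deterministic" and "controllable" points concrete, here is a small sketch. It assumes `mod` comes from Python's `operator` module; the `in_test_group` helper is a name made up for this example.

```python
from operator import mod

def in_test_group(user_id, modulus=2):
    """True for users whose id is evenly divisible by `modulus`."""
    return not mod(user_id, modulus)

# The same id always lands in the same group, with nothing stored anywhere.
assert in_test_group(40) == in_test_group(40)

# With a modulus of 10, one id in ten falls into the test group.
share = sum(in_test_group(i, 10) for i in range(1, 1001)) / 1000
print(share)  # 0.1
```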
It’s been 5 months and we’ve run over 50 split tests, killed some very expensive potential features, and seen a simple subject line optimization bump retention by 15%. Since then, we’ve built a more robust system, but that’s another post.
We’ve also stopped arguing for hours about features and now argue about which keyboard layout is better.
If you follow the steps above, you should be split-testing in less than a week. We were.
Here at Yipit we love using Github. It is a great way to manage our public and private repos and hand off the grunt work of git management. Even better is that we get to use it with Chef to deploy code to our servers on Amazon EC2.
It is a pretty straightforward process for us to start a server and get repository access:
3. Reach out to the Github API and register that SSH key as a Deploy Key with our repository.
TIP: When registering your key, give it a name that can easily be read by humans and machines, like yipit_prod_web1 (i-1234abcd). This will make it easier to manage keys in code or from the web interface.
Great! Now everything works perfectly. You can easily deploy to your machines with fabric, Chef, or on the command line. It doesn’t even have to be kicked off by a developer so you can do it from a central deploy server.
Cool, we have a few other apps under active development that we want to pull from Github so let’s add our key to a few more private repos and we can just …errr…ummm
Well, that’s not good. How do we get access to multiple private repos then? Let’s ask Github Help
The Github Way(s)
1. SSH Agent Forwarding
Do deploys from your local machine by forwarding your SSH credentials when logging in to each server. Works for code rollouts via fabric from a developer’s machine, but not if you want to automate your deployments.
2. Deploy Keys
This is the method discussed above, but we now know that a deploy key must be globally unique across your repositories. Github makes sure to note the downsides of deploy keys as well: any machine with a deploy key has full access to your source control, and the keys will not have a passphrase.
3. Machine Users
Give each machine a user account on Github and authenticate as if they are a person. Not a very automatable or scalable solution.
The Better Way
The options Github provides don’t seem to work very well for our requirements:
Automated (no human interaction)
Each machine should be able to access multiple private repositories.
We get close with Deploy Keys but we are limited to a single repository per SSH key. The solution? Give each machine multiple SSH keys.
Building a better Deploy Key
Using multiple SSH keys can get messy, fast. This means we will want to build an abstraction around it so we don’t directly interface with the complexity (enter Chef or your own homegrown solution). But first we should explain what we are going to do inside our magical abstraction.
1. Creating a new SSH Key
Now we need to figure out how to automatically create new SSH keys and use them when interacting with git.
We can easily create new SSH keys and add them to ~/.ssh/ with the proper permissions.
This will create a new 4096 bit RSA public/private keypair with a custom name so that we don’t overwrite our default keys.
NOTE: We are creating this key without a passphrase since our automation will not have the ability to ask for human input.
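A sketch of that step driven from Python, assuming `ssh-keygen` is on the PATH. The helper names and the `my_repo_rsa` filename are illustrative.

```python
import os
import subprocess

def keygen_command(key_path, comment=""):
    """Build the ssh-keygen argv for a 4096-bit RSA key with no passphrase."""
    return [
        "ssh-keygen",
        "-t", "rsa",
        "-b", "4096",
        "-N", "",        # empty passphrase, since nobody is there to type one
        "-C", comment,   # comment stored in the public key
        "-f", key_path,  # custom filename so the default id_rsa is untouched
    ]

def create_key(key_path, comment=""):
    """Generate the keypair if it does not exist yet; return the .pub path."""
    if not os.path.exists(key_path):
        subprocess.run(keygen_command(key_path, comment), check=True)
    os.chmod(key_path, 0o600)  # private key readable by its owner only
    return key_path + ".pub"
```

For example, `create_key(os.path.expanduser("~/.ssh/my_repo_rsa"), "yipit_prod_web1")` would leave `my_repo_rsa` and `my_repo_rsa.pub` next to the default keys. Skipping generation when the file already exists keeps the step safe to repeat on every run.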
2. API Access
To automate this process we will need to interact with the Github API instead of using the web dashboard. There are some great Github API libraries out there that we could use, but for now we want to keep it simple. Simple as in Bash:
Chef Note: We use a Ruby version of this script inside a custom LWRP as an interface that can be used across recipes. We swallow errors from trying to add the same key to the same repo although the Right Way(TM) is to remember if you have added the key already and just skip the step on subsequent runs.
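For readers who would rather not shell out, here is a comparable sketch in Python using only the standard library. It builds the request for Github's deploy key endpoint (`POST /repos/{owner}/{repo}/keys`). The token-based `Authorization` header, the `read_only` flag and every name passed in are assumptions made for this example, not a copy of our script.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"

def deploy_key_request(owner, repo, title, public_key, token):
    """Build (but do not send) the POST that registers a deploy key."""
    body = json.dumps({
        "title": title,
        "key": public_key,
        "read_only": True,  # assumption: these machines only ever pull
    }).encode("utf-8")
    return urllib.request.Request(
        "%s/repos/%s/%s/keys" % (API_ROOT, owner, repo),
        data=body,
        method="POST",
        headers={
            "Authorization": "token " + token,
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` sends it. Re-adding a key that is already registered should come back as an HTTP error, which is the case the Chef note above describes swallowing.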
Now we can easily add new keys to any repo we control. This is a good start, but it doesn’t solve the issues with using multiple SSH keys.
3. Making Git Behave
We need Git to use these new keys. If you are using SSH keys with Git, it will default to your id_rsa keypair (or DSA if you prefer). If you know the server name you will be connecting to, you can specify a key in your ssh_config file, but when contacting Github all the servers look the same, regardless of repository. We need a different solution.
Git allows you to specify a custom script to run when contacting a remote system (git-fetch or git-push). From the git documentation:
GIT_SSH
If this environment variable is set then git fetch and git push will use this command instead of ssh when they need to connect to a remote system. The $GIT_SSH command will be given exactly two arguments: the username@host (or just host) from the URL and the shell command to execute on that remote system.
To pass options to the program that you want to list in GIT_SSH you will need to wrap the program and options into a shell script, then set GIT_SSH to refer to the shell script.
Usually it is easier to configure any desired options through your personal .ssh/config file. Please consult your ssh documentation for further details.
Unfortunately this isn’t very explicit about what you really need to do, but it is pretty straightforward once you have an example.
-q: Quiet. We prefer that our SSH connections aren’t extremely verbose when we run them from Chef.
-2: Force SSHv2. Version 1 has issues and is only included for backwards compatibility. Github is on top of their updates (and they were founded after SSHv2 was already standard) so we disable the ability to degrade to SSHv1.
-o "StrictHostKeyChecking=yes": We want to ensure SSH is forcing the Remote Host Key validation since it will help prevent MITM attacks. This can be tricky since we need to make sure we have a solid ~/.ssh/known_hosts file. Read more on this option here. To retrieve Github’s server fingerprint we can run:
# Get the key and output it with the server address hashed
ssh-keyscan -H github.com

# Get the key and output it with the server address in plaintext
ssh-keyscan github.com
Be aware that this command could also be affected by MITM attacks so it is best to validate this out of band (multiple locations, different ISPs etc) before copying it into your known_hosts file. We will use Chef templates to manage placing this on our servers as we migrate to strict key checking. You should have a good way to update this if Github switches keys since it will break your rollouts.
-i /path/to/key: This option allows you to specify a private key file to use when connecting with SSH. This is what allows us to map keys to repositories on Github, sidestepping the uniqueness constraint on RSA keys across repositories.
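Putting those four options together, a wrapper along these lines is what GIT_SSH ends up pointing at. This sketch generates it from Python so it can be templated per repository. The helper names and key path are hypothetical, and newer OpenSSH releases have dropped protocol 1 entirely, so `-2` may be unnecessary there.

```python
import os
import shlex

def render_wrapper(key_path):
    """Return the text of a GIT_SSH wrapper that pins ssh to one private key."""
    return (
        "#!/bin/sh\n"
        "exec ssh -q -2 -o StrictHostKeyChecking=yes -i %s \"$@\"\n"
        % shlex.quote(key_path)
    )

def write_wrapper(wrapper_path, key_path):
    """Write the wrapper and make it executable by its owner only."""
    with open(wrapper_path, "w") as handle:
        handle.write(render_wrapper(key_path))
    os.chmod(wrapper_path, 0o700)
```

The trailing `"$@"` forwards the two arguments git supplies (the host and the remote command) to ssh unchanged.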
We place this script somewhere safe (alongside our ssh keys works for us) and make sure it is executable.
$ ll ~/.ssh/
-rw-r--r-- user user authorized_keys
-rw-rw-r-- user user known_hosts
-rw------- user root id_rsa
-rw-r--r-- user root id_rsa.pub
-rw------- user root my_repo_rsa
-rw-r--r-- user root my_repo_rsa.pub
-rwx------ user user my_repo_ssh_wrapper.sh
Once we have this script set up, whenever we are using git to interact with my_repo we need to set GIT_SSH.
A simple script to update a repository from the command line:
#!/bin/bash
cd my_repo

# Set GIT_SSH for this terminal session
export GIT_SSH=~/.ssh/my_repo_ssh_wrapper.sh

# Run our git commands
git pull
4. Automation
Now we know how to add keys to Github, create new keys, and force git to use these new keys. To automate this we just need to stitch these pieces together. In our case we use Chef and this is done very simply:
Create a new SSH key
Use our LWRP to add this key to Github via the API
Template our shell script for use with GIT_SSH
Use the Git deploy provider to download the repo and sync changes during Chef runs. Thankfully the provider has support for using GIT_SSH commands. A sample git block
An additional LWRP could be built around the first 3 steps to make it very simple to use.
TIP: We should switch to using git clone with the depth parameter specified so that we waste less time on our initial checkout when we have no need for detailed history on the machines.
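Outside of Chef, a shallow initial checkout through the wrapper could look roughly like this. It is a sketch: the function names are made up, and a depth of 1 is just one reasonable choice.

```python
import os
import subprocess

def shallow_clone_command(repo_url, dest, depth=1):
    """Build the argv for a clone that fetches only the latest `depth` commits."""
    return ["git", "clone", "--depth", str(depth), repo_url, dest]

def git_env(wrapper_path, base_env=None):
    """Copy the environment and point GIT_SSH at the per-repo wrapper."""
    env = dict(os.environ if base_env is None else base_env)
    env["GIT_SSH"] = wrapper_path
    return env

def shallow_clone(repo_url, dest, wrapper_path, depth=1):
    """Run the clone with GIT_SSH set only for this one git process."""
    subprocess.run(
        shallow_clone_command(repo_url, dest, depth),
        env=git_env(wrapper_path),
        check=True,
    )
```

Setting GIT_SSH in the child process environment rather than exporting it globally means several repositories, each with its own wrapper, can be handled from the same deploy script.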
5. Cleaning up old keys
Well, we are all set now aren’t we? Not quite, we still have the issue of cleaning up old keys from dead machines. We can do this through the web console with some painful window swapping to check which servers still exist, but this sucks. Don’t worry, there is an easy fix. We just need a little more automation:
Contact the AWS EC2 API and get a list of all of our staging and production instance-ids
Contact the Github API and get a list of all of the deploy key names for our repository
Get a list of every instance ID that is in a deploy key name that is not in the AWS instance-id list (Remember when we said to make a server name that can be easily read by both man and machine?).
Delete the deploy keys that contain these extraneous instance-ids
This script is repeatable so you can have it run regularly off a cron job and/or at the tail end of your instance shutdown code so you don’t have to worry about these keys floating about.
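The comparison at the heart of that script can be sketched as a pure function. It assumes the deploy keys arrive as dicts with `id` and `title` fields, as in Github's deploy key listing, and that the live instance ids have already been fetched from EC2. The function name and the regular expression are assumptions for this example.

```python
import re

# Matches an EC2-style instance id such as i-1234abcd inside a key title.
INSTANCE_ID = re.compile(r"\bi-[0-9a-f]+\b")

def stale_key_ids(deploy_keys, live_instance_ids):
    """Return the ids of deploy keys whose title names a dead instance.

    Keys whose title contains no instance id are left alone.
    """
    live = set(live_instance_ids)
    stale = []
    for key in deploy_keys:
        match = INSTANCE_ID.search(key["title"])
        if match and match.group(0) not in live:
            stale.append(key["id"])
    return stale
```

Each returned id can then be passed to the deploy key delete call. Leaving untagged keys alone is a deliberately cautious choice, so a key added by hand is never removed by the cron job.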
6. The Future
Deploying from Github for our large applications will not always be a good solution as we grow. We are investigating alternative methods for dissociating source control distribution from our release process.
The second Yipit Hackathon was a success in every way. Besides being a blast, it brought our team closer together, generated impactful ideas, and featured cool prizes like Romo.
Teams consisting of both developers and business people were given 36 hours to complete their objective: build something useful for Yipit that is awesome enough to win the votes of the other teams. Most of the teams did so well that their projects are either being rolled out, split tested or used internally. Check them out below.
Fortune Teller | Vinicius Vacanti
At Yipit, we use in-house technology to track and analyze key metrics in real time. Vinicius Vacanti, one of our founders, built a tool to give a daily forecast and track progress for each metric.
Deal Machine | Steve Pulec and Dave Tomback
“The No Whammies” built a deal machine that packages deals for a fun day based on where you are and what you want to do. The team integrated with Stripe to handle payments and even offered a package discount.
Deal Voting | Fabio Costa and Gabriel Falcao
Yipit curates deals by providing Yelp reviews, a feature that grew out of the first hackathon.
Team “Os-tripaseca” expanded on this concept by allowing users to upvote deals they like and downvote deals they dislike for collaborative filtering.
Democratic Lunch | Jim Moran and Nistha Tripathi
“Yipit’s Winning Ways” built an email-based voting system for electing the two restaurants to order from on an aggregation of Seamless, GrubHub and Delivery.com.
Remote Collaboration | Adam Nelson and Lincoln de Sousa
Up until recently, some of our developers were working remotely. In an effort to facilitate collaboration, Team “World do Mondo” built a technology to share screens and terminals to code simultaneously.
Getting to Know You | Mingwei Gu and David Sinsky
The Yipit family is always growing, and keeping startup culture alive is both important and challenging. Team “Step by Step” built a game that increases and tests your knowledge of your coworkers’ interests and tastes.
3rd Place | Web UI Revisited | Nitya Oberoi and Ben Plesser
Team “Bet-ya” re-envisioned our web interface with better navigation and grouping deals by popularity. Many of the concepts they came up with are now being used across our different products.
2nd Place | Automated Split Testing | Alice Li and Zach Smith
We are scientists. We constantly split test our ideas. Team “A to Z” built a system to automatically launch, track and roll out our experiments. All of our features, including many of these hackathon projects, are being split tested using this system!
1st Place | Dynamic Deal Recommendation | Suneel Chakravorty and Henry Xie
“The Sick Bros” won the hackathon by building a machine learning algorithm that further personalizes our deal recommendation system by adapting to user behavior.
With no strings attached, developers are empowered to take on more ambitious challenges. Many other companies, such as 37signals, are experimenting with making Hackathons a part of their regular process. So are we.
To a lot of non-developers, learning to code seems like an impossibly daunting task. However, thanks to a number of great resources that have recently been put online for free, teaching yourself to code has never been easier.
I started learning to code earlier this year and can say from experience that learning enough to build your own prototype is not as hard as it seems. In fact, if you want to have a functioning prototype within two months without taking a day off work, it’s completely doable.
Below, I’ve outlined a simple path from knowing nothing about software development to having a working prototype in eight weekends that roughly mirrors the steps I took.
Introduce yourself to the web stack (10 minutes):
The presence of unfamiliar terminology makes any subject seem more confusing than it actually is. Yipit founder/CEO Vin Vacanti has a great overview of some of the key terms you’ll want to be familiar with in language you’ll understand.
Get an introductory grasp of Python and general programming techniques (1 weekend):
Learn Python the Hard Way. Despite the title, the straightforward format makes learning basic concepts really easy and most lessons take less than 10 minutes. However, I found that the format didn’t work as well for some of the more advanced topics so I’d recommend stopping after lesson 42 and moving on.
Google’s python class. Read the notes and / or watch the videos and do all of the associated exercises until you get them right - without looking at the answers. Struggling through the exercises I kept getting wrong was the best learning experience and I would have learned far less if I had just looked at the answers and tried to convince myself that I understood the concepts.
These two resources are somewhat substitutable and somewhat complementary. I recommend doing the first few lessons from both to see which you like better. Once you’ve finished one, skim through the other looking for concepts you aren’t fully comfortable with as a way to get some extra practice.
Get an introductory understanding of Django (1 weekend):
The first time I went through the tutorial I inevitably ended up just following the instructions step-by-step without really understanding what each step did since everything felt so new.
The second time through I wasn’t as focused on the newness of the concepts and was better able to focus on understanding how all the parts work together.
Get a deeper understanding of Python / general programming concepts (2-4 weekends):
Udacity’s intro CS class. Udacity’s courses are generally 7 session classes (2-3 hours per session) that you can take at your own pace. (I’m a huge fan of Udacity’s pedagogy and recommend the intermediate programming class or the web development class as follow-ups to this two-month curriculum.)
Again, I would sample each and see which you like the best. I ended up doing both but that was probably overkill.
Practice building simple web applications (1 weekend):
Work through a few of the exercises in Django by example. These exercises don’t hold your hand quite as much as the Django tutorial but they still provide a fair bit of guidance so I found it to be a nice way to start taking the training wheels off.
That’s it. Eight weekends (or fewer) and you’ve gone from zero to a functioning prototype. Not so daunting after all, is it?
Next Steps:
It goes without saying that there is a huge difference between the relatively cursory amount of knowledge needed to build a simple prototype (the focus of this post) and the depth of knowledge and experience needed to be a truly qualified software engineer.
If you want to learn all that it takes to build modern web applications at scale, getting professional web development experience at a fast-growing startup like Yipit is a great next step.
If you’re smart, hard-working and passionate about creating amazing consumer web experiences, drop us a line at jobs@yipit.com. We’re always looking for great people to join our team.