Introducing Stud
This post was written by Jamie from Bump's server team.
So a few weeks ago, when the zombie army lurched its way toward our servers, a long-neglected scaling issue likewise rose from the dead.
Like all responsible companies, we strive to keep our users' data protected, so we tunnel our proprietary socket protocol in TLS. Unfortunately, because our protocol is not HTTP, the usual candidate for TLS/SSL termination--nginx--is immediately out of the running.
So, when we were designing our backend systems last summer, we grabbed the only serious open source contender for adding TLS to an arbitrary socket: the venerable stunnel project. Stunnel has three "threading models": ucontext, threads (pthread), and processes (fork). We decided to try each and see what worked best.
We had high hopes for ucontext, since the low memory overhead of setjmp/longjmp-based coroutines would work really well for our particular application. Unlike something like HTTP, which typically consists of many short-lived, active connections, Bump holds open low-bandwidth connections to a very large number of clients at once. So minimizing the memory penalty of each connected client was important to us.
But the ucontext threading model of stunnel performed surprisingly poorly. We could saturate one core with only ~1k mostly-idle concurrent connections--and there was no graceful approach for utilizing more than one core without introducing another load balancing layer. This may very well have been due to some of the flaws the folks at RethinkDB recently documented on their blog.
Next, we tried using the pthread-based threaded mode. As programmers, we trust threads about as far as we can throw them--but admit that written skillfully and very carefully, threaded programs can perform excellently. Unfortunately, when put under load, stunnel threw all kinds of nasty assertions and segfaults in threaded mode.
So the process-based (prefork) mode was looking pretty good at this point. We happened to be using a machine with a ridiculous amount of memory (north of 80GB), and copy-on-write semantics meant that each child process would only use around 2MB of resident memory. Furthermore, under the concurrency estimates we had at that time (as well as 12-month projections), the OS scheduler would handle things quite nicely--and to top it off, fork-based multiprocessing is an extremely robust way to run servers, as Apache, PostgreSQL, Unicorn, etc., have proven time and time again.
All in all, we were feeling pretty pleased with ourselves until those pesky zombies *octupled* our concurrent connections overnight.
(If you were wondering, by the way, where process scheduling breaks down--at least, on a mostly-stock Linux kernel--the magic number seems to be around 10-15k processes. By 20k, you are basically doing no actual work--only context switching. And fork() can take upwards of 5 seconds.)
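As a rough back-of-the-envelope check, here is what the ~2MB of resident memory per child process mentioned earlier adds up to at those process counts. This is only a sketch: the function name is made up for illustration, and it ignores kernel bookkeeping and anything shared between processes.

```python
def resident_memory_gb(processes, mb_per_process=2.0):
    """Estimate total resident memory (in GB) for a process-per-connection server.

    mb_per_process defaults to the ~2MB figure quoted in the post; real usage
    will vary with the workload and how much copy-on-write memory stays shared.
    """
    return processes * mb_per_process / 1024


# Process counts taken from the ranges discussed above.
for n in (1_000, 10_000, 15_000, 20_000):
    print(f"{n:>6} processes -> ~{resident_memory_gb(n):.1f} GB resident")
```

Even at 20k processes the estimate stays under the 80GB we had available, which fits what we saw: the scheduler gave out well before the memory did.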
I suppose we *could* have just fired up lots of servers with lots of RAM and spent our way out of the problem. But the whole situation was starting to get a little ridiculous. So we decided to try our hand at a solution.
The result is stud. Stud is the Scalable TLS Unwrapping Daemon (and yes, we worked very hard to make that acronym fit). It takes the nginx model of using one process per CPU core and then doing asynchronous I/O within each process. It has a low memory overhead per secure connection and it minimizes heap allocations to the extreme. It's easy to use and easy to configure.
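To make the "one process per core, asynchronous I/O inside each process" model concrete, here is a minimal sketch in Python. It is not stud's code (stud is a separate program with its own implementation); the function names, the echo behavior standing in for TLS unwrapping, and the max_events knob are all assumptions made purely for illustration.

```python
import os
import selectors
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create a non-blocking listening socket that all workers will share."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(128)
    sock.setblocking(False)
    return sock


def event_loop(listener, max_events=None):
    """Serve many connections from a single process via readiness notification.

    Echoing bytes back is a stand-in for the real work (decrypting and
    proxying). max_events is a hypothetical knob so a demo can stop the loop.
    """
    sel = selectors.DefaultSelector()
    sel.register(listener, selectors.EVENT_READ, data="accept")
    handled = 0
    while max_events is None or handled < max_events:
        for key, _mask in sel.select(timeout=1.0):
            if key.data == "accept":
                try:
                    conn, _addr = key.fileobj.accept()
                except BlockingIOError:
                    continue  # another worker accepted this connection first
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ, data="client")
            else:
                conn = key.fileobj
                try:
                    data = conn.recv(4096)
                except BlockingIOError:
                    continue  # spurious wakeup; wait for the next event
                if data:
                    conn.send(data)  # sketch only: ignores partial writes
                else:
                    sel.unregister(conn)
                    conn.close()
            handled += 1
    sel.close()


def spawn_workers(listener, n_workers=None):
    """Fork one worker per CPU core; each child runs its own event loop."""
    n_workers = n_workers or os.cpu_count() or 1
    pids = []
    for _ in range(n_workers):
        pid = os.fork()
        if pid == 0:
            try:
                event_loop(listener)
            finally:
                os._exit(0)  # never fall back into the parent's code path
        pids.append(pid)
    return pids
```

In this sketch, a parent would call make_listener() once and then spawn_workers(listener), so the process count tracks the number of cores rather than the number of connected clients. Per-connection state is just a socket and a selector registration, which is where the memory savings over a process per connection come from.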
Since deploying stud over a month ago, memory usage on our load balancing machines is down tenfold. Load is down 80%, and TLS handshake times are back under 20ms consistently.
Here's the load on one of our TLS termination servers running stud, handling over 3,000 secure connections (including ~50 handshakes a second):
someserver:~$ netstat -na | grep ':2000' | grep ESTAB | wc -l
3337
someserver:~$ cat /proc/loadavg
0.67 0.66 0.61 3/415 18476
someserver:~$ cat /proc/cpuinfo | grep processor | wc -l
16
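To put those numbers in perspective, it helps to divide the load average by the number of cores. A small sketch, using a hypothetical helper name, that takes the first three fields of a /proc/loadavg line (the 1-, 5-, and 15-minute averages):

```python
def load_per_core(loadavg_line, cores):
    """Return the 1-, 5-, and 15-minute load averages divided by core count."""
    one, five, fifteen = (float(x) for x in loadavg_line.split()[:3])
    return one / cores, five / cores, fifteen / cores


# Values copied from the session above: the loadavg line and 16 processors.
print(load_per_core("0.67 0.66 0.61 3/415 18476", 16))
```

Each of the three values comes out well under 0.05 per core, so the machine is close to idle while holding those 3,337 connections.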
We decided to open source it in case it helps anyone else out there keep the zombies at bay--it's a particularly natural partner for haproxy, which is how we run it in production. Feel free to drop me a line with any feedback, file bugs on GitHub, or submit pull requests--we'd love improvements!
Want to work with Jamie and the rest of the Bump team? We are hiring: http://bu.mp/jobs.