Today I got a call from a friend running Storm in production:

‘Yo I restarted the supervisor process, can’t get the topology to pick up the new Kafka node while reading it. The producers are working fine, but there are no consuming nodes yet.’

First, there is a huge mismatch of expectations there. My friend didn’t know that you never really need to restart the supervisor nodes (most of the time…). Second, when you deploy a Storm topology, Nimbus creates a .ser file which contains all of the environment (.properties files, log4j.xml, etc.). Then this file gets deployed to all supervisor nodes that are going to execute your code.
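
To make the implication concrete, here is a minimal sketch of what submission looks like, assuming the old backtype.storm API (newer releases use org.apache.storm); MySpout and MyBolt are hypothetical placeholders for your real components. Whatever you load into the Config at submit time is what Nimbus serializes and ships, so editing a properties file on a supervisor box afterwards changes nothing until you redeploy.

```java
import java.io.FileInputStream;
import java.util.Properties;

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        // Read the properties once, here, at submit time.
        Properties props = new Properties();
        props.load(new FileInputStream("app.properties"));

        TopologyBuilder builder = new TopologyBuilder();
        // MySpout / MyBolt are hypothetical stand-ins for your actual components.
        builder.setSpout("events", new MySpout(), 2);
        builder.setBolt("counter", new MyBolt(), 4).shuffleGrouping("events");

        Config conf = new Config();
        conf.put("app.kafka.brokers", props.getProperty("kafka.brokers"));
        conf.setNumWorkers(4);

        // Everything in 'conf' plus the topology jar is serialized by Nimbus
        // and distributed to the supervisor nodes; changing the file on a
        // supervisor later has no effect until the topology is redeployed.
        StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
    }
}
```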

The issue is that his intuition, just killing the process that reads the properties file and consumes from Kafka, comes from local development. It’s easy to simply kill -9 anything on your own machine and… bam! restart it… (minus the implication of possible data loss, given that read_from_start=true was set)
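
For reference, here is a rough sketch of the storm-kafka spout configuration that flag corresponds to. This assumes the old storm.kafka API; depending on the library version the field has been called forceFromStart or ignoreZkOffsets, so treat the exact name as an assumption and check your version.

```java
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

public class KafkaSpoutExample {
    // Builds a Kafka spout that re-reads the topic from the start on restart.
    public static KafkaSpout buildSpout() {
        ZkHosts hosts = new ZkHosts("zk1:2181,zk2:2181");   // ZooKeeper used by Kafka
        SpoutConfig cfg = new SpoutConfig(hosts,
                                          "events",          // topic
                                          "/kafka-spout",    // ZK root for storing offsets
                                          "my-topology");    // consumer id
        cfg.forceFromStart = true;  // a.k.a. ignoreZkOffsets in later storm-kafka versions

        // kill -9 the worker and restart it, and with this flag set you reprocess
        // the whole topic: cheap on a laptop, a very different story in production.
        return new KafkaSpout(cfg);
    }
}
```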

Obviously running a distributed computation is not trivial.

Here are a few reasons:

  • Restarting a process that has interacted with external state (very typical) may not produce the same results when you need to replay records (see the first sketch after this list).

    In fact, you can’t rely on the results being the same in the presence of arbitrary external state. Period.

    There is huge value in records being processed quickly and correctly, with enough information for debugging. That’s the main reason for using a stream processor: timely and correct processing.

    Interacting with external state is hard. supa’ hard.

  • Stream processors usually provide very different tools for reasoning about distributed work than the ones you have for processing work locally:

    • Exceptions are hard to reason about in a distributed system: do you restart the entire topology, do you restart the node, do you just crash and let the containerizer clean up your resources and reschedule your worker somewhere else in the cluster, losing data locality? Etc. The list is long.

    • Restarting remote processes has very different semantics than restarting local ones. If the system has no way of detecting failure and replaying records, then you just lost data (the second sketch after this list shows Storm’s anchoring and acking approach to this). If the error propagates upstream and the source of data cannot replay it, you are f**ed. You just lost data, which usually translates to real money.

    • Deploying code in common streaming systems is really awkward. First, you have to plan for downtime. The reason is that you usually can’t deploy code piecemeal; you need to deploy it in big chunks (DAGs) that contain the whole computational topology.

      The issue with this monolithic deployment is fairly obvious. Topologies are mostly long-lived things: their uptime is months at a fast-moving startup, and literally half a year at bigger companies. This means that deploying a single component might introduce a bug in other components, so deployment is a nightmare.

      Also, think about this bastard proposition. First you have to stop making money: pause your topology and leave those CPUs idle. Then you upload new code, cross your fingers that another team didn’t introduce a bug, and hope that your unit, regression, and integration tests caught all the bugs. You then wait about 5 to 10 minutes for the topology to settle (i.e., caches are hot, processing is humming along, no stack traces are registered, Nagios hasn’t called you, your dashboard numbers are updating, etc.). If this process fails, you have to roll back and cross your fingers again. Not to mention that your team must keep all deployed binaries properly versioned and in a repository that is easy to get to, which is not the case for most startups.

    • I’ll stop at this note, but say something went wrong: where do you start? Do you go into ZooKeeper and make sure that your topic offsets are being updated (in the case of the Kafka driver for Storm)? Do you restart the process and increase the log level (assuming no JMX hooks)? On a real note, debugging distributed systems is usually all about printf and tcpdump. Attaching a debugger to a running process is scary: press the up arrow one more time than you intended and you’ve just brought down part of the cluster accidentally. Don’t attach a debugger and you have almost no visibility into the process. JMX metrics help, the Linux /proc filesystem helps, sending signals to processes helps, but some things are just logic errors on very specific code paths, and those are impossible to debug without the old printf, pen & paper.
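
On the external-state point above, here is a minimal sketch of one defensive pattern: make the write idempotent, keyed by something deterministic in the tuple, so a replayed record becomes a no-op. ExternalStore and its methods are hypothetical stand-ins for whatever system you actually write to, and the field names are illustrative.

```java
import java.util.Map;

import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

public class IdempotentWriterBolt extends BaseBasicBolt {
    private transient ExternalStore store;  // hypothetical client for the external system

    @Override
    public void prepare(Map conf, TopologyContext context) {
        // Connect on the worker, not at submit time; the bolt itself is serialized.
        store = ExternalStore.connect((String) conf.get("app.store.url"));
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String eventId = tuple.getStringByField("event_id");
        // putIfAbsent-style write: a replayed tuple with the same id is a no-op,
        // so reprocessing after a crash does not double-count.
        store.writeIfAbsent(eventId, tuple.getStringByField("payload"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt, no output stream.
    }
}
```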

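And on failure detection and replay: Storm’s reliability API is anchoring plus explicit ack/fail. The sketch below (again using the old backtype.storm API, with illustrative field names) anchors emitted tuples to their input and fails the input on error, so a reliable spout such as the Kafka spout can replay it instead of silently dropping data.

```java
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ReliableEnrichBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String enriched = input.getStringByField("payload").toUpperCase();
            collector.emit(input, new Values(enriched)); // anchored to the input tuple
            collector.ack(input);                        // mark it fully processed
        } catch (Exception e) {
            collector.fail(input); // tell the spout to replay this tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("enriched"));
    }
}
```
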
Luckily, the solution was trivial: restart the topology from Nimbus (the Storm master node) et voilà. New configs serialized and deployed.

Stay tuned for updates as we make progress on our attempt at solving some of the issues mentioned here.