Does your project need distributed Erlang?

You deployed your brand new Phoenix web server, or a data processing pipeline built with Broadway, or any other piece of infrastructure running on the BEAM VM, and it's all working just great. The global interpreter lock and threads corrupting each other's memory are but a faint memory, API request latencies are nice and predictable, and production bottlenecks are a breeze to identify and fix. What else could one want at this point? But what about Erlang clustering, with nodes communicating with each other? Do you even need that, or is it a leftover from the early days of Erlang? Is there any value in it for a regular project, especially one running in Kubernetes containers?

I would argue that distributed Erlang provides massive value even for simple projects and unlocks design approaches that are not feasible on most other platforms. A couple of examples below demonstrate the point.

Feature Flags

Feature flags are a handy technique that enables a low-risk approach to deploying new features. For example, you integrated a new payment provider and want to enable it for 1% of users to test it out, while the other 99% keep using the old and trusty payment provider. If everything works well after a couple of days, you move everyone onto the new payment provider by changing the 1% to 100%, and then remove the feature flag logic with the next deploy. Alternatively, if something goes wrong, you simply disable the feature flag and are back to the old behaviour instantly.

Feature flag libraries are available for most languages, including Elixir (fun_with_flags). At the application code level, feature flag logic is a conditional statement:

# Enable the flag for ~5% of checks, sampled over time
FunWithFlags.enable(:alternative_implementation, for_percentage_of: {:time, 0.05})

def foo(bar) do
  if FunWithFlags.enabled?(:alternative_implementation) do
    # new code path, e.g. the new payment provider (hypothetical function names)
    new_implementation(bar)
  else
    # old, trusted code path
    old_implementation(bar)
  end
end
The configuration data for feature flags is persisted in Redis or a database, and this is where it becomes interesting. Every time a feature flag is checked, the data has to be retrieved from storage. This can happen millions or billions of times per day, each check costing a network round trip and extra load on the db engine. The vast majority of those checks yield an identical result, which only changes when you toggle the flag manually.
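Toggling is a one-line call using the library's enable/disable functions (shown here with the same :alternative_implementation flag as above); the new value takes effect without a deploy:

```elixir
# Turn the flag fully on: subsequent enabled?/1 checks return true.
FunWithFlags.enable(:alternative_implementation)

# Something went wrong? Turn it off and you are back to the old code path.
FunWithFlags.disable(:alternative_implementation)
```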


All those round trips add up to quite a bit of waste: 10 million checks at 1 ms round-trip time is almost 3 hours of accumulated delay. And there is no easy way around it in, say, the Ruby or Java world. With a distributed Erlang setup, however, we can cache feature flags locally on each cluster node and update the cached values automatically whenever a flag is changed on any node. We go from roughly 3 hours of accumulated wait time for all those remote calls to something like 150 ms for 10 million local memory checks.
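The mechanics can be sketched with a small GenServer that keeps flag values in a local ETS table and broadcasts every change to the same process on all connected nodes. Module and flag names below are illustrative, and fun_with_flags itself already ships an ETS cache with cache-busting notifications, so this is only to show the idea:

```elixir
defmodule FlagCache do
  use GenServer

  @table :flag_cache

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  # Reads are plain local-memory ETS lookups: no network round trip.
  def enabled?(flag) do
    case :ets.lookup(@table, flag) do
      [{^flag, value}] -> value
      [] -> false
    end
  end

  # A toggle is broadcast to the cache process on this node and on every
  # connected node in the Erlang cluster.
  def set(flag, value) do
    GenServer.abcast([node() | Node.list()], __MODULE__, {:set, flag, value})
    :ok
  end

  @impl true
  def init(nil) do
    # Owned by this process; other local processes get read access.
    :ets.new(@table, [:named_table, :protected, read_concurrency: true])
    {:ok, nil}
  end

  @impl true
  def handle_cast({:set, flag, value}, state) do
    :ets.insert(@table, {flag, value})
    {:noreply, state}
  end
end
```

Note that `set/2` is asynchronous: each node applies the update when the cast arrives, which is exactly the eventual-consistency trade-off you accept for removing the per-check round trip.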

Stateful Services

I recently worked on an optimization for a poker tournament service. The initial implementation would read the db data for all players on each hand finish event, compile a list of players and their stack sizes, and then broadcast a message to all players' clients. For a decent-sized tournament with 1,000 players, that means a db fetch of 100-200 or more records (one per poker table, each ~5-10 kilobytes of JSON data). A hand of online poker typically lasts about a minute, which means we were doing this massive db fetch several times per second, at least during the initial stages of a large tournament.

I was aware that the ranking algorithm had been done in a rush to push it out the door, and made a mental note at the time to revisit it later. It was then part of a Ruby-based service, and I recently ported it to Elixir: for every running tournament we start a GenServer, and every hand finish event sends a message to this process, allowing it to maintain an up-to-date ranking table for the tournament. The db is read only once, on the first hand finish; after that it's pure in-memory map manipulation. The result is a decent API latency improvement and savings on db load and network traffic.
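The per-tournament process looks roughly like this (module, function and field names are illustrative, and `load_stacks_from_db/1` is a placeholder for the real initial fetch):

```elixir
defmodule TournamentRanking do
  use GenServer

  def start_link(tournament_id) do
    GenServer.start_link(__MODULE__, tournament_id)
  end

  # Called by the game servers on every hand finish, with the new stack
  # sizes of the players involved in the hand.
  def hand_finished(pid, stack_updates) do
    GenServer.cast(pid, {:hand_finished, stack_updates})
  end

  def ranking(pid), do: GenServer.call(pid, :ranking)

  @impl true
  def init(tournament_id) do
    # Defer the db read so start_link returns immediately.
    {:ok, %{id: tournament_id, stacks: %{}}, {:continue, :load}}
  end

  @impl true
  def handle_continue(:load, state) do
    # The one and only db read in this process's lifetime.
    {:noreply, %{state | stacks: load_stacks_from_db(state.id)}}
  end

  @impl true
  def handle_cast({:hand_finished, stack_updates}, state) do
    # Incremental in-memory update; no db access after the initial load.
    {:noreply, update_in(state.stacks, &Map.merge(&1, stack_updates))}
  end

  @impl true
  def handle_call(:ranking, _from, state) do
    # Largest stack first.
    {:reply, Enum.sort_by(state.stacks, fn {_player, stack} -> -stack end), state}
  end

  # Placeholder: the real implementation fetches all per-table records once.
  defp load_stacks_from_db(_tournament_id), do: %{}
end
```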

As I was working on this, I couldn't help but wonder why we didn't do the efficient implementation from the start, since the incremental ranking update idea is quite obvious. But then I imagined the extra hoops I would have to jump through to implement this in a Ruby/k8s world: a separate stateful service to own the ranking data, its own k8s pods, replica set and service definition, plus routing so that every hand finish event for a given tournament reaches the right instance.

This is a lot of boilerplate just to get to a less optimal version of our stateful Elixir service (the Ruby service would still have to read state from the db on every request, just not nearly as often as the initial implementation did). And the additional k8s pods, replica set and service (at the very minimum) add to the overall complexity of the kubernetes cluster and the operational burden of the environment. All of this probably explains why the developer of the initial implementation opted not to go this route and chose the least efficient (but fastest to deliver) implementation.

Instead, the tournament ranking service is now tucked away in the Erlang cluster; we don't need to be actively aware of it on a daily basis. We use libcluster to automatically maintain cluster membership and horde to supervise these GenServer processes. If a node goes down (e.g. during a deployment), its GenServer processes are automatically restarted on another node, their state is initialized from the database and then kept up to date by incremental updates. When a tournament finishes, the code asks Horde's DynamicSupervisor to terminate the ranking process for the tournament, and that's the end of its life cycle.
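The wiring for this can be sketched as follows. This is a hedged outline, not the production code: it assumes horde and libcluster as dependencies, a `topologies` configuration appropriate for your environment, and made-up names like `Poker.Registry`; it also assumes the ranking process registers itself in the Horde registry under its tournament id.

```elixir
# In the application supervision tree: cluster formation, a cluster-wide
# registry, and a cluster-wide dynamic supervisor.
children = [
  {Cluster.Supervisor, [topologies, [name: Poker.ClusterSupervisor]]},
  {Horde.Registry, name: Poker.Registry, keys: :unique, members: :auto},
  {Horde.DynamicSupervisor,
   name: Poker.DistributedSupervisor, strategy: :one_for_one, members: :auto}
]

# When a tournament starts, place its ranking process somewhere in the cluster;
# Horde picks a node and restarts the process elsewhere if that node dies.
{:ok, _pid} =
  Horde.DynamicSupervisor.start_child(
    Poker.DistributedSupervisor,
    {TournamentRanking, tournament_id}
  )

# When the tournament finishes, look the process up and terminate it.
[{pid, _value}] = Horde.Registry.lookup(Poker.Registry, tournament_id)
:ok = Horde.DynamicSupervisor.terminate_child(Poker.DistributedSupervisor, pid)
```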

This is a nice example of how distributed Erlang leads to simpler and more robust design patterns. The Dashbit blog has a really nice post expanding on this theme.