
Somewhere Right Now, a System is Failing (and how Erlang can help)

Huh! Phone companies: if they were living things, don’t you sometimes wish you could hold their heads underwater until the bubbles stop? Yeah? Me too. But as much as we hate our phone companies, one basic truth stands out and must be recognised: when you pick up your phone to make a call (to, say, Saul), it normally just works. Hmm, these telecom folks must be doing something right!

Their systems run for years without interruption, have stringent quality and reliability requirements, and are upgraded in place without being taken offline or even losing state. They tolerate both hardware failures and software errors, handle a huge number of concurrent activities, are distributed, and have soft real-time properties. When these systems were first built, even Tim Berners-Lee was still dreaming up the web.

They achieved all the above by designing their systems using Erlang (the programming language, not that dimensionless unit used in telephony).

The Erlang runtime system is designed for systems with the following traits:

  • Distributed
  • Fault-tolerant
  • Soft real-time
  • Highly available, non-stop applications
  • Hot swapping, where code can be changed without stopping a system

Source: Wikipedia

Every developer’s dream, right? No? What are you building?

I’m among the lucky few (thousand?) who get to work with Erlang professionally. Many people just hear about it, google it, read a paragraph or two and go “Ah, it’s purely functional and for telecom. We’re building web applications (or whatever cool stuff). I’m out of here.” Click! Click! Tabs closed. End of story. But Erlang is more than you think and not just for telecom.

Of course, this is not to say that Erlang (or its progeny Elixir) is a magic bullet (even though technically it is, but you know what I mean). There is simply no one true way to write code, but what I like most about Erlang is its error-handling mechanism, which I’m going to try to explain. Erlang was designed for writing fault-tolerant systems, so error handling is probably the strongest thing in the language. Once you understand it, you’ll understand a lot about why things in Erlang are the way they are.

As you probably already know, our community is divided about how to handle faults. One camp says we need to make systems fault-tolerant by catching exceptions, checking error codes, and generally keeping faults from turning into errors. The other camp, where Erlang belongs, says it’s futile to try to anticipate every fault that way. No matter what faults you try to catch and recover from, something unexpected will always happen. Therefore, this camp says “let it crash” so you can restart from a known good state. In most other programming languages, this sounds very scary to even think about, let alone try.

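Restarting usually works because of the nature of the bugs encountered in production systems, which fall into two kinds: ‘Bohrbugs’ and ‘Heisenbugs’. A Bohrbug is repeatable and tends to get caught in testing. A Heisenbug is transient and hard to reproduce, so it is the kind more likely to survive into production, and a restart from a known good state often makes it go away.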

A sequential language has only one process, so if that process crashes, you’re in deep trouble. So you have to take extraordinary measures to make sure that the process doesn’t crash (defensive programming). This leads to lots of unnecessary code to handle cases which in practice don’t occur often. However, we don’t do it that way in Erlang. We build things using large numbers of Erlang processes. We’re not so concerned if the individual processes crash. The basic assumption is that if a process crashes, we let some other processes fix up the error.

Principle of remote error handling

Before we look at the innards of Erlang, let’s first look at the principle of remote error handling, because it’s essential to some of the decisions made in Erlang.

Imagine you’re to build a fault-tolerant system. The first thing you’ll notice is that you can’t build a fault-tolerant system with only one computer. In fact, if a client ever asks you to build a fault-tolerant system with only one computer, you can either calmly explain your arse off or quickly run away (possibly by jumping through a window, you know, for maximal effect).

Suppose you blindly agreed to use one computer and the entire computer crashes, not because of a programming error but because of a hardware failure (maybe Godzilla struck your office). You’re now lost and in deep shit. You simply can’t do fault-tolerant computations, at least not the ones Erlang was intended for, with one computer. You need at least two.

Two computers C1 and C2 observing each other

But suppose your client saw how miserably sad you looked while explaining and gave you two computers. Now you can arrange simple fault tolerance with the help of some kind of observation principle whereby C2 observes C1. If C1 crashes, C2 detects it, and you must arrange for C2 to take over whatever C1 was doing. Or, more symmetrically, you can have C1 and C2 observe each other, so if C1 crashes, C2 observes the error and takes corrective action, and vice versa.

The mechanism above should not differ depending on the number of computers we use to build our system. It should work the same way whether the processes involved run on one computer or are spread over thousands of them. The same mechanism should apply everywhere.

Errors in Erlang processes?

Consider a system with a large number of processes (represented as circles), like the one below.

Individual processes in a node

Suppose any of those processes crashes. Well, as illustrated above, nothing will happen. No other process in the system will know anything about the fact that a process just died. So if E dies, it will just die and nobody will know. Not cool, is it? As a result, the notion of a link was added to Erlang.
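To make the "nobody will know" point concrete, here is a minimal sketch. The module name unlinked_demo is made up for illustration, and the short sleep is just a crude way of giving the child time to die. It spawns a process without a link, lets it exit, and then checks whether the spawning process was told anything.

```erlang
-module(unlinked_demo).   % hypothetical module name, for illustration only
-export([run/0]).

%% Spawn a process WITHOUT a link, let it die, and check whether the
%% spawning process hears anything about it.
run() ->
    Pid = spawn(fun() -> exit(boom) end),
    timer:sleep(100),                      % crude: give the child time to die
    Alive = is_process_alive(Pid),
    %% Look for any exit or down notification about Pid in our mailbox.
    Notified = receive
                   {'EXIT', Pid, _} -> true;
                   {'DOWN', _, process, Pid, _} -> true
               after 0 ->
                   false
               end,
    {Alive, Notified}.
```

Calling unlinked_demo:run() should give {false, false}: the child is gone, and no notification of any kind reached the process that spawned it.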

Linked processes within a node

Any two processes can be linked. A process links itself to another by calling the BIF (built-in function) link(Pid), so for A and B to be linked, either A calls link(B) or B calls link(A). If the two processes A and B are linked and A terminates for any reason, an exit signal (the error signal) will be sent to B, and the other way around. When a normal process receives such a signal, it will terminate if the exit reason is not normal. When it terminates, it also broadcasts an exit signal to its link set, that is, the set of processes that are linked to it.
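Here is a small sketch of that propagation. The module name link_demo and the stop message are made up for illustration, and the sleeps are a crude stand-in for proper synchronisation. Process B links itself to A by calling link/1 from inside B, A is then told to exit with an abnormal reason, and we check that B went down with it.

```erlang
-module(link_demo).   % hypothetical module name, for illustration only
-export([run/0]).

run() ->
    %% A waits for a 'stop' message and then exits abnormally.
    A = spawn(fun() -> receive stop -> exit(crashed) end end),
    %% B links itself to A and then simply waits forever.
    B = spawn(fun() -> link(A), receive after infinity -> ok end end),
    timer:sleep(50),      % crude: let B set up the link
    A ! stop,             % A terminates with reason 'crashed'
    timer:sleep(50),      % crude: let the exit signal reach B
    {is_process_alive(A), is_process_alive(B)}.
```

The calling process is linked to neither A nor B, so it survives and should see {false, false}: B never did anything wrong, but it died because its linked partner did.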

Links are shown using dotted lines, and their purpose is basically to define an error propagation path. This error propagation works not only between processes on one machine but across node boundaries too.

Processes linked across two physically separate nodes

Now we’ve got a system of processes on two physically separated nodes. Node1 might be here in Kampala and Node2 on the other side of the ocean. It doesn’t matter where they are, the link mechanism works the same. Once the processes are linked together as shown, if an error occurs in B, it will be propagated to W and if an error occurs in X, D will be notified.

As mentioned earlier, the normal behaviour of a process, if it gets one of these error signals, is to also die. What happens in a system linked like the above is that if any one of the linked processes dies, then all of them will die. That sounds stupid, I know, and we can stop that by using system processes.

System processes

A double ring around a process, like Z above, shows that it’s a system process.

A system process doesn’t die by default if it receives an error signal. The signal is instead converted into a message of the form {'EXIT', Pid, Why}, where Pid is the identity of the process that terminated and Why is the reason for termination (the exit reason). Why will be the atom normal if the process terminated without an error; otherwise, it describes the error. You can turn a normal process into a system process by evaluating the BIF process_flag(trap_exit, true).
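As a rough illustration, the sketch below turns the calling process into a system process, starts a linked child that exits with the reason oops, and receives the resulting message. The module name trap_demo is made up. spawn_link/1 is a standard BIF that spawns and links in one step, and the one-second timeout is an arbitrary choice.

```erlang
-module(trap_demo).   % hypothetical module name, for illustration only
-export([run/0]).

run() ->
    %% Become a system process; process_flag/2 returns the old setting.
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> exit(oops) end),
    Result = receive
                 %% The exit signal arrives as an ordinary message.
                 {'EXIT', Pid, Why} -> {child_exited, Why}
             after 1000 ->
                 timeout
             end,
    process_flag(trap_exit, Old),   % restore the previous setting
    Result.
```

trap_demo:run() should return {child_exited, oops}. Without the process_flag call, the same exit signal would have killed the caller instead.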

This now gives us a hint on how we’re going to build fault-tolerant systems. We do this by having a large collection of processes that are linked together and we make some of them system processes which can handle the errors.

We essentially need just two BIFs to build a fault-tolerant system: link(Pid), which links the calling process to the process Pid, and process_flag(trap_exit, true), which turns a normal process into a system process. You can also use monitors. A monitor is very similar to a link but is one-directional. That’s pretty much all that you need, and these can be used in layers to build fault-tolerant systems. This is a large part of the famous OTP middleware Erlang is known for.
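For comparison with links, here is a minimal monitor sketch. The module name monitor_demo is made up. It uses the standard BIF spawn_monitor/1, which spawns a process and monitors it in one step, so the child cannot die before the monitor is in place. The watcher does not need to trap exits and is not killed when the child dies; it simply gets a 'DOWN' message.

```erlang
-module(monitor_demo).   % hypothetical module name, for illustration only
-export([run/0]).

run() ->
    %% Spawn and monitor atomically; Ref identifies this monitor.
    {Pid, Ref} = spawn_monitor(fun() -> exit(oops) end),
    receive
        %% One-directional: we hear about the child, never the reverse.
        {'DOWN', Ref, process, Pid, Reason} -> {down, Reason}
    after 1000 ->
        timeout
    end.
```

monitor_demo:run() should return {down, oops}. Had the roles been reversed and the watcher died first, the child would not have been told anything.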

OTP Supervision tree

The earlier demonstrations were just graphical representations and show no structural relationship between the different processes. They just define the error propagation paths. What happens in an OTP system is that we structure processes into hierarchical supervision trees, yielding fault-tolerant structures that isolate failure and facilitate recovery.

Circles with double rings are supervisor processes and the rest are workers.

Supervisors are processes whose only task is to monitor and manage their children. They spawn processes and link themselves to these processes. They trap exits and receive exit signals, allowing them to take appropriate actions when something unexpected occurs. The actions vary from restarting a child to not restarting it, terminating some or all the children that are linked to the supervisor or even terminating itself.
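To give a feel for this, here is a minimal sketch of a one-child supervision tree using OTP’s supervisor behaviour. The module name sup_demo, the registered name demo_worker and the restart limits are all made up for illustration, and the worker is a bare process rather than a proper OTP behaviour, purely to keep the example short.

```erlang
-module(sup_demo).   % hypothetical module name, for illustration only
-behaviour(supervisor).
-export([run/0, start_worker/0, init/1]).

%% Called by the supervisor to start (and later restart) the worker.
%% spawn_link/1 links the worker to the supervisor process.
start_worker() ->
    Pid = spawn_link(fun loop/0),
    register(demo_worker, Pid),
    {ok, Pid}.

loop() ->
    receive stop -> ok end.

%% Supervisor callback: one permanent child, restarted individually.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 3, period => 5},
    Child = #{id => demo_worker,
              start => {?MODULE, start_worker, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

run() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    First = whereis(demo_worker),
    exit(First, kill),                 % simulate a crash of the worker
    timer:sleep(100),                  % crude: let the supervisor restart it
    Second = whereis(demo_worker),
    Restarted = is_pid(Second) andalso Second =/= First,
    %% Clean up: detach from the supervisor, stop it, wait for it to go.
    Ref = erlang:monitor(process, Sup),
    unlink(Sup),
    exit(Sup, shutdown),
    receive {'DOWN', Ref, process, Sup, _} -> ok after 1000 -> ok end,
    Restarted.
```

sup_demo:run() should return true: the worker was killed, and a fresh one with a different pid took its place without the caller doing anything. Expect a supervisor report in the log about the terminated child. The intensity and period values say the supervisor gives up and terminates itself if more than 3 restarts happen within 5 seconds.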

Note that child processes can be both workers and supervisors.

Conclusion

In Erlang, we don’t care that much about making sure that the individual processes are always alive. We let them crash, allow other processes to detect those errors, and then try to correct them. And I don’t think we’re ever going to write programs in such a way that they’ll never crash. We just have to live with the fact that processes are going to crash, and then we have to detect that remotely and try to correct it. The reason for the remoteness is the earlier argument that if an entire computer crashes, the only way to correct that is from a different computer. This is fundamental to Erlang and is something very strange to most people.

Whereas other programming languages say you should make a Herculean effort to ensure that your programs don’t crash, the philosophy in Erlang is that if something unexpected occurs, you just crash immediately. We assume that things are always going to crash, and therefore we give some other processes the responsibility of fixing up those crashes.

That’s it for now, and I hope you now have some insight into Erlang. A future blog post may go into depth on how the Erlang runtime system comes with a convenient distribution mechanism that was designed into it from the start, not bolted on as an afterthought.

References

  1. The Zen of Erlang
  2. Official Erlang Website
  3. Programming Erlang (2nd edition)
  4. Learn You Some Erlang for great good!
  5. Making Reliable Distributed Systems in The Presence of Software Errors