In my previous blog post titled Somewhere Right Now, a System is Failing (and how Erlang can help), I wrote about how Erlang approaches error handling and fault tolerance compared to most other languages. At the end of it, I mentioned that a future blog post could go into depth on how the Erlang runtime system comes with a convenient distribution mechanism designed into it from the start. Well, that future is here. But unfortunately, I may have lied about going into depth, as distributed programming is not for the faint of heart. As Fred Hebert put it:
See, distributed programming is like being left alone in the dark, with monsters everywhere. It’s scary, and you don’t know what to do or what’s coming at you. Bad news: distributed Erlang is still leaving you alone in the dark to fight the scary monsters. It won’t do any of that kind of hard work for you. Good news: instead of being alone with nothing but pocket change and a poor sense of aim (making it harder to kill the monsters), Erlang gives you a flashlight, a machete, and a pretty kick-ass moustache to feel more confident (this also applies to female readers).
The term Distributed here refers to how systems are clustered together and interact with each other. A distributed Erlang system consists of several Erlang runtime systems (nodes) communicating with each other. Message passing between processes on different nodes, as well as links and monitors, is transparent when PIDs are used. Registered names, however, are local to each node. This means the node must be specified as well when sending messages using registered names.
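To make that concrete, here is a minimal sketch; the registered name job_queue and the node name worker@host are made up purely for illustration:

%% Assuming Pid is bound to a process living on another node,
%% a PID works transparently:
Pid ! {work, 42}.

%% A registered name alone is only looked up on the local node, so to reach
%% a name registered on another node, pair the name with that node:
{job_queue, 'worker@host'} ! {work, 42}.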
This is a demo-driven blog post, so you need a working Erlang installation on your system to follow along. Use a package manager to install it (or install it however you like; this is a free country, after all). Omitting installation steps allows me to pack in more useful examples and discussion of concepts, which is what you really came for anyway, right?
Setting up an Erlang cluster
Erlang nodes in a cluster are given either short or long names to identify them. To use short or long names, start the Erlang VM with erl -sname short_name@host or erl -name long_name@some.domain respectively. Windows users should use werl instead of erl.
Let’s start two nodes on the same host (open two terminal windows/tabs).
$ erl -sname dog
...<snip>...
(dog@localhost)1>
$ erl -sname cat
...<snip>...
(cat@localhost)1>
I chose to start the two nodes on the same host and, therefore, they're both using the default cookie in the ~/.erlang.cookie file. You can start them on different machines on the same LAN using the long name flag -name, as in erl -name dog -setcookie abc (where abc is the cookie string), or even on different hosts on the internet, this time using erl -name dog -setcookie abc -kernel inet_dist_listen_min Min inet_dist_listen_max Max (where Min and Max define a range of ports to be used for distributed Erlang).
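Cookies can also be inspected and changed at runtime instead of via flags; a quick sketch (the remote node name below is only a placeholder):

erlang:get_cookie().                          %% the cookie this node is currently using
erlang:set_cookie(node(), abc).               %% switch the local node to cookie abc
erlang:set_cookie('cat@far.away.host', abc).  %% use abc only when talking to that node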
How about connecting the cat to the dog (maybe to form an alliance)?
(dog@localhost)1> net_kernel:connect_node(cat@localhost).
true
This should return true as shown (meaning you're in distributed mode, baby). A false means the nodes can't connect: either you have a local network problem, or you mistyped something and are trying to connect to a non-existent node.
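As an aside, net_adm:ping/1 is another common way to connect nodes; a quick sketch using the same node name:

net_adm:ping('cat@localhost').   %% returns pong if the connection succeeds, pang otherwise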
To see the list of connected nodes, use the BIFs node() and nodes().
(dog@localhost)2> [node()|nodes()].
[dog@localhost,cat@localhost]
Let’s just register the two Erlang shell processes with the node names.
(dog@localhost)3> register(dog, self()).
true
(cat@localhost)1> register(cat, self()).
true
Let the cat send a “Hello!” message to the dog.
(cat@localhost)2> {dog, dog@localhost} ! {"Hello!", from, self()}.
{"Hello!",from,<0.90.0>}
The dog looks for “Hello!” messages in its mailbox and replies with “Oh, hi!”.
(dog@localhost)4> receive {"Hello!", from, CatShell} -> CatShell ! "Oh, hi!" end.
"Oh, hi!"
To confirm that the cat got a reply, flush the contents of its mailbox.
(cat@localhost)3> flush().
Shell got "Oh, hi!"
ok
You can add as many nodes to the cluster as you like and send any Erlang data structure transparently from one node to another. That’s basically it for Distributed Erlang but let’s take this a tad further, shall we?
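Beyond plain message passing, you can also spawn processes on another node, or call a function there and get the result back, once the cluster is connected. A small sketch using the demo nodes (the printed text is just illustrative):

%% Spawn a process directly on the cat node; the returned pid is remote
%% but can be messaged, linked, and monitored like any local pid.
Pid = spawn('cat@localhost', fun() -> io:format("running on ~p~n", [node()]) end).

%% Run a function on the remote node and get the result back here.
rpc:call('cat@localhost', erlang, memory, [total]).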
Monitors and links
The concept of links and monitors still holds even when processes are on nodes running on physically separate machines. The dog can, for example, link to and monitor the cat:
(dog@localhost)5> process_flag(trap_exit, true).
false
(dog@localhost)6> link(CatShell).
true
(dog@localhost)7> erlang:monitor(process, CatShell).
#Ref<0.3668919864.3736862723.117013>
(dog@localhost)8> erlang:monitor_node('cat@localhost', true).
true
(dog@localhost)9> flush().
ok
On lines 5 and 6, we make the dog shell process trap exits and link it to the cat process, whose PID was bound to the CatShell variable back on line 4. Then, we set up the dog to monitor the cat process on line 7, as well as monitor the node 'cat@localhost' on line 8. Finally, we flush the dog's mailbox on line 9 just to make sure there are no messages in it at that very moment.
Now, go to the cat@localhost node and kill it:
(cat@localhost)4> halt().
The dog shell process should receive 'EXIT', 'DOWN', and nodedown messages for the link and the two monitors.
(dog@localhost)10> flush().
Shell got {'EXIT',<9385.90.0>,noconnection}
Shell got {'DOWN',#Ref<0.3668919864.3736862723.117013>,process,<9385.90.0>,noconnection}
Shell got {nodedown,cat@localhost}
ok
This same principle of detecting faults is used when architecting systems with Erlang/OTP. In the simple demo above, no corrective action was taken by the dog after learning of the cat's death, but it could decide to get a new friend (launch a new 'cat@localhost' node), transfer the cat process or application to an existing redundant node, and so on.
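For a feel of what such corrective action could look like, here is a minimal, hypothetical watcher module; cat_watcher and corrective_action/1 are names I've made up, and the actual reaction is just stubbed out:

-module(cat_watcher).
-export([start/1]).

%% Watch a node and react when it goes down.
start(Node) ->
    spawn(fun() -> init(Node) end).

init(Node) ->
    erlang:monitor_node(Node, true),
    loop(Node).

loop(Node) ->
    receive
        {nodedown, Node} ->
            io:format("~p went down, reacting~n", [Node]),
            corrective_action(Node);
        _Other ->
            loop(Node)
    end.

%% Application-specific: start a replacement node, migrate the work to a
%% redundant node, page a human, and so on.
corrective_action(_Node) ->
    ok.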
Distributed OTP Applications
OTP lets us define takeover and failover mechanisms, a convenient way to migrate applications across nodes. It won't hold your hand all the way, but it can act as a stopgap until a more elaborate solution is put in place.
Distributed applications are managed by the distributed application controller. An instance of this process is found in the kernel supervision tree on every distributed Erlang node.
Failing and Taking Over
A failover means restarting an application somewhere other than where it went down (stopped running). The application runs on a main computer or server, and if that fails, it is moved to a backup one.
A takeover happens when a node with a higher priority comes back from the dead and decides to run the application again. This is achieved by gracefully shutting down the application on the backup node and starting it on the main one instead.
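The application callback module gets told which of these cases it is being started for: its start/2 function receives normal, {takeover, Node}, or {failover, Node} as the start type ({failover, Node} only shows up if the application defines start phases; otherwise a failover arrives as a normal start). Below is a rough sketch of what such a callback could look like; the module name job_center_app and the exact supervisor API are my assumptions, not taken from the repository:

-module(job_center_app).
-behaviour(application).
-export([start/2, stop/1]).

start(normal, _Args) ->
    job_center_sup:start_link();
start({takeover, OtherNode}, _Args) ->
    %% A higher-priority node (this one) is taking the application back.
    io:format("taking over from ~p~n", [OtherNode]),
    job_center_sup:start_link();
start({failover, OtherNode}, _Args) ->
    %% Only seen when the application defines start phases; otherwise a
    %% failover looks exactly like a normal start.
    io:format("failing over from ~p~n", [OtherNode]),
    job_center_sup:start_link().

stop(_State) ->
    ok.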
Imagine, if you will, a system with four nodes designed to run an application. Only the first node, JC1, is running the application; the rest are backup nodes in case JC1 dies. Let's also assume JC2 and JC3 have the same precedence, so only one of them can run the application at a time.
If JC1 fails (shown in red), for a brief moment nothing will be running. After a while, either JC2 or JC3 (JC3 in this case) realises this and takes over running the application. This is called a failover.
If JC3 fails, the application will be moved to JC2 since it has higher precedence than JC4.
If, for some reason, JC2 also fails, the application is then moved to JC4.
JC1 coming back up moves the application from JC4 back to JC1. This is called a takeover.
Because only showing you diagrams isn’t the most entertaining thing to do, and because pretending to run and distribute an application that doesn’t exist is even worse, I’ve extracted an application for us to use. The working and implementation details of the application are not important for this blog post. Just clone it.
Specifying Distributed Applications
Our cluster has four nodes: jc1@localhost, jc2@localhost, jc3@localhost, and jc4@localhost. All that's left is to create a configuration file that defines which nodes are the main ones and which are backups. The configuration is already included in the config/sys.config file in the repository, but let me show the distribution-specific part here.
{kernel, [
{distributed, [{job_center, 1000, [jc1@localhost, {jc2@localhost, jc3@localhost},
jc4@localhost]}]},
{sync_nodes_mandatory, [jc1@localhost]},
{sync_nodes_optional, [jc2@localhost, jc3@localhost, jc4@localhost]},
{sync_nodes_timeout, 15000}
]}
This just sets the distributed environment variable of the kernel application. The first number you see, 1000, is the time in milliseconds to wait for the node to come back up before the application is restarted elsewhere. The node precedence is defined as [jc1@localhost,{jc2@localhost,jc3@localhost},jc4@localhost], where jc1 has the highest precedence, followed by jc2 and jc3 (which share the same precedence), and should both jc2 and jc3 fail, that's when jc4 comes in.
The {sync_nodes_mandatory, NodeList} environment variable defines the nodes the distributed application controller must synchronise with; the system will start only if all of these nodes are started and connected within the timeout defined by the sync_nodes_timeout variable (15000 milliseconds here).
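If you ever want to double-check what a running node actually loaded from the config file, you can query the kernel application's environment from its shell; a quick sketch (the commented results are what the configuration above should produce):

application:get_env(kernel, distributed).
%% {ok,[{job_center,1000,
%%       [jc1@localhost,{jc2@localhost,jc3@localhost},jc4@localhost]}]}
application:get_env(kernel, sync_nodes_timeout).
%% {ok,15000}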
Did you clone the repository yet? No? Let’s clone and compile the code (hoping you at least installed Erlang).
$ git clone https://github.com/lgmfred/jc.git
$ cd jc
$ erl -make
Let's start a node, shall we? Pick any node you like apart from jc1 and wait for 15 seconds. I chose jc3.
$ erl -sname jc3@localhost -config config/sys -pa ebin/
Yikes! Joker strikes! Call Batman! This is pretty ugly, and definitely not the kind of error you would want your children to look at. The node waited 15 seconds (the time set in sync_nodes_timeout) for the mandatory node jc1 to come up and then crashed with that long, hairy, and complicated error.
Open four terminal tabs/windows and start the first three nodes as shown:
$ erl -sname jc1@localhost -config config/sys -pa ebin/
$ erl -sname jc2@localhost -config config/sys -pa ebin/
$ erl -sname jc3@localhost -config config/sys -pa ebin/
The nodes will again wait for 15 seconds, this time for the optional node jc4 to come up. If it doesn't, they will start regardless. After the three nodes above have all started, you can then start jc4 and see that it starts immediately without waiting, since the mandatory node jc1 is already up.
$ erl -sname jc4@localhost -config config/sys -pa ebin/
Now that all nodes have started, let's start the real application on each of them. Execute the following shell command in all four Erlang shells, starting with jc4, and pay attention to when the shell command returns.
(jc4@localhost)1> application:ensure_all_started(job_center).
(jc3@localhost)1> application:ensure_all_started(job_center).
(jc2@localhost)1> application:ensure_all_started(job_center).
(jc1@localhost)1> application:ensure_all_started(job_center).
You'll notice that the shell hangs on the other nodes, returning only when the job_center application is started on jc1, the node with the highest priority. You'll also notice from the shells that although the application is started on all four nodes, the job_center_sup supervisor is only started on jc1. If Erlang was correctly installed on a machine with a GUI, you can further verify this by running observer:start(). in each of the Erlang shells, then navigating to the Applications tab when the graphical interface pops up. Only jc1 will have both the kernel and job_center applications' supervision trees.
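If you don't have a GUI, a rough equivalent check can be done from any of the connected shells: ask every node in the cluster where job_center_sup is registered (a sketch, assuming the supervisor registers itself locally under that name, as is conventional):

[{N, rpc:call(N, erlang, whereis, [job_center_sup])} || N <- [node() | nodes()]].
%% Only the node actually running the application should report a pid;
%% all the others should report undefined.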
Failover
While keeping an eye on both jc2 and jc3, shut down jc1 by executing the halt() Erlang shell command.
(jc1@localhost)2> halt().
The application controller will wait 1000 milliseconds for jc1 to come back up; when it doesn't, you'll see a progress report for the job_center application being started on either jc2 or jc3 (they have the same precedence, remember?).
Depending on where it was started (either jc2 or jc3), shut down that very node too and watch the application start on the other.
Once again, shut down the node where the application started, and you'll see it start on jc4, the only running node remaining.
Takeover
Now that the application is running on jc4, go to the terminal tab/window that was running jc1 and restart the node (just recall the previous terminal command with the up arrow key and execute it). The node will wait 15 seconds for the non-mandatory nodes jc2 and jc3 to start. After the timeout, start the job_center application on jc1 just like before, and you'll immediately see a takeover from jc4, where the behaviours are terminated and the supervision tree is taken down.
Conclusion
The strategies discussed in this blog post are valid as long as you have redundant hardware: run an application on a 'main' computer or server, and should that fail, move it to a backup one.
Beware: most of the Distributed Erlang concepts I've discussed here assume that failures are due to hardware giving out, not to a netsplit. In the real world, netsplits occur more often than you might expect, and they will lead you into a world of misery as you try to resolve conflicts caused by the application running on both the main and backup nodes at once.