
A... B... C... Distributed Erlang

In my previous blog post, Somewhere Right Now, a System is Failing (and how Erlang can help), I wrote about how Erlang approaches error handling and fault tolerance compared to most other languages. At the end of it, I mentioned that a future blog post could go into depth on how the Erlang runtime system comes with a convenient distribution mechanism designed into it from the start. Well, that future is here. Unfortunately, I may have lied about going into depth, as distributed programming is not for the faint of heart. As Fred Hebert put it:

See, distributed programming is like being left alone in the dark, with monsters everywhere. It’s scary, and you don’t know what to do or what’s coming at you. Bad news: distributed Erlang is still leaving you alone in the dark to fight the scary monsters. It won’t do any of that kind of hard work for you. Good news: instead of being alone with nothing but pocket change and a poor sense of aim (making it harder to kill the monsters), Erlang gives you a flashlight, a machete, and a pretty kick-ass moustache to feel more confident (this also applies to female readers).

The term Distributed here refers to how systems are clustered together and interact with each other. A distributed Erlang system consists of several Erlang runtime systems (nodes) communicating with each other. Message passing between processes on different nodes is transparent when PIDs are used, and so are links and monitors. Registered names, however, are local to each node. This means that the node must be specified as well when sending messages using registered names.
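
To make that concrete, here is a tiny sketch of the two addressing styles (the names and nodes are placeholders, not part of the demo below):

%% With a PID, location is irrelevant; this works whether Pid lives on
%% this node or on another one:
Pid ! {hello, self()}.

%% A registered name alone is only looked up locally, so to reach a name
%% registered on another node you send to a {Name, Node} tuple:
{some_name, 'other@host'} ! {hello, self()}.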

This is a demo-driven blog post so you need a working Erlang installation on your system to follow along. Use a package manager to install (or however you like—this is a free country, after all). Omitting installation steps allows me to pack in more useful examples and discussion of concepts, which is what you really came for anyway, right?

Setting up an Erlang cluster

Erlang nodes in a cluster are all given either short or long names to identify them. To use short or long names, start the Erlang VM with erl -sname short_name@host or erl -name long_name@some.domain respectively. Note that nodes using short names cannot communicate with nodes using long names, so pick one style for the whole cluster.

Windows users should use werl instead of erl.

Let’s start two nodes on the same host (open two terminal windows/tabs).

$ erl -sname dog
...<snip>...
(dog@localhost)1>
$ erl -sname cat
...<snip>...
(cat@localhost)1>

I chose to start the two nodes on the same host, so they're both using the default cookie in the ~/.erlang.cookie file. You can start them on different machines on the same LAN using the long name flag -name, as in erl -name dog -setcookie abc (where abc is the cookie string), or even on different hosts on the internet, this time using erl -name dog -setcookie abc -kernel inet_dist_listen_min Min inet_dist_listen_max Max (where Min and Max define a range of ports to be used for distributed Erlang).
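
By the way, the cookie can also be inspected and changed from a running node with the BIFs erlang:get_cookie/0 and erlang:set_cookie/2. A quick sketch:

erlang:get_cookie().             %% returns the node's current magic cookie as an atom
erlang:set_cookie(node(), abc).  %% sets this node's cookie to abc; returns true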

How about connecting the cat to the dog (maybe to form an alliance)?

(dog@localhost)1> net_kernel:connect_node(cat@localhost).
true

This should return true as shown (meaning you’re in distributed mode, baby). A false means the nodes can’t connect. You either have a local network problem or you mistyped something and therefore are trying to connect to a non-existent node.
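
An equivalent, and arguably more common, way to poke a node into connecting is net_adm:ping/1:

net_adm:ping(cat@localhost).   %% pong if the connection succeeds, pang otherwise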

To see the list of connected nodes, use the BIFs node() and nodes().

(dog@localhost)2> [node()|nodes()].
[dog@localhost,cat@localhost]

Let’s register the two Erlang shell processes under the names dog and cat.

(dog@localhost)3> register(dog, self()).
true
(cat@localhost)1> register(cat, self()).
true

Let the cat send a “Hello” message to the dog.

(cat@localhost)2> {dog, dog@localhost} ! {"Hello!", from, self()}.
{"Hello!",from,<0.90.0>}

The dog looks for “Hello!” messages in the mailbox and replies with “Oh, hi!”.

(dog@localhost)4> receive {"Hello!", from, CatShell} -> CatShell ! "Oh, hi!" end.
"Oh, hi!"

To confirm that the cat got a reply, flush the content of its mailbox.

(cat@localhost)3> flush().
Shell got "Oh, hi!"
ok

You can add as many nodes to the cluster as you like and send any Erlang data structure transparently from one node to another. That’s basically it for Distributed Erlang but let’s take this a tad further, shall we?
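
Message passing isn’t the only thing that works across nodes. As a quick illustration (assuming the two nodes are still connected), rpc:call/4 evaluates a function on another node and returns the result, and spawn/4 starts a process directly on a remote node:

rpc:call(cat@localhost, erlang, node, []).            %% evaluates erlang:node() on the cat node
spawn(cat@localhost, io, format, ["Hi from afar~n"]). %% spawns a printing process on the cat node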

The concept of links and monitors still holds even when processes live on nodes running on physically separate machines. The dog can, for example, monitor the cat like this:

(dog@localhost)5> process_flag(trap_exit, true).
false
(dog@localhost)6> link(CatShell).
true
(dog@localhost)7> erlang:monitor(process, CatShell).
#Ref<0.3668919864.3736862723.117013>
(dog@localhost)8> erlang:monitor_node('cat@localhost', true).
true
(dog@localhost)9> flush().
ok

On lines 5 and 6, we make the dog shell process trap exits and link it to the cat process, whose PID was bound to the CatShell variable earlier on line 4. Then, we set up the dog to monitor the cat process (line 7) as well as the node 'cat@localhost' (line 8). Finally, we flush the dog’s mailbox (line 9) just to make sure there’s no message in there at that very moment.

Now, go to the cat@localhost node and kill it:

(cat@localhost)4> halt().

The dog shell process should receive 'EXIT', 'DOWN', and nodedown messages for the link and the two monitors.

(dog@localhost)10> flush().
Shell got {'EXIT',<9385.90.0>,noconnection}
Shell got {'DOWN',#Ref<0.3668919864.3736862723.117013>,process,<9385.90.0>,noconnection}
Shell got {nodedown,cat@localhost}
ok

This same principle of detecting faults is used when architecting systems with Erlang/OTP. In the simple demo above, the dog took no corrective action after learning of the cat’s death, but it could decide to get a new friend (launch a new 'cat@localhost' node), or transfer the cat process or application to an existing redundant node, and so on.
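
As a rough sketch of what such a corrective reaction could look like (the cat_watcher module and restart_cat/0 helper are hypothetical, not part of the demo):

-module(cat_watcher).
-export([watch_cat/0]).

%% Monitors the cat node and reacts when it goes down.
watch_cat() ->
    erlang:monitor_node(cat@localhost, true),
    receive
        {nodedown, cat@localhost} ->
            %% Corrective action: start a replacement node, respawn the cat
            %% process on a redundant node, alert a human, and so on.
            restart_cat(),
            watch_cat()
    end.

restart_cat() ->
    %% Hypothetical placeholder; a real system might start a new node here
    %% or notify whatever orchestrates the cluster.
    ok.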

Distributed OTP Applications

OTP lets us define takeover and failover mechanisms, a convenient way to migrate applications across nodes. It won’t hold your hand all the way, but it can act as a temporary solution until something more elaborate is put in place.

Distributed applications are managed by the distributed application controller. An instance of this process is found in the kernel supervision tree on every distributed Erlang node.

Failing and Taking Over

A failover means restarting an application somewhere other than where it went down (stopped running). The app is run on a main computer or server, and if it fails, it is moved to a backup one.

A takeover is when a node with higher priority comes back from the dead and decides to run the application again. This is achieved by gracefully shutting down the instance running on the backup node and starting it on the main one instead.
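
In OTP terms, the node an application moves to learns why it is being started through the StartType argument of the application callback. A minimal sketch (not taken from the repository we’ll use below; the module name is an assumption):

-module(job_center_app).
-behaviour(application).
-export([start/2, stop/1]).

start(normal, _Args) ->
    job_center_sup:start_link();
start({takeover, OtherNode}, _Args) ->
    %% This node outranks OtherNode and is taking the application back from it;
    %% any state handover from OtherNode would happen here.
    error_logger:info_msg("Taking over job_center from ~p~n", [OtherNode]),
    job_center_sup:start_link();
start({failover, OtherNode}, _Args) ->
    %% Only seen when the application declares start_phases; otherwise a
    %% failover arrives as a plain normal start.
    error_logger:info_msg("Failing over job_center from ~p~n", [OtherNode]),
    job_center_sup:start_link().

stop(_State) ->
    ok.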

Imagine, if you will, a system with four nodes designed to run an application. Only the first node, JC1, is running the application; the rest of the nodes are backups in case JC1 dies. Let’s also assume JC2 and JC3 have the same precedence, so only one of them can run the application at a time.

If JC1 fails, for a brief moment nothing will be running. After a while, either JC2 or JC3 (JC3 in this case) realises this and takes over running the application. This is called a failover.

If JC3 fails, the application will be moved to JC2 since it has higher precedence than JC4.

For some reason, JC2 also fails, and the application is now moved to JC4.

JC1 coming back up moves the application from JC4 back to JC1. This is called a takeover.

Because only showing you diagrams isn’t the most entertaining thing to do, and because pretending to run and distribute an application that doesn’t exist is even worse, I’ve extracted an application for us to use. The working and implementation details of the application are not important for this blog post. Just clone it.

Specifying Distributed Applications

Our cluster has four nodes, jc1@localhost, jc2@localhost, jc3@localhost, and jc4@localhost. All that’s left is to create a configuration file that defines which nodes are the main ones and which ones are backups. The configuration is already included in the config/sys.config file in the repository but let me just show the distribution-specific part here.

{kernel, [
  {distributed, [{job_center, 1000, [jc1@localhost, {jc2@localhost, jc3@localhost},
                                     jc4@localhost]}]},
  {sync_nodes_mandatory, [jc1@localhost]},
  {sync_nodes_optional, [jc2@localhost, jc3@localhost, jc4@localhost]},
  {sync_nodes_timeout, 15000}
]}

This just sets the distributed environment variable of the kernel application. The number 1000 is the time in milliseconds to wait for the node to come back up before the application is moved elsewhere. The node precedence is defined by the list [jc1@localhost,{jc2@localhost,jc3@localhost},jc4@localhost]: jc1 has the highest precedence, followed by jc2 and jc3 (which share the same precedence), and should both jc2 and jc3 fail, that’s when jc4 comes in.
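
Once a node is up (we’ll start them shortly), you can double-check what it actually picked up from the config file by querying the kernel application environment:

application:get_env(kernel, distributed).
%% -> {ok,[{job_center,1000,
%%          [jc1@localhost,{jc2@localhost,jc3@localhost},jc4@localhost]}]}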

The {sync_nodes_mandatory, NodeList} environment variable defines the nodes the distributed application controller must synchronise with; the system will start only if all of these nodes are started and connected within the timeout defined by the sync_nodes_timeout variable (15000 milliseconds).

Did you clone the repository yet? No? Let’s clone and compile the code (hoping you at least installed Erlang).

$ git clone https://github.com/lgmfred/jc.git
$ cd jc
$ erl -make

Let’s start a node, shall we? Pick any node you like apart from jc1 and wait for 15 seconds. I chose jc3.

$ erl -sname jc3@localhost -config config/sys -pa ebin/

Yikes! Joker strikes! Call Batman! This is pretty ugly, and definitely not the kind of error you would want your children to look at. The node waited 15 seconds (the time set in sync_nodes_timeout) for the mandatory node jc1 to come up and then crashed with that long, hairy, and complicated error.

Open four terminal tabs/windows and start the first three nodes as shown:

$ erl -sname jc1@localhost -config config/sys -pa ebin/
$ erl -sname jc2@localhost -config config/sys -pa ebin/
$ erl -sname jc3@localhost -config config/sys -pa ebin/

The nodes will wait again for 15 seconds for the optional node jc4 to come up. If it doesn’t, the nodes will start regardless. After the three nodes above have all started, you can then start jc4 and see that it immediately starts without waiting since the mandatory node jc1 is already started.

$ erl -sname jc4@localhost -config config/sys -pa ebin/

Now that all nodes have started, let’s start the real application on each of them. Execute the following command in all four Erlang shells, starting with jc4, and pay attention to when the call returns.

(jc4@localhost)1> application:ensure_all_started(job_center).
(jc3@localhost)1> application:ensure_all_started(job_center).
(jc2@localhost)1> application:ensure_all_started(job_center).
(jc1@localhost)1> application:ensure_all_started(job_center).

You’ll notice that the call hangs in the other nodes and only returns once the job_center application is started on jc1, the node with the highest priority. You’ll also notice from the shells that although the application is started on all four nodes, the job_center_sup supervisor is only started on jc1. If Erlang was installed on a machine with a GUI, you can verify this further by running observer:start(). in each of the Erlang shells and navigating to the Applications tab when the graphical interface pops up. Only jc1 will show both the kernel and job_center applications’ supervision trees.

Failover

While keeping an eye on both jc2 and jc3, shut down jc1 by executing the halt() Erlang shell command.

(jc1@localhost)2> halt().

The distributed application controller will wait 1000 milliseconds for jc1 to come back, and when it doesn’t, you’ll see a progress report for the job_center application being started on either jc2 or jc3 (they have the same precedence, remember?).

Whichever node it was started on (jc2 or jc3), shut that node down as well and watch the application start on the other.

Once again, shut down the node where the application started and you see it start on jc4, the only running node remaining.

Takeover

Now that the application is running on jc4, go to the terminal tab/window which was running jc1 and restart the node (just recall the previous terminal command with the up arrow key and execute it). The node will wait 15 seconds for the optional nodes jc2 and jc3 to start. After the timeout, start the job_center application on jc1 just like before and you’ll immediately see a takeover from jc4: its processes are terminated, the supervision tree on jc4 is taken down, and the application runs on jc1 again.

Conclusion

The strategies discussed in this blog post are valid as long as you have redundant hardware. Run an application on a ‘main’ computer or server, and should it fail, you move it to a backup one.

Beware, most of the Distributed Erlang concepts I’ve tried to discuss here assume that failures are due to hardware faults, not a netsplit. In the real world, netsplits occur more often than you’d like, and they will surely lead you to a world of misery: resolving conflicts that arise from the application running on both the main and backup nodes at the same time.

References

  1. Erlang Reference Manual User’s Guide
  2. OTP Design Principles User’s Guide
  3. Learn You Some Erlang for Great Good!
  4. Programming Erlang (2nd edition)
  5. Designing for Scalability with Erlang/OTP