"Communication Error On Send" With BeeGFS

2017-11-26 09:30 by Thomas Urban

If you followed previous versions of our tutorial on setting up BeeGFS and Docker to run a swarm of containers that share data through a distributed filesystem, you might end up with a BeeGFS cluster that stops working. Deploying services might result in error messages like "communication error on send". The same message may appear when you try to list files in the distributed filesystem using the CLI.

In such a situation you should run beegfs-check-servers and inspect the IP addresses listed per node.

Management
==========
node1 [ID: 1]: reachable at 94.130.185.164:8008 (protocol: TCP)

Metadata
==========
node1 [ID: 1]: reachable at 94.130.185.164:8005 (protocol: TCP)
node2 [ID: 2]: reachable at 138.201.189.209:8005 (protocol: TCP)
node3 [ID: 3]: reachable at 172.17.0.1:8005 (protocol: TCP)

Storage
==========
node1 [ID: 1]: reachable at 94.130.185.164:8003 (protocol: TCP)
node2 [ID: 2]: reachable at 172.17.0.1:8003 (protocol: TCP)
node3 [ID: 3]: reachable at 172.17.0.1:8003 (protocol: TCP)

This example lists the same IP address for different nodes. Checking the nodes in more detail with the command beegfs-ctl --listnodes --nodetype=storage --details reveals this:

node1 [ID: 1]
    Ports: UDP: 8003; TCP: 8003
    Interfaces: eth0(TCP)
node2 [ID: 2]
    Ports: UDP: 8003; TCP: 8003
    Interfaces: docker0(TCP) docker_gwbridge(TCP) eth0(TCP)
node3 [ID: 3]
    Ports: UDP: 8003; TCP: 8003
    Interfaces: docker0(TCP) docker_gwbridge(TCP) eth0(TCP)

Checking the IP address of the docker0 bridge reveals the same address that was listed before: 172.17.0.1.
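Spotting a shared address by eye gets tedious on larger clusters. The following is a minimal Python sketch that scans the text printed by beegfs-check-servers and reports every IP address advertised by more than one node. It assumes the output looks exactly like the listing above; the names NODE_LINE, SAMPLE_REPORT and find_shared_addresses are made up for this illustration and are not part of BeeGFS.

```python
import re
from collections import defaultdict

# Matches lines such as
#   "node3 [ID: 3]: reachable at 172.17.0.1:8005 (protocol: TCP)"
# The pattern is an assumption based on the output shown in this article.
NODE_LINE = re.compile(
    r"^(\S+) \[ID: \d+\]: reachable at (\d+\.\d+\.\d+\.\d+):\d+"
)

# Shortened copy of the report from this article, used as sample input.
SAMPLE_REPORT = """\
Metadata
==========
node1 [ID: 1]: reachable at 94.130.185.164:8005 (protocol: TCP)
node2 [ID: 2]: reachable at 138.201.189.209:8005 (protocol: TCP)
node3 [ID: 3]: reachable at 172.17.0.1:8005 (protocol: TCP)

Storage
==========
node1 [ID: 1]: reachable at 94.130.185.164:8003 (protocol: TCP)
node2 [ID: 2]: reachable at 172.17.0.1:8003 (protocol: TCP)
node3 [ID: 3]: reachable at 172.17.0.1:8003 (protocol: TCP)
"""


def find_shared_addresses(report):
    """Map each IP used by more than one node to the sorted node names."""
    nodes_by_ip = defaultdict(set)
    for line in report.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            name, ip = match.groups()
            nodes_by_ip[ip].add(name)
    # Keep only addresses that different nodes claim for themselves.
    return {
        ip: sorted(names)
        for ip, names in nodes_by_ip.items()
        if len(names) > 1
    }
```

For the sample above, find_shared_addresses(SAMPLE_REPORT) returns {"172.17.0.1": ["node2", "node3"]}. On a live system you would pass in the captured output of beegfs-check-servers instead of the sample text.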

What's Happening?

Apparently, some BeeGFS services bind to all available NICs by default. As a result, the IP address of the internal Docker bridge is advertised to the peer nodes of the cluster. However, a peer connecting to that advertised IP only reaches its own local bridge device and thus never gets in touch with the node it was looking for.
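In a setup like this one, where the nodes talk to each other over public addresses, an advertised address from a private range is a strong hint that a bridge device has leaked into the node list. Here is a small sketch of that check using Python's standard ipaddress module. The function name looks_host_local is hypothetical, and the heuristic is only an assumption that fits this article's cluster: if your nodes are legitimately connected over a private network, it will flag perfectly valid addresses.

```python
import ipaddress


def looks_host_local(address):
    """Return True if the address is private, loopback or link-local.

    Heuristic only: it assumes cluster nodes are supposed to advertise
    public addresses, as in the setup described in this article.
    """
    ip = ipaddress.ip_address(address)
    return ip.is_private or ip.is_loopback or ip.is_link_local
```

With the addresses from the listings above, looks_host_local("172.17.0.1") is true, while the check is false for 94.130.185.164 and 138.201.189.209.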

How To Fix It?

Every BeeGFS configuration file has an option named connInterfacesFile, which is unset by default. This option takes the path of a file listing the NICs to bind to explicitly. So first you need to create such a file, e.g. /etc/beegfs/beegfs-nics.conf, and write into it the name of the NIC that connects your server to the public network. Here it is eth0. You may list multiple names, one per line. Next you need to set that option in every BeeGFS configuration file on every node. The files are:

/etc/beegfs/beegfs-mgmtd.conf (only on the single node providing the management service)
/etc/beegfs/beegfs-meta.conf
/etc/beegfs/beegfs-storage.conf
/etc/beegfs/beegfs-client.conf

In each file, search for the line reading

connInterfacesFile =

and replace it with

connInterfacesFile = /etc/beegfs/beegfs-nics.conf
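If you have several nodes to fix, editing every file by hand is error-prone. The following Python sketch shows one way to script the replacement. The helpers set_conn_interfaces_file and apply_to_files are hypothetical names introduced for this example, and the sketch assumes that the option appears at most once as an uncommented line. Run something like this as root and keep backups of the original files.

```python
import re
from pathlib import Path

# An uncommented "connInterfacesFile = ..." line, with or without a value.
OPTION_LINE = re.compile(
    r"^[ \t]*connInterfacesFile[ \t]*=.*$", re.MULTILINE
)


def set_conn_interfaces_file(config_text, nic_file):
    """Return config_text with connInterfacesFile pointing at nic_file."""
    new_line = "connInterfacesFile = " + nic_file
    if OPTION_LINE.search(config_text):
        # Replace the existing line; a function is used as replacement so
        # the path is inserted literally, without escape processing.
        return OPTION_LINE.sub(lambda _m: new_line, config_text, count=1)
    # Option not present at all: append it on a line of its own.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + new_line + "\n"


def apply_to_files(paths, nic_file):
    """Rewrite each existing file in place (not called automatically)."""
    for path in map(Path, paths):
        if path.exists():
            updated = set_conn_interfaces_file(path.read_text(), nic_file)
            path.write_text(updated)
```

A call such as apply_to_files(["/etc/beegfs/beegfs-meta.conf", "/etc/beegfs/beegfs-storage.conf"], "/etc/beegfs/beegfs-nics.conf") would then update the files that exist on the current node and skip the rest.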

After rebooting all nodes one after the other (or at least restarting the BeeGFS services on each node), the cluster should be working again. beegfs-check-servers should then list the proper IP for every node in each service section.
