Note: running N|Solid on NFS is not fully supported


Expected Result

Run a Docker Swarm environment with 3 nodes (1 manager and 2 workers), all sharing the same NFS directory mount. N|Solid Console is deployed on the manager node. If that node goes down, Swarm fails over to the 2nd or 3rd node and recovers N|Solid Console using the same shared NFS directory.
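
Once the stack from the steps below is deployed, a rough way to exercise this failover is to halt one of the VMs and watch Swarm reschedule the console task (a sketch, using the stack name ns and the VM names defined later in this document; a worker is halted here to simulate a node failure):


# on the manager (vm_2): see which nodes are running the console tasks
$ sudo docker service ps ns_nsolid-console
# on the host machine: simulate a node failure by halting a worker
$ vagrant halt vm_3
# back on the manager: the task should be rescheduled on a remaining node
$ sudo docker service ps ns_nsolid-console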


Steps to reproduce:

We need to create/prepare 4 virtual machines using Vagrant, all of them running CentOS 7.3. The role of each machine (the resulting address plan is sketched after this list):

  • vm 1: Hosts the NFS server and the exported directory
  • vm 2: Docker Swarm manager node; mounts the NFS share from vm 1
  • vm 3 - 4: Docker Swarm worker nodes; mount the NFS share from vm 1
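
For reference, the Vagrantfile below assigns these addresses and hostnames:


10.1.1.161  nfs-mount     (vm_1)
10.1.1.162  managernode   (vm_2)
10.1.1.163  workernode1   (vm_3)
10.1.1.164  workernode2   (vm_4)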


Create a Vagrantfile in a local folder with this content:


$nfs_mount = <<SCRIPT
# configure hostname
hostnamectl set-hostname nfs-mount
# update
#yum update -y
# install nfs
yum install nfs-utils -y
# create the nfs folder
mkdir /var/nfsshare
# change the permissions of the folder
chmod -R 755 /var/nfsshare
chown nfsnobody:nfsnobody /var/nfsshare
# start the necessary services
systemctl enable rpcbind
systemctl enable nfs-server
systemctl enable nfs-lock
systemctl enable nfs-idmap
systemctl enable firewalld
systemctl start rpcbind
systemctl start nfs-server
systemctl start nfs-lock
systemctl start nfs-idmap
systemctl start firewalld
# share the nfs directory with all clients
echo "
/var/nfsshare    *(rw,sync,no_root_squash,no_all_squash)
/home            *(rw,sync,no_root_squash,no_all_squash)
" >> /etc/exports
# start the nfs service
systemctl restart nfs-server
# override centos firewall
firewall-cmd --permanent --zone=public --add-service=nfs
firewall-cmd --permanent --zone=public --add-service=mountd
firewall-cmd --permanent --zone=public --add-service=rpc-bind
firewall-cmd --reload
SCRIPT

$docker_swarm = <<SCRIPT
# make the manager and workers accessible
echo "
10.1.1.162  managernode
10.1.1.163  workernode1
10.1.1.164  workernode2
" >> /etc/hosts
# configure hostname
hostnamectl set-hostname $1
# install docker
wget https://download.docker.com/linux/centos/docker-ce.repo -O /etc/yum.repos.d/docker.repo
yum install docker-ce -y
# start docker services
systemctl enable docker
systemctl enable firewalld
systemctl start docker
systemctl start firewalld
# open ports for docker swarm
firewall-cmd --permanent --add-port=2376/tcp
firewall-cmd --permanent --add-port=2377/tcp
firewall-cmd --permanent --add-port=7946/tcp
firewall-cmd --permanent --add-port=80/tcp
firewall-cmd --permanent --add-port=7946/udp
firewall-cmd --permanent --add-port=4789/udp
# restart the services
firewall-cmd --reload
systemctl restart docker
# nfs needs to be available in all nodes
yum install nfs-utils -y
# create directories for the nfs directory
mkdir -p /mnt/nfs/home
mkdir -p /mnt/nfs/var/nfsshare
# mount the nfs directory into the system
mount -t nfs 10.1.1.161:/var/nfsshare /mnt/nfs/var/nfsshare/ -o nolock
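# note: this mount is not added to /etc/fstab, so it does not persist across reboots
# (see the /etc/fstab example near the end of this document)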
# configuring the manager node
if [[ $1 == 'managernode' ]]; then
  # set this node as manager
  docker swarm init --advertise-addr 10.1.1.162
  # export worker token for swarm nodes
  docker swarm join-token worker -q > /vagrant/worker_token
fi

if [[ $1 == 'workernode1' || $1 == 'workernode2' ]]; then
  # join this node in swarm as a worker
  docker swarm join --token $(cat /vagrant/worker_token) 10.1.1.162:2377
fi
SCRIPT

Vagrant.configure("2") do |config|

  (1..4).each do |i|

    config.vm.define "vm_#{i}" do |s|
      s.vm.box = "bento/centos-7.3"
      s.vm.network "private_network", ip: "10.1.1.16#{i}", netmask: "255.255.255.0", auto_config: true
      # necessary to share token for swarm
      s.vm.synced_folder ".", "/vagrant"
      # nfs mount config
      if i == 1
        s.vm.network :forwarded_port, guest: 8080, host: 8080
        s.vm.network :forwarded_port, guest: 5000, host: 5000
        # nfs mount config
        s.vm.provision :shell, inline: $nfs_mount, privileged: true
      end
      if i == 2
        # Docker Swarm manager node
        s.vm.provision :shell, inline: $docker_swarm, args: "'managernode'", privileged: true
      end
      if i == 3
        # Docker Swarm worker node 1
        s.vm.provision :shell, inline: $docker_swarm, args: "'workernode1'", privileged: true
      end
      if i == 4
        # Docker Swarm worker node 2
        s.vm.provision :shell, inline: $docker_swarm, args: "'workernode2'", privileged: true
      end
      s.vm.provider "virtualbox" do |vb|
        vb.name = "vm_#{i}"
        vb.memory = "1024"
        vb.cpus = "2"
        vb.gui = false
      end
    end

  end

end


Then create a Docker Compose stack file named docker-compose.yml in the same folder as the Vagrantfile, with the following content:


version: "3.1"

services:

  nsolid-console:
    image: nodesource/nsolid-console:latest
    environment:
      - NSOLID_CONSOLE_LICENSE_KEY=93224ed2-edb0-4f98-af83-cd66b12adbea
    ports:
      - 6753:6753
    networks:
      - nsolid
    volumes:
      - /mnt/nfs/var/nfsshare/console:/var/lib/nsolid/console
    deploy:
      replicas: 2
      mode: replicated
      resources:
        limits:
          cpus: '2'
          memory: 2G
      restart_policy:
        condition: on-failure

networks:
  nsolid:


Replace NSOLID_CONSOLE_LICENSE_KEY with the key that was provided to you.
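
The bind-mounted source path lives on the shared NFS mount. Depending on the Docker version, a missing bind-mount source can make a Swarm task fail, so it may help to create the directory up front from any one of the nodes (an optional step, not part of the original instructions):


$ sudo mkdir -p /mnt/nfs/var/nfsshare/console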


Then run:


$ vagrant up
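
Once provisioning finishes, you can optionally confirm that the NFS export is visible from one of the Docker nodes (a quick sanity check, not part of the original steps); the export list should include /var/nfsshare and /home:


$ vagrant ssh vm_2 -c "showmount -e 10.1.1.161"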


SSH into the manager node and run:


$ vagrant ssh vm_2
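
Before deploying, you can confirm that all three nodes joined the swarm (the vagrant user is not in the docker group, so docker commands inside the VMs need sudo). All three nodes should show a Ready status, with managernode marked as Leader:


$ sudo docker node ls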


Start the nsolid-console service in Docker Swarm:


$ sudo docker stack deploy -c /vagrant/docker-compose.yml ns


You can check the status of the service by running:


$ sudo docker service ls


The output should look like:


ID      NAME                MODE          REPLICAS   IMAGE                              PORTS
xxx     ns_nsolid-console   replicated    2/2        nodesource/nsolid-console:latest   *:6753->6753/tcp


Check that the REPLICAS column shows 2/2. If it does not, check the service logs using:


$ sudo docker service logs [ ID ] > /vagrant/logs-console
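
You can also check which nodes the replicas were scheduled on and their current state:


$ sudo docker service ps ns_nsolid-console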


After finishing, the N|Solid Console should be reachable at the following addresses:


# manager node ip address
10.1.1.162:6753
# worker1 node ip address
10.1.1.163:6753
# worker2 node ip address
10.1.1.164:6753


All of them share the NFS directory mount exported by 10.1.1.161.
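
A quick way to verify reachability through the Swarm routing mesh is an HTTP request against any of the node addresses, regardless of where the replicas are actually running (assuming curl is available on the host machine or inside a VM):


$ curl -I http://10.1.1.162:6753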


Possible issues when using NFS

  • Check that the mount options include nolock
  • Check that the services required for NFS and locking are running after every reboot (an /etc/fstab sketch follows this list)
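
One way to address both points is to make the mount persistent and explicit about locking via /etc/fstab on each Docker node (a sketch, assuming the paths and IPs used in this document):


# /etc/fstab entry on each Docker node
10.1.1.161:/var/nfsshare  /mnt/nfs/var/nfsshare  nfs  defaults,nolock  0 0

# verify the entry mounts cleanly without rebooting
$ sudo mount -a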


Blockers/Considerations:

  • UDP is generally not a valid transport protocol for NFSv4. Early NFS 4.0 implementations still allowed UDP, so the UDP transport can be used in rare cases, but RFC 5661 explicitly states that UDP alone should not be used as the transport protocol in NFS 4.1. Errors caused by using an unsupported transport protocol with a specific NFS version are not always clear, particularly when attempting to use UDP with NFSv4 (the negotiated version and transport can be confirmed with the check after this list).
  • The problem with NFS is that it is remote and shared: the server has to manage caching and locking (since other hosts accessing the same files could try to change them), which causes problems for applications that expect the files to be local.
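
To confirm which NFS version and transport a node actually negotiated (TCP vs UDP), you can inspect the active mount options on any Docker node (a quick check, not part of the original steps):


# prints vers= and proto= for every NFS mount
$ nfsstat -m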


Other solutions/possibilities for using NFS:

  • N|Solid isn't fully supported on NFS, but we think a disk image hosted on NFS could be a possible solution (a rough sketch follows). A disk image is different because, although the image file itself is potentially shared, once you mount the image the local operating system sees it as a local filesystem and assumes that regular file access and locking are safe. The flip side is that mounting the same image on multiple systems at the same time is likely to corrupt it because of those assumptions.
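
A minimal, untested sketch of that idea, assuming the NFS mount path used earlier in this document. The image file must only be mounted on one node at a time, and the console's data directory would then be bind-mounted from the loop mount instead of directly from the NFS path:


# create a 10 GB image file on the NFS share (size chosen arbitrarily here)
$ sudo truncate -s 10G /mnt/nfs/var/nfsshare/console.img
# put a local filesystem on it (-F is needed because the target is a regular file, not a block device)
$ sudo mkfs.ext4 -F /mnt/nfs/var/nfsshare/console.img
# loop-mount it; the OS now treats this storage as a local filesystem
$ sudo mkdir -p /mnt/console-image
$ sudo mount -o loop /mnt/nfs/var/nfsshare/console.img /mnt/console-image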