Instructor Demo: Self-Healing Swarm
In this demo, we'll illustrate:
- Setting up a swarm
- How swarm makes basic scheduling decisions
- Actions swarm takes to self-heal a docker service
Setting Up a Swarm
Start by making sure no containers are running on any of your nodes:
```
[centos@node-0 ~]$ docker container rm -f $(docker container ls -aq)
[centos@node-1 ~]$ docker container rm -f $(docker container ls -aq)
[centos@node-2 ~]$ docker container rm -f $(docker container ls -aq)
[centos@node-3 ~]$ docker container rm -f $(docker container ls -aq)
```

Initialize a swarm on one node:
```
[centos@node-0 ~]$ docker swarm init
Swarm initialized: current node (xyz) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-0s96... 10.10.1.40:2377

To add a manager to this swarm, run 'docker swarm join-token manager'
and follow the instructions.
```

List the nodes in your swarm:
```
[centos@node-0 ~]$ docker node ls

ID      HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
xyz *   node-0     Ready    Active         Leader
```

Add some workers to your swarm by cutting and pasting the `docker swarm join` command, with its token, that Docker provided in step 2 above:

```
[centos@node-1 ~]$ docker swarm join --token SWMTKN-1-0s96... 10.10.1.40:2377
[centos@node-2 ~]$ docker swarm join --token SWMTKN-1-0s96... 10.10.1.40:2377
[centos@node-3 ~]$ docker swarm join --token SWMTKN-1-0s96... 10.10.1.40:2377
```

Each node should report `This node joined a swarm as a worker.` after joining.

Back on your first node, list your swarm members again:
```
[centos@node-0 ~]$ docker node ls

ID      HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
ghi     node-3     Ready    Active
def     node-2     Ready    Active
abc     node-1     Ready    Active
xyz *   node-0     Ready    Active         Leader
```

You have a four-member swarm, ready to accept workloads.
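The manager keeps this member list current by tracking periodic heartbeats from each worker; when heartbeats stop arriving, a node is eventually marked `Down`. As a rough illustration of that bookkeeping (the class, names, and timeout below are ours, not Docker's internals):

```python
import time

# Toy model of heartbeat-based node liveness tracking. The timeout value
# is arbitrary here; swarm uses its own dispatcher heartbeat settings.
class NodeMonitor:
    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_seen = {}  # node name -> timestamp of last heartbeat

    def heartbeat(self, node, now=None):
        # Record that we heard from this node.
        self.last_seen[node] = now if now is not None else time.time()

    def status(self, node, now=None):
        # A node is Ready if we've heard from it within the timeout window.
        now = now if now is not None else time.time()
        seen = self.last_seen.get(node)
        if seen is None:
            return "Unknown"
        return "Ready" if now - seen <= self.timeout else "Down"

monitor = NodeMonitor(timeout=5.0)
monitor.heartbeat("node-1", now=100.0)
print(monitor.status("node-1", now=102.0))  # Ready
print(monitor.status("node-1", now=110.0))  # Down
```

This "mark Down on missed heartbeats" behavior is what drives the rescheduling you'll see later in the demo, when a node reboots.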
Scheduling Workload
Create a service on your swarm:
```
[centos@node-0 ~]$ docker service create \
    --replicas 4 \
    --name service-demo \
    centos:7 ping 8.8.8.8
```

List what processes have been started for your service:
```
[centos@node-0 ~]$ docker service ps service-demo

ID             NAME             IMAGE      NODE     DESIRED STATE   CURRENT STATE
g3dimc0nkoha   service-demo.1   centos:7   node-3   Running         Running 18 seconds ago
e7d7sy5saqqo   service-demo.2   centos:7   node-0   Running         Running 18 seconds ago
wv0culf6w8m6   service-demo.3   centos:7   node-1   Running         Running 18 seconds ago
ty35gss71mpf   service-demo.4   centos:7   node-2   Running         Running 18 seconds ago
```

Our service has scheduled four tasks, one on each node in our cluster; by default, swarm tries to spread tasks out evenly across hosts, but much more sophisticated scheduling controls are also available.
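The even placement above can be sketched as a simple "spread" strategy: each new task goes to the node currently running the fewest tasks. This is only a high-level illustration of the default behavior; swarm's real scheduler also weighs resources, constraints, and placement preferences:

```python
# Sketch of a "spread" placement strategy, assuming nothing beyond task
# counts matters. Function and variable names here are illustrative.
def spread_schedule(nodes, num_tasks):
    load = {node: 0 for node in nodes}  # node -> number of tasks placed
    placements = []
    for _ in range(num_tasks):
        # Pick the least-loaded node; break ties by name for determinism.
        target = min(load, key=lambda n: (load[n], n))
        load[target] += 1
        placements.append(target)
    return placements

nodes = ["node-0", "node-1", "node-2", "node-3"]
print(spread_schedule(nodes, 4))  # one task lands on each node
```

With four replicas and four nodes, every node gets exactly one task, matching the `docker service ps` output above.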
Maintaining Desired State
Connect to `node-1`, and list the containers running there:

```
[centos@node-1 ~]$ docker container ls

CONTAINER ID   IMAGE      COMMAND          CREATED         STATUS         NAMES
5b5f77c67eff   centos:7   "ping 8.8.8.8"   4 minutes ago   Up 4 minutes   service-demo.3.wv0cul...
```

Note the container's name indicates the service it belongs to.
Let's simulate a container crash by killing off this container:

```
[centos@node-1 ~]$ docker container rm -f <container ID>
```

Back on our swarm manager, list the processes running for our `service-demo` service again:

```
[centos@node-0 ~]$ docker service ps service-demo

ID             NAME                 IMAGE      NODE     DESIRED STATE   CURRENT STATE
g3dimc0nkoha   service-demo.1       centos:7   node-3   Running         Running 6 minutes ago
e7d7sy5saqqo   service-demo.2       centos:7   node-0   Running         Running 6 minutes ago
u7l8vf2hqw0z   service-demo.3       centos:7   node-1   Running         Running 3 seconds ago
wv0culf6w8m6    \_ service-demo.3   centos:7   node-1   Shutdown        Failed 3 seconds ago
ty35gss71mpf   service-demo.4       centos:7   node-2   Running         Running 6 minutes ago
```

Swarm has automatically started a replacement container for the one you killed on `node-1`. Go back over to `node-1` and run `docker container ls` again; you'll see a new container for this service up and running.

Next, let's simulate a complete node failure by rebooting one of our nodes:
```
[centos@node-3 ~]$ sudo reboot now
```

Back on your swarm manager, check your service containers again:

```
[centos@node-0 ~]$ docker service ps service-demo

ID             NAME                 IMAGE      NODE     DESIRED STATE   CURRENT STATE
ralm6irhj6vu   service-demo.1       centos:7   node-0   Running         Running 19 seconds ago
g3dimc0nkoha    \_ service-demo.1   centos:7   node-3   Shutdown        Running 38 seconds ago
e7d7sy5saqqo   service-demo.2       centos:7   node-0   Running         Running 12 minutes ago
u7l8vf2hqw0z   service-demo.3       centos:7   node-1   Running         Running 5 minutes ago
wv0culf6w8m6    \_ service-demo.3   centos:7   node-1   Shutdown        Failed 5 minutes ago
ty35gss71mpf   service-demo.4       centos:7   node-2   Running         Running 12 minutes ago
```

The task on `node-3` was marked for `Shutdown` when the swarm manager lost connection to that node, and meanwhile its workload was rescheduled, onto `node-0` in this case. When `node-3` comes back up and rejoins the swarm, its container will be confirmed to be in the `Shutdown` state, and reconciliation is complete.

Remove your `service-demo` service:

```
[centos@node-0 ~]$ docker service rm service-demo
```

All tasks and containers will be removed.
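Both recovery events you just observed follow the same pattern: compare desired state (four running replicas) against observed state, and schedule replacements on healthy nodes for anything failed or stranded on a `Down` node. A minimal sketch of that comparison, with data structures and names of our own invention rather than Docker's internals:

```python
# Illustrative reconciliation step: given a desired replica count, the
# observed tasks, and per-node health, return the set of running tasks
# after replacements. Round-robin placement stands in for the scheduler.
def reconcile(desired_replicas, tasks, node_status):
    # A task only counts if it is Running AND its node is still Ready.
    healthy = [t for t in tasks
               if t["state"] == "Running" and node_status.get(t["node"]) == "Ready"]
    ready_nodes = sorted(n for n, s in node_status.items() if s == "Ready")
    new_tasks = []
    i = 0
    while len(healthy) + len(new_tasks) < desired_replicas:
        node = ready_nodes[i % len(ready_nodes)]
        new_tasks.append({"node": node, "state": "Running"})
        i += 1
    return healthy + new_tasks

tasks = [
    {"node": "node-0", "state": "Running"},
    {"node": "node-1", "state": "Failed"},   # the container we killed
    {"node": "node-2", "state": "Running"},
    {"node": "node-3", "state": "Running"},  # but node-3 has gone Down
]
status = {"node-0": "Ready", "node-1": "Ready",
          "node-2": "Ready", "node-3": "Down"}
result = reconcile(4, tasks, status)
print(len(result))  # back to 4 running tasks, none on node-3
```

Running this loop continuously against live state is, at a high level, how the manager converged back to four replicas in both failure scenarios above.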
Conclusion
One of the great advantages of container portability is that an orchestrator like swarm can schedule and reschedule workloads across an entire datacenter: if a given node fails, all of its workload can be moved automatically to another host with available resources. In the example above, we saw the most basic form of the 'reconciliation loop' swarm provides: the swarm manager constantly monitors all the containers it has scheduled, and replaces them completely automatically if they fail or their hosts become unreachable.