Instructor Demo: Self-Healing Swarm
In this demo, we'll illustrate:
- Setting up a swarm
- How swarm makes basic scheduling decisions
- Actions swarm takes to self-heal a docker service
Setting Up a Swarm
Start by making sure no containers are running on any of your nodes:
```
PS: node-0 Administrator> docker container rm -f $(docker container ls -aq)
PS: node-1 Administrator> docker container rm -f $(docker container ls -aq)
PS: node-2 Administrator> docker container rm -f $(docker container ls -aq)
PS: node-3 Administrator> docker container rm -f $(docker container ls -aq)
```

Initialize a swarm on node-0:

```
PS: node-0 Administrator> $PRIVATEIP = '<node-0 private IP>'
PS: node-0 Administrator> docker swarm init `
    --advertise-addr ${PRIVATEIP} `
    --listen-addr ${PRIVATEIP}:2377

Swarm initialized: current node (xyz) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-0s96... 10.10.1.40:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
```

List the nodes in your swarm:

```
PS: node-0 Administrator> docker node ls
ID      HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
xyz *   node-0     Ready    Active         Leader
```

Add some workers to your swarm by cutting and pasting the `docker swarm join` command, with its token, that `docker swarm init` printed above:

```
PS: node-1 Administrator> docker swarm join --token SWMTKN-1-0s96... 10.10.1.40:2377
PS: node-2 Administrator> docker swarm join --token SWMTKN-1-0s96... 10.10.1.40:2377
PS: node-3 Administrator> docker swarm join --token SWMTKN-1-0s96... 10.10.1.40:2377
```

Each node should report `This node joined a swarm as a worker.` after joining.

Back on node-0, list your swarm members again:

```
PS: node-0 Administrator> docker node ls
ID      HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
ghi     node-3     Ready    Active
def     node-2     Ready    Active
abc     node-1     Ready    Active
xyz *   node-0     Ready    Active         Leader
```

You have a four-member swarm, ready to accept workloads.
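If you lose track of the join command, there's no need to re-initialize anything; the manager can reprint it on demand. A quick sketch (the token and IP shown are just the elided values from above):

```
# Regenerate the worker join command at any time on a manager node
PS: node-0 Administrator> docker swarm join-token worker
To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-0s96... 10.10.1.40:2377
```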
Scheduling Workload
Create a service on your swarm:
```
PS: node-0 Administrator> docker service create `
    --replicas 4 `
    --name service-demo `
    microsoft/nanoserver:latest ping -t 8.8.8.8
```

List what processes have been started for your service:

```
PS: node-0 Administrator> docker service ps service-demo
ID             NAME             IMAGE                         NODE     DESIRED STATE   CURRENT STATE
g3dimc0nkoha   service-demo.1   microsoft/nanoserver:latest   node-3   Running         Running 18 seconds ago
e7d7sy5saqqo   service-demo.2   microsoft/nanoserver:latest   node-0   Running         Running 18 seconds ago
wv0culf6w8m6   service-demo.3   microsoft/nanoserver:latest   node-1   Running         Running 18 seconds ago
ty35gss71mpf   service-demo.4   microsoft/nanoserver:latest   node-2   Running         Running 18 seconds ago
```

Our service has scheduled four tasks, one on each node in our cluster; by default, swarm tries to spread tasks out evenly across hosts, but much more sophisticated scheduling controls are also available.
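As a hedged illustration of those controls (not part of this demo), the command below combines a placement constraint with a placement preference. The service name `constrained-demo` and the node label `zone` are made up for the example:

```
# Illustration only: schedule tasks on worker nodes only, and spread them
# across values of a (hypothetical) node label named "zone"
PS: node-0 Administrator> docker service create `
    --replicas 4 `
    --name constrained-demo `
    --constraint 'node.role==worker' `
    --placement-pref 'spread=node.labels.zone' `
    microsoft/nanoserver:latest ping -t 8.8.8.8
```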
Maintaining Desired State
Connect to node-1, and list the containers running there:

```
PS: node-1 Administrator> docker container ls
CONTAINER ID   IMAGE           COMMAND                     CREATED         STATUS         NAMES
5b5f77c67eff   54.152.61.101   "powershell ping 8.8.8.8"   4 minutes ago   Up 4 minutes   service-demo.3.wv0cul...
```

Note the container's name indicates the service it belongs to.
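If you'd rather confirm that ownership than infer it from the name, swarm also labels task containers with the service they belong to. A small sketch, assuming the label key `com.docker.swarm.service.name` and the service name used above:

```
# List only containers whose names match our service
PS: node-1 Administrator> docker container ls --filter "name=service-demo"

# Inspect the swarm-applied label that records the owning service
PS: node-1 Administrator> docker container inspect `
    --format '{{ index .Config.Labels "com.docker.swarm.service.name" }}' <container ID>
```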
Let's simulate a container crash by killing off this container:

```
PS: node-1 Administrator> docker container rm -f <container ID>
```

Back on our swarm manager node-0, list the processes running for our `service-demo` service again:

```
PS: node-0 Administrator> docker service ps service-demo
ID             NAME                 IMAGE                         NODE     DESIRED STATE   CURRENT STATE
g3dimc0nkoha   service-demo.1       microsoft/nanoserver:latest   node-3   Running         Running 6 minutes ago
e7d7sy5saqqo   service-demo.2       microsoft/nanoserver:latest   node-0   Running         Running 6 minutes ago
u7l8vf2hqw0z   service-demo.3       microsoft/nanoserver:latest   node-1   Running         Running 3 seconds ago
wv0culf6w8m6    \_ service-demo.3   microsoft/nanoserver:latest   node-1   Shutdown        Failed 3 seconds ago
ty35gss71mpf   service-demo.4       microsoft/nanoserver:latest   node-2   Running         Running 6 minutes ago
```

Swarm has automatically started a replacement container for the one you killed on node-1. Go back over to node-1 and run `docker container ls` again; you'll see a new container for this service up and running.

Next, let's simulate a complete node failure by rebooting one of our nodes; on node-3, navigate Start menu -> Power -> Restart.

Back on your swarm manager, check your service containers again; after a few moments, you should see something like:
```
PS: node-0 Administrator> docker service ps service-demo
ID             NAME                 IMAGE                         NODE     DESIRED STATE   CURRENT STATE
ralm6irhj6vu   service-demo.1       microsoft/nanoserver:latest   node-0   Running         Running 19 seconds ago
g3dimc0nkoha    \_ service-demo.1   microsoft/nanoserver:latest   node-3   Shutdown        Running 38 seconds ago
e7d7sy5saqqo   service-demo.2       microsoft/nanoserver:latest   node-0   Running         Running 12 minutes ago
u7l8vf2hqw0z   service-demo.3       microsoft/nanoserver:latest   node-1   Running         Running 5 minutes ago
wv0culf6w8m6    \_ service-demo.3   microsoft/nanoserver:latest   node-1   Shutdown        Failed 5 minutes ago
ty35gss71mpf   service-demo.4       microsoft/nanoserver:latest   node-2   Running         Running 12 minutes ago
```

The task on node-3 was scheduled for `Shutdown` when the swarm manager lost its connection to that node, and meanwhile the workload has been rescheduled (onto node-0, in this case). When node-3 comes back up and rejoins the swarm, its container will be confirmed to be in the `Shutdown` state, and reconciliation is complete.
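A reboot is one way to take a node out of service; for planned maintenance you can get a similar rescheduling effect without a restart by draining the node. This isn't part of the demo steps above, just a sketch of the relevant commands:

```
# Drain node-3: its tasks are rescheduled onto the remaining nodes
PS: node-0 Administrator> docker node update --availability drain node-3

# Later, make node-3 eligible for new tasks again (existing tasks are not
# rebalanced back onto it automatically)
PS: node-0 Administrator> docker node update --availability active node-3
```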
Remove your `service-demo` service:

```
PS: node-0 Administrator> docker service rm service-demo
```

All tasks and containers will be removed.
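The demo leaves the swarm itself intact so you can reuse it; if you do want to dissolve it completely, one possible teardown (not part of the original steps) looks like this:

```
# Workers leave first...
PS: node-1 Administrator> docker swarm leave
PS: node-2 Administrator> docker swarm leave
PS: node-3 Administrator> docker swarm leave

# ...then the last manager forces itself out, dissolving the swarm
PS: node-0 Administrator> docker swarm leave --force
```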
Conclusion
One of the great advantages of container portability is that orchestrators like Swarm can schedule and re-schedule workloads across an entire datacenter, so that if a given node fails, its workload can be automatically moved to another host with available resources. In the above example, we saw the most basic form of the 'reconciliation loop' swarm provides: the swarm manager constantly monitors all the containers it has scheduled, and replaces them automatically if they fail or their hosts become unreachable.