Node Failure Recovery
By the end of this exercise, you should be able to:
- Anticipate swarm scheduling decisions when nodes fail and recover
- Force swarm to reallocate workload across its nodes
Setting up a Service
Set up a microsoft/iis service with four replicas on node-0, and wait for all four tasks to be up and running:
PS: node-0 Administrator> docker service create --replicas 4 --name iis microsoft/iis
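If you would rather not check by hand, waiting for the tasks to come up can be sketched as a small polling loop run on node-0. This is only an illustration and not part of the original exercise: the five-second interval is an arbitrary choice, and the loop assumes iis is the only service whose name starts with "iis", since the name filter matches by prefix.

```powershell
# Poll the REPLICAS column of the iis service until it reads 4/4.
# Assumption: no other service name begins with "iis".
do {
    Start-Sleep -Seconds 5
    $replicas = docker service ls --filter "name=iis" --format "{{.Replicas}}"
    Write-Host "iis replicas: $replicas"
} until ($replicas -eq "4/4")
```

Pulling the microsoft/iis image can take a while on a fresh node, so expect the count to stay below 4/4 for some time.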
Simulating Node Failure
Log in to the non-manager node in your swarm (node-3), and simulate a node failure by rebooting it:
PS: node-3 Administrator> Restart-Computer -Force
Back on node-0, keep running docker service ps iis every few seconds; what happens to the task running on the rebooted node? Look at its desired state, any other tasks that get scheduled with the same name, and keep watching until node-3 comes back online.
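Rather than retyping the command, one option is a short loop on node-0 that reprints both the node list and the service's tasks. Treat this as a convenience sketch under assumptions rather than a required step: the five-second refresh is arbitrary, and the loop runs until you stop it with Ctrl+C.

```powershell
# Refresh the swarm's node list and the iis task list every 5 seconds.
# Stop the loop with Ctrl+C once node-3 is back online.
while ($true) {
    Clear-Host
    docker node ls          # STATUS column shows whether node-3 is Down or Ready
    docker service ps iis   # compare DESIRED STATE and CURRENT STATE per task
    Start-Sleep -Seconds 5
}
```

Showing docker node ls next to docker service ps iis makes it easier to line up the moment node-3 changes status with the moment its task is replaced.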
Force Rebalancing
By default, if a node fails and rejoins a swarm it will not get its old workload back; if we want to redistribute workload across a swarm after new nodes join (or old nodes rejoin), we need to force-rebalance our tasks.
Make sure node-3 has fully rebooted and rejoined the swarm.
Force rebalance the tasks:
PS: node-0 Administrator> docker service update --force iis
Just like when you changed the configuration of your service to add an environment variable to its containers in the previous exercise, you'll see each task get shut down and rescheduled, until the final scheduling state of the service has spread tasks out across the swarm as if it was a freshly scheduled service.
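To see how the tasks ended up distributed, one approach is to count the running tasks per node. The pipeline below is a minimal sketch, assuming the service is still named iis. It uses the desired-state filter and the .Node format placeholder of docker service ps, then groups the resulting node names in PowerShell.

```powershell
# Print the node of every iis task whose desired state is Running,
# then count how many of those tasks sit on each node.
docker service ps iis --filter "desired-state=running" --format "{{.Node}}" |
    Group-Object |
    Select-Object Name, Count
```

Running it once before the forced update and once after it finishes lets you compare the two distributions directly.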
Cleanup
Remove all existing services, in preparation for future exercises:
PS: node-0 Administrator> docker service rm $(docker service ls -q)
Conclusion
In this exercise, you saw swarm's scheduler in action - when a node is lost from the swarm, tasks are automatically rescheduled to restore the state of our services. Note that nodes joining or rejoining the swarm do not get workload automatically reallocated from existing nodes to them; rescheduling only happens when tasks crash, services are first scheduled, or you force a reschedule as above.