# Machine deletion process
Machine deletions occur in various cases, for example:
- Control plane (e.g. KCP) or MachineDeployment rollouts
- Scale downs of MachineDeployments / MachineSets
- Machine remediations
- Machine deletions (e.g. `kubectl delete machine`)
This page describes how Cluster API deletes Machines.
Machine deletion can be broken down into the following phases:

1. Machine deletion is triggered (i.e. the `metadata.deletionTimestamp` is set)
2. Machine controller waits until all pre-drain hooks succeeded, if any are registered
   - Pre-drain hooks can be registered by adding annotations with the `pre-drain.delete.hook.machine.cluster.x-k8s.io` prefix to the Machine object (see the example manifest below)
3. Machine controller checks if the Machine should be drained, drain is skipped if:
   - The Machine has the `machine.cluster.x-k8s.io/exclude-node-draining` annotation
   - The `Machine.spec.nodeDrainTimeout` field is set and already expired (unset or `0` means no timeout)
4. If the Machine should be drained, the Machine controller evicts all relevant Pods from the Node (see details in [Node drain](#node-drain))
5. Machine controller checks if we should wait until all volumes are detached, this is skipped if:
   - The Machine has the `machine.cluster.x-k8s.io/exclude-wait-for-node-volume-detach` annotation
   - The `Machine.spec.nodeVolumeDetachTimeout` field is set and already expired (unset or `0` means no timeout)
6. If we should wait for volume detach, the Machine controller waits until `Node.status.volumesAttached` is empty
   - Typically the volumes are detached by CSI after the corresponding Pods have been evicted during drain
7. Machine controller waits until all pre-terminate hooks succeeded, if any are registered
   - Pre-terminate hooks can be registered by adding annotations with the `pre-terminate.delete.hook.machine.cluster.x-k8s.io` prefix to the Machine object
8. Machine controller deletes the `InfrastructureMachine` object (e.g. `DockerMachine`) of the Machine and waits until it is gone
9. Machine controller deletes the `BootstrapConfig` object (e.g. `KubeadmConfig`) of the Machine and waits until it is gone
10. Machine controller deletes the Node object in the workload cluster
    - Node deletion will be retried until either the Node object is gone or `Machine.spec.nodeDeletionTimeout` has expired (`0` means no timeout, but the field defaults to 10s)
    - Note: Nodes are usually also deleted by cloud controller managers, which is why Cluster API by default only tries to delete Nodes for 10s.
Note: There are cases where Node drain, waiting for volume detach, and Node deletion are skipped. For these, please take a look at the implementation of the `isDeleteNodeAllowed` function.
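For illustration, here is a minimal sketch of a Machine manifest combining the knobs mentioned above. The hook name suffix and annotation values (`my-controller`) are placeholders chosen by whichever component registers the hook, and the timeout values are arbitrary examples:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: my-machine
  annotations:
    # Pre-drain hook (placeholder suffix/value): drain does not start
    # until the owning component removes this annotation.
    pre-drain.delete.hook.machine.cluster.x-k8s.io/my-controller: my-controller
    # Pre-terminate hook (placeholder suffix/value): deletion of the
    # InfrastructureMachine is blocked until this annotation is removed.
    pre-terminate.delete.hook.machine.cluster.x-k8s.io/my-controller: my-controller
    # Uncomment to skip Node drain / waiting for volume detach entirely:
    # machine.cluster.x-k8s.io/exclude-node-draining: ""
    # machine.cluster.x-k8s.io/exclude-wait-for-node-volume-detach: ""
spec:
  # Arbitrary example values; unset or 0 means no timeout
  # (nodeDeletionTimeout defaults to 10s).
  nodeDrainTimeout: 10m
  nodeVolumeDetachTimeout: 5m
  nodeDeletionTimeout: 10s
  # ... other required fields (clusterName, bootstrap, infrastructureRef) omitted
```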
## Node drain
This section describes details of the Node drain process in Cluster API. Cluster API implements Node drain aligned with `kubectl drain`. One major difference is that the Cluster API controller does not actively wait during `Reconcile` until all Pods are drained from the Node. Instead, it continuously evicts Pods and requeues after 20s until all relevant Pods have been drained from the Node or until `Machine.spec.nodeDrainTimeout` is reached (if configured).
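For comparison, a manual drain of the same Node with kubectl (which, unlike the Machine controller, blocks until the drain completes) might look like the following; the flags are an assumption chosen to roughly mirror the controller's behavior of evicting everything except DaemonSet-managed and mirror Pods:

```bash
# Manual, synchronous drain; Node name taken from the log examples below.
kubectl drain my-cluster-md-0-wxtcg-mtg57-k9qvz --ignore-daemonsets --delete-emptydir-data
```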
Node drain can be broken down into the following phases:

1. Node is cordoned (i.e. the `Node.spec.unschedulable` field is set, which leads to the `node.kubernetes.io/unschedulable:NoSchedule` taint being added to the Node)
   - This prevents Pods that have already been evicted from being rescheduled to the same Node. Please only tolerate this taint if you know what you are doing! Otherwise the Machine controller can get stuck continuously evicting the same Pods.
2. Machine controller calculates the list of Pods that should be evicted. These are all Pods on the Node, except:
   - Pods belonging to an existing DaemonSet (orphaned DaemonSet Pods have to be evicted as well)
   - Mirror Pods, i.e. Pods with the `kubernetes.io/config.mirror` annotation (usually static Pods managed by kubelet, like `kube-apiserver`)
3. If there are no (more) Pods that have to be evicted and all Pods that have been evicted are gone, Node drain is completed
4. Otherwise an eviction will be triggered for all Pods that have to be evicted (see the sketch after this list). There are various reasons why an eviction call could fail:
   - The eviction would violate a PodDisruptionBudget, i.e. not enough Pod replicas would be available if the Pod were evicted
   - The namespace is terminating, in which case the `kube-controller-manager` is responsible for setting the `.metadata.deletionTimestamp` on the Pod
   - Other errors, e.g. a connection issue when calling the eviction API of the workload cluster
5. Please note that when an eviction goes through, this only means that the `.metadata.deletionTimestamp` is set on the Pod; the Pod also has to be terminated and the Pod object has to go away for the drain to complete.
6. These steps are repeated every 20s until all relevant Pods have been drained from the Node
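For reference, "triggering an eviction" in step 4 means creating an `Eviction` object against the Pod's `eviction` subresource, which the API server rejects if a PodDisruptionBudget would be violated. A minimal sketch, with the Pod name taken from the log examples below:

```yaml
# Submitted to /api/v1/namespaces/test-namespace/pods/<pod-name>/eviction,
# e.g. via `kubectl create --raw <url> -f <file>`; the API server rejects
# the request if it would violate a PodDisruptionBudget.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: nginx-deployment-6886c85ff7-77fpw
  namespace: test-namespace
```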
Special cases:

- If the Node doesn't exist anymore, Node drain is entirely skipped
- If the Node is `unreachable` (i.e. the Node `Ready` condition is in status `Unknown`):
  - Pods with a `.metadata.deletionTimestamp` more than 1s in the past are ignored
  - Pod evictions will use 1s `GracePeriodSeconds`, i.e. the `terminationGracePeriodSeconds` field from the Pod spec will be ignored
  - Note: PodDisruptionBudgets are still respected, because both of these changes are only relevant if the call to trigger the Pod eviction goes through. But Pod eviction calls are rejected when PodDisruptionBudgets would be violated by the eviction.
### Observability
The drain process can be observed through the `DrainingSucceeded` condition on the Machine and various logs.
#### Example condition
To determine which Pods are blocking the drain and why, you can take a look at the `DrainingSucceeded` condition on the Machine, e.g.:
```yaml
status:
  ...
  conditions:
  ...
  - lastTransitionTime: "2024-08-30T13:36:27Z"
    message: |-
      Drain not completed yet:
      * Pods with deletionTimestamp that still exist: cert-manager/cert-manager-756d54fb98-hcb6k
      * Pods with eviction failed:
        * Cannot evict pod as it would violate the pod's disruption budget. The disruption budget nginx needs 10 healthy pods and has 10 currently: test-namespace/nginx-deployment-6886c85ff7-2jtqm, test-namespace/nginx-deployment-6886c85ff7-7ggsd, test-namespace/nginx-deployment-6886c85ff7-f6z4s, ... (7 more)
    reason: Draining
    severity: Info
    status: "False"
    type: DrainingSucceeded
```
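A PodDisruptionBudget producing the message above would, hypothetically, look something like the following; with `minAvailable` equal to the current replica count, every eviction is rejected and the drain blocks until the Deployment is scaled up or the budget is relaxed:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx
  namespace: test-namespace
spec:
  # With 10 replicas and minAvailable: 10, no voluntary disruption
  # (including drain-triggered eviction) is allowed.
  minAvailable: 10
  selector:
    matchLabels:
      app: nginx   # hypothetical label matching the nginx-deployment Pods
```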
#### Example logs
When cordoning the Node:

```text
I0830 12:50:13.961156 17 machine_controller.go:716] "Cordoning Node" ... Node="my-cluster-md-0-wxtcg-mtg57-k9qvz"
```

When starting the drain:

```text
I0830 12:50:13.961156 17 machine_controller.go:716] "Draining Node" ... Node="my-cluster-md-0-wxtcg-mtg57-k9qvz"
```

Immediately before Pods are evicted:

```text
I0830 12:52:58.739093 17 drain.go:172] "Drain not completed yet, there are still Pods on the Node that have to be drained" ... Node="my-cluster-md-0-wxtcg-mtg57-ssfg8" podsToTriggerEviction="test-namespace/nginx-deployment-6886c85ff7-4r297, test-namespace/nginx-deployment-6886c85ff7-5gl2h, test-namespace/nginx-deployment-6886c85ff7-64tf9, test-namespace/nginx-deployment-6886c85ff7-9k5gp, test-namespace/nginx-deployment-6886c85ff7-9mdjw, ... (5 more)" podsWithDeletionTimestamp="kube-system/calico-kube-controllers-7dc5458bc6-rdjj4, kube-system/coredns-7db6d8ff4d-9cbhn"
```

On log level 4 it is possible to observe details of the Pod evictions, e.g.:

```text
I0830 13:29:56.211951 17 drain.go:224] "Evicting Pod" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh" Pod="test-namespace/nginx-deployment-6886c85ff7-77fpw"
I0830 13:29:56.211951 17 drain.go:229] "Pod eviction successfully triggered" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh" Pod="test-namespace/nginx-deployment-6886c85ff7-77fpw"
```

After Pods have been evicted, either the drain is directly completed:

```text
I0830 13:29:56.235398 17 machine_controller.go:727] "Drain completed, remaining Pods on the Node have been evicted" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh"
```

or we are requeuing:

```text
I0830 13:29:56.235398 17 machine_controller.go:736] "Drain not completed yet, requeuing in 20s" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh" podsFailedEviction="test-namespace/nginx-deployment-6886c85ff7-77fpw, test-namespace/nginx-deployment-6886c85ff7-8dq4q, test-namespace/nginx-deployment-6886c85ff7-8gjhf, test-namespace/nginx-deployment-6886c85ff7-jznjw, test-namespace/nginx-deployment-6886c85ff7-l5nj8, ... (5 more)" podsWithDeletionTimestamp="kube-system/calico-kube-controllers-7dc5458bc6-rdjj4, kube-system/coredns-7db6d8ff4d-9cbhn"
```

Eventually the Machine controller should log:

```text
I0830 13:29:56.235398 17 machine_controller.go:702] "Drain completed" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh"
```
If this doesn't happen, please take a closer look at the logs to determine which Pods still have to be evicted or haven't gone away yet (i.e. `deletionTimestamp` is set but the Pod objects still exist).
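Assuming a default installation (a `capi-controller-manager` Deployment in the `capi-system` namespace; adjust names if your setup differs), the logs above can be retrieved with, e.g.:

```bash
# Filter the Machine controller logs for drain-related messages.
kubectl logs -n capi-system deployment/capi-controller-manager | grep -i drain
```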
## Related documentation

For more information, please see: