ETCD Health & Latency

ETCD is the "brain" of OpenShift; it is the distributed key-value store that holds every piece of configuration and state information about the cluster. If ETCD is slow or unstable, the entire cluster (including the Web Console and CLI) will feel laggy, unresponsive, or may even fail to process requests.

Common Symptoms

  • Command Lag: Standard oc commands take 10+ seconds to respond or time out.
  • Console Timeouts: The Web Console shows "An error occurred," "Service Unavailable," or takes a significant amount of time to load dashboard elements.
  • Operator Alerts: The etcd Cluster Operator shows a status of Degraded, Unavailable, or Progressing for an extended period.

Diagnostic Steps

  • Check ETCD Operator Status: This is the primary health check for L1. If this operator is not Available: True, there is a significant platform issue.
oc get co etcd
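
The operator check above can be turned into a pass/fail decision. The following is a minimal sketch, not a definitive implementation: `check_etcd_conditions` is a hypothetical helper name, and it assumes that "healthy" means Available is True and Degraded is False. It reads `Type=Status` lines on stdin, so it can be tried without cluster access.

```shell
# Hypothetical helper: reads "Type=Status" lines on stdin
# (for example "Available=True") and prints OK or ESCALATE.
check_etcd_conditions() {
  available=""
  degraded=""
  while IFS='=' read -r type status; do
    case "$type" in
      Available) available="$status" ;;
      Degraded)  degraded="$status" ;;
    esac
  done
  # Assumption: healthy means Available=True and Degraded=False
  if [ "$available" = "True" ] && [ "$degraded" = "False" ]; then
    echo "OK"
  else
    echo "ESCALATE"
    return 1
  fi
}

# Usage against a live cluster (requires a logged-in oc session):
# oc get co etcd -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | check_etcd_conditions
```

The jsonpath expression in the usage line prints one condition per line, which keeps the parsing independent of the column layout of the default table output.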

  • Check Member Health: Verify that the three ETCD members (running on the Master/Control Plane nodes) are synced and their pods are healthy.

# Check the status conditions of the ETCD custom resource (named "cluster")
oc get etcd cluster -o yaml

# Check the status of the ETCD pods in the openshift-etcd namespace
oc get pods -n openshift-etcd
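
To spot an unhealthy member quickly, the pod listing can be filtered for anything that is not fully ready. This is a sketch under stated assumptions: `find_unhealthy_pods` is a hypothetical helper name, and it expects the default `oc get pods --no-headers` column order (NAME, READY, STATUS, ...). The `-l app=etcd` selector in the usage line is the same label used for the log query in this guide, and it keeps completed installer pods out of the result.

```shell
# Hypothetical helper: reads `oc get pods --no-headers` lines on stdin
# and prints the name of every pod that is not Running or whose ready
# container count is lower than its total container count.
find_unhealthy_pods() {
  awk '
    BEGIN { bad = 0 }
    {
      split($2, ready, "/")          # READY column, e.g. "3/4"
      if ($3 != "Running" || ready[1] != ready[2]) {
        print $1
        bad = 1
      }
    }
    END { exit bad }                 # non-zero exit if any pod was flagged
  '
}

# Usage against a live cluster (requires a logged-in oc session):
# oc get pods -n openshift-etcd -l app=etcd --no-headers | find_unhealthy_pods
```

An empty result with exit status 0 means every listed pod is Running with all containers ready.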

  • Monitor Disk Latency: ETCD is extremely sensitive to disk I/O. If the disk cannot complete writes fast enough, members miss heartbeats, which can trigger leader elections and cause the cluster to lose sync.

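# Check for 'slow fdatasync' warnings in the logs, which indicate disk performance issues.
# With a label selector, oc logs returns only the last 10 lines per pod unless --tail is set.
oc logs -n openshift-etcd -l app=etcd -c etcd --tail=-1 | grep -i "slow fdatasync"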
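
A count of the warnings is often easier to report in an escalation than raw log lines. The sketch below assumes a hypothetical helper name, `count_slow_fdatasync`, and reads log text from stdin. In the usage line, `-c etcd` selects the etcd container and `--tail=-1` requests the full log rather than the last few lines.

```shell
# Hypothetical helper: prints how many lines on stdin mention
# "slow fdatasync" (case-insensitive).
count_slow_fdatasync() {
  # grep -c exits non-zero when the count is 0, so mask that status
  grep -ci "slow fdatasync" || true
}

# Usage against a live cluster (requires a logged-in oc session):
# oc logs -n openshift-etcd -l app=etcd -c etcd --tail=-1 | count_slow_fdatasync
```

This guide does not define a threshold for the count, so treat a non-zero result as a signal to include in the escalation rather than as a verdict.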

Escalation Criteria

  1. If an ETCD member is Unhealthy or the operator is Degraded, escalate to L2 Cluster Admins immediately. ETCD issues can lead to total cluster failure.
  2. If a namespace is stuck in Terminating and clearing resources doesn't work, do not attempt to "force delete" via the API; escalate to L2.