Troubleshooting Node Pressure and Resource Exhaustion¶
This guide outlines the procedures for diagnosing and resolving Node-level issues, specifically when a Node enters a "Pressure" state (Memory, Disk, or PID) or experiences general resource exhaustion.
Initial Node Diagnostics¶
When alerts indicate that a Node is unstable or workloads are being evicted, use these commands to assess the health of the infrastructure:
# List all nodes and their current status
oc get nodes
# View resource usage (CPU/Memory) across all nodes
oc adm top nodes
# Inspect a specific node for Pressure conditions
oc describe node <node_name> | grep -A 5 "Conditions"
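If several nodes are alerting at once, it can be quicker to print all three pressure conditions for every node in a single query. The following one-liner is a sketch using JSONPath filter expressions in custom columns; the exact output formatting may vary slightly between oc versions.
# Print MemoryPressure, DiskPressure, and PIDPressure status for every node
oc get nodes -o custom-columns='NAME:.metadata.name,MEMORY:.status.conditions[?(@.type=="MemoryPressure")].status,DISK:.status.conditions[?(@.type=="DiskPressure")].status,PID:.status.conditions[?(@.type=="PIDPressure")].status'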
1. Memory Pressure¶
Memory Pressure occurs when the Node's available RAM falls below the eviction threshold, threatening the stability of the host operating system.
Common Symptoms¶
- Node Condition: Node status reflects MemoryPressure: True.
- Pod Eviction: Pods are terminated with the reason Evicted and the message Node had memory pressure.
- System Instability: System processes or the Kubelet become unresponsive or slow to react to commands.
Resolution Steps¶
- Identify Top Memory Consumers: Determine which Pods are consuming the most memory across the cluster or on the specific node, and look for outliers.
oc adm top pods -A --sort-by=memory
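To narrow the search to the node under pressure, list the Pods scheduled on it and cross-reference them with the top output (this assumes the cluster's metrics API is available):
# List only the Pods scheduled on the affected node
oc get pods -A --field-selector spec.nodeName=<node_name> -o wide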
- Check for OOM Events: Inspect the node-level kernel logs to identify if the Out-Of-Memory (OOM) killer has been triggered.
oc adm node-logs <node_name> | grep -i "oom"
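The OOM killer's log text varies slightly between kernel versions, so a broader pattern can help; this is a sketch of the message variants commonly seen on RHCOS nodes.
# Match the most common kernel OOM-killer message variants
oc adm node-logs <node_name> | grep -iE "out of memory|oom-kill|killed process"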
- Verify Resource Limits: Ensure that workloads have explicit memory limits defined in their deployment manifests. This prevents a single "leaking" container from consuming all available RAM on a Node.
oc get pod <pod_name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
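If limits are missing, they can be added either in the Deployment manifest or directly with oc set resources. The values below are placeholders only and should be sized from the workload's observed usage.
# Example only: set explicit memory requests and limits on a Deployment
oc set resources deployment/<deployment_name> -n <namespace> --requests=memory=256Mi --limits=memory=512Mi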
2. Disk Pressure¶
Disk Pressure occurs when the Node’s root filesystem or the dedicated storage allocated for container images is nearly full, reaching the eviction threshold.
Common Symptoms¶
- Node Condition: Node status reflects DiskPressure: True.
- Deployment Stalls: New Pods cannot start on the affected node because the Kubelet cannot pull or extract new container images.
- Mass Eviction: Existing Pods are evicted by the Kubelet to reclaim disk space for system stability.
Resolution Steps¶
- Identify Disk Usage: Access the node's underlying host filesystem to identify which partition (e.g., /var/lib/containers or /var/log) is at capacity.
oc debug node/<node_name> -- chroot /host df -h
- Clean Up Unused Images: Run the OpenShift image pruner to remove orphaned or unused images from the internal registry that have not been garbage-collected. Without --confirm the command performs a dry run and only reports what would be removed.
oc adm prune images --confirm
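Because the image pruner works against the internal registry, it does not by itself free space in the node-local image store under /var/lib/containers. To reclaim that directly, a debug-shell prune can be used (sketch, assuming the node runs CRI-O with crictl available, which is the RHCOS default):
# Remove unused images from the node's local container storage
oc debug node/<node_name> -- chroot /host crictl rmi --prune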
- Inspect Log Sizes: Check if the node's log directory is filling up due to high application verbosity or system journal growth.
oc debug node/<node_name> -- chroot /host sh -c 'du -sh /var/log/*'
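If the systemd journal turns out to be the main consumer, it can be trimmed in place; the 500M target below is an arbitrary example value.
# Shrink the journal on the node to roughly 500 MB (example size)
oc debug node/<node_name> -- chroot /host journalctl --vacuum-size=500M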
3. PID Pressure¶
PID Pressure occurs when the number of active processes or threads on the Node exceeds the maximum allowed threshold (PID limit) set by the operating system or the Kubelet.
Common Symptoms¶
- Node Condition: Node status reflects PIDPressure: True.
- Fork Failures: Applications fail to start new threads or child processes, resulting in "fork: retry: Resource temporarily unavailable" or "failed to create thread" errors in logs.
- Service Disruption: Critical system services may fail to restart or spawn necessary sub-processes.
Resolution Steps¶
- Count Processes: Access the node host to identify if a specific process is "fork-bombing" or creating excessive threads.
oc debug node/<node_name> -- chroot /host sh -c 'ps aux | wc -l'
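Because the PID accounting counts threads as well as processes, it is worth comparing the thread total, not just the process count, against the node's limits:
# Count every thread on the host (each line of ps -eLf is one thread)
oc debug node/<node_name> -- chroot /host sh -c 'ps -eLf | wc -l'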
- Identify Thread-Heavy Pods: Use the node debug shell to check which containers are consuming the highest number of PIDs.
oc debug node/<node_name> -- chroot /host sh -c 'ps -eo pid,nlwp,comm --sort=-nlwp | head -n 20'
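For a per-Pod view, the pids.current files under the kubepods cgroup hierarchy show how many PIDs each Pod is currently holding. The path below assumes the cgroup v2 layout used by recent RHCOS releases; adjust it if your nodes still use cgroup v1.
# Rank Pod-level cgroups by current PID usage (highest first)
oc debug node/<node_name> -- chroot /host sh -c 'for f in $(find /sys/fs/cgroup/kubepods.slice -name pids.current); do echo "$(cat $f) $f"; done | sort -rn | head -n 10'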
- Check PID Limits: Verify the current Kubelet configuration to see if the podPidsLimit is set too low for the specific workload requirements.
oc get kubeletconfig -o yaml
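The effective limits can also be checked on the node itself by comparing the kernel's global PID ceiling with the kubelet's rendered configuration (the file path below assumes the RHCOS default of /etc/kubernetes/kubelet.conf):
# Kernel-wide ceiling on PIDs/threads
oc debug node/<node_name> -- chroot /host sysctl kernel.pid_max
# PID-related settings in the rendered kubelet configuration
oc debug node/<node_name> -- chroot /host grep -i pids /etc/kubernetes/kubelet.conf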
4. Scheduling and Remediation¶
If a Node is deemed unhealthy or is under critical resource pressure, it must be isolated to prevent the scheduler from placing additional workloads on it and to allow for safe maintenance.
Operational Commands¶
- Cordon the Node: Use this command to mark the node as SchedulingDisabled. This ensures no new Pods are assigned to the node while you investigate the issue.
oc adm cordon <node_name>
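A quick check confirms the cordon took effect before moving on to the drain:
# The node should now report SchedulingDisabled in its STATUS column
oc get node <node_name>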
- Drain the Node: This procedure safely evicts all existing Pods from the node so that their controllers can recreate them on other healthy nodes in the cluster.
oc adm drain <node_name> --ignore-daemonsets --delete-emptydir-data
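Drains can hang on Pods that are not managed by a controller or that take a long time to terminate. The additional flags below (standard oc adm drain options) bound the wait and force-evict unmanaged Pods; use --force with care, because such Pods will not be recreated elsewhere.
# Example with a bounded wait and forced eviction of unmanaged Pods
oc adm drain <node_name> --ignore-daemonsets --delete-emptydir-data --force --grace-period=30 --timeout=300s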
- Uncordon the Node: Once the resource pressure has been resolved and the node is confirmed to be healthy, re-enable scheduling.
oc adm uncordon <node_name>
Escalation Criteria: If a Node remains in a NotReady state after a drain and reboot, or if DiskPressure persists despite pruning images and clearing logs, escalate the incident to the L2 Infrastructure Team for hardware or storage backend inspection.