Troubleshooting Pod Lifecycle Failures¶
This guide provides the standard procedures for diagnosing and resolving common Pod failures within OpenShift Container Platform. When a workload is not in the Running or Ready state, follow the workflows below based on the specific status observed.
Initial Diagnostic Commands¶
Before analyzing specific errors, gather the high-level status of the workload:
# Identify pods that are not Running
oc get pods -A --field-selector=status.phase!=Running
# View events related to a specific pod
oc describe pod <pod_name> -n <namespace>
# Check the last 100 lines of logs (if the container started)
oc logs <pod_name> -n <namespace> --tail=100
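If the failing namespace is not yet known, recent Warning events across the cluster usually point at the affected workload. A quick triage sketch, assuming cluster-wide read access:
# List recent Warning events in all namespaces, newest last
oc get events -A --field-selector=type=Warning --sort-by=.lastTimestamp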
1. Pending¶
A Pending status indicates that the Pod cannot be scheduled onto a Node.
Common Causes¶
- Insufficient Resources: The cluster lacks enough CPU or Memory to satisfy the Pod's resource requests.
- Taints and Tolerations: The Pod does not have the required tolerations to run on the available Nodes (see the check after this list).
- PVC Issues: The required Persistent Volume Claim (PVC) is not bound, or the storage is located in a different availability zone than the available nodes.
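For the Taints and Tolerations cause, the taints currently applied to the Nodes can be listed directly and compared against the Pod's tolerations. A minimal check, assuming permission to read Node objects:
# Show the taints on each node; the Pod needs matching tolerations to land on a tainted node
oc get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints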
Resolution Steps¶
- Inspect Events: Check the Events section at the bottom of the describe output for FailedScheduling messages.
oc describe pod <pod_name> -n <namespace>
- Check Node Resources: Verify if the nodes have reached their resource capacity.
oc adm top nodes
- Verify PVC Status: Ensure that all volumes required by the pod are successfully bound.
oc get pvc -n <namespace>
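If the FailedScheduling events show that no Node can satisfy the requested CPU or memory, and the requests are known to be oversized, they can be lowered on the owning workload. A sketch assuming the Pod is managed by a Deployment named <deployment_name> and that the values shown are only illustrative:
# Lower the resource requests on the owning Deployment (example values)
oc set resources deployment/<deployment_name> -n <namespace> --requests=cpu=250m,memory=256Mi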
2. CrashLoopBackOff¶
This status indicates that the container starts, but the application process keeps exiting with an error, and OpenShift is backing off between restart attempts.
Common Causes¶
- Application Error: Internal code failure, missing environment variables, or database connection timeouts.
- Liveness Probe Failure: The health check parameters are too restrictive or the application takes too long to start, so the kubelet kills the container before it reports healthy.
- Permission Denied: The ServiceAccount does not have the necessary SCC (Security Context Constraints) to run the container as a specific user.
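For the Permission Denied cause, the SCC that was actually applied to the Pod is recorded as an annotation at admission time. A quick check, assuming the Pod was admitted at least once:
# Show which SCC admission assigned to the pod
oc get pod <pod_name> -n <namespace> -o yaml | grep "openshift.io/scc"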
Resolution Steps¶
- Check Previous Logs: View logs from the container instance that crashed to see the error output.
oc logs <pod_name> -n <namespace> --previous
- Inspect Probes: Verify if the failure is caused by a health check killing the pod prematurely.
oc describe pod <pod_name> -n <namespace> | grep -E "Liveness|Readiness"
- Check Exit Codes: Look for specific termination codes in the container status (e.g., 137 for OOMKilled, 1 for generic error, or 127 for command not found).
oc get pod <pod_name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
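If a liveness probe is killing a slow-starting application, the probe timings can be relaxed on the owning workload. A sketch assuming a Deployment named <deployment_name> and illustrative timing values:
# Give the application more time before the liveness probe can restart the container
oc set probe deployment/<deployment_name> -n <namespace> --liveness --initial-delay-seconds=60 --failure-threshold=5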
3. ImagePullBackOff / ErrImagePull¶
The node is unable to retrieve the container image from the specified registry.
Common Causes¶
- Incorrect Image Path: Typo in the image name, repository path, or tag.
- Authentication Failure: The imagePullSecret is missing from the namespace, or the registry credentials have expired or changed.
- Network Isolation: The Node where the Pod is scheduled cannot reach the external or internal registry (Quay, Artifactory, Docker Hub) due to firewall or proxy issues.
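If the cause is a missing or expired pull secret, the credential can be recreated and linked to the ServiceAccount that runs the Pod. A sketch assuming the Pod uses the default ServiceAccount and that <pull_secret_name>, <registry_host>, <user>, and <password> are placeholders for the real values:
# Recreate the registry credential and attach it to the ServiceAccount for image pulls
oc create secret docker-registry <pull_secret_name> -n <namespace> \
  --docker-server=<registry_host> --docker-username=<user> --docker-password=<password>
oc secrets link default <pull_secret_name> --for=pull -n <namespace>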
Resolution Steps¶
- Verify Image String: Confirm the exact image path and tag the Pod is attempting to use.
oc get pod <pod_name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'
- Check Pull Events: Inspect the event log specifically for image pull errors to see the exact response from the registry.
oc get events -n <namespace> | grep -i "pull"
- Manual Verification: Confirm the image exists in the Registry UI, or use the skopeo tool to inspect the remote image without pulling it.
skopeo inspect docker://<image_path>:<tag>
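To rule out network isolation on the specific Node, the pull can be attempted from the Node itself. A sketch assuming debug access to Nodes (this requires elevated privileges):
# Open a debug session on the node and try the pull with the host container runtime
oc debug node/<node_name> -- chroot /host podman pull <image_path>:<tag>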
4. OOMKilled (Out of Memory)¶
The Pod was terminated because it exceeded its defined memory limit or the physical Node ran out of available RAM.
Resolution Steps¶
- Confirm Termination Reason: Check the pod status specifically for the OOMKilled reason in the container's last state.
oc get pod <pod_name> -n <namespace> -o yaml | grep -A 5 "terminated"
- Review Limits: Compare the memory limits defined in the Deployment against the application's actual requirements.
oc describe deployment <deployment_name> -n <namespace>
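To compare the configured limit against real consumption, per-container usage can be read from the metrics stack, and the limit raised if the application genuinely needs more. A sketch assuming metrics are available, the workload is a Deployment, and 1Gi is only an illustrative value:
# Show current memory usage per container
oc adm top pod <pod_name> -n <namespace> --containers
# Raise the memory limit on the owning Deployment if the usage justifies it
oc set resources deployment/<deployment_name> -n <namespace> --limits=memory=1Gi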
> Escalation Criteria: If the status is Pending due to FailedScheduling and there are no obvious resource issues, escalate to L2 Infrastructure as this may indicate a problem with the MachineSet or Node taints.
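When escalating, the current MachineSet state is useful context for the L2 team. A capture sketch, assuming read access to the openshift-machine-api namespace:
# Record MachineSet replica counts to attach to the escalation
oc get machinesets -n openshift-machine-api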