Troubleshooting Pod Lifecycle Failures¶
This guide provides the standard procedures for diagnosing and resolving common Pod failures within OpenShift Container Platform. When a workload is not in the Running or Ready state, follow the workflows below based on the specific status observed.
Initial Diagnostic Commands¶
Before analyzing specific errors, gather the high-level status of the workload:
# Identify pods that are not Running
oc get pods -A --field-selector=status.phase!=Running
# View events related to a specific pod
oc describe pod <pod_name> -n <namespace>
# Check the last 100 lines of logs (if the container started)
oc logs <pod_name> -n <namespace> --tail=100
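If the failing namespace is not yet known, recent Warning events across the cluster usually point at the affected workload. A quick triage sketch, assuming cluster-wide read access:
# List recent Warning events in all namespaces, newest last
oc get events -A --field-selector=type=Warning --sort-by=.lastTimestamp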
1. Pending¶
A Pending status indicates that the Pod cannot be scheduled onto a Node.
Common Causes¶
- Insufficient Resources: The cluster lacks enough CPU or Memory to satisfy the Pod's resource requests.
- Taints and Tolerations: The Pod does not have the required tolerations to run on the available Nodes (see the check after this list).
- PVC Issues: The required Persistent Volume Claim (PVC) is not bound, or the storage is located in a different availability zone than the available nodes.
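For the Taints and Tolerations cause, the taints currently applied to the Nodes can be listed directly and compared against the Pod's tolerations. A minimal check, assuming permission to read Node objects:
# Show the taints on each node; the Pod needs matching tolerations to land on a tainted node
oc get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints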
Resolution Steps¶
- Inspect Events: Check the Events section at the bottom of the describe output for FailedScheduling messages.
oc describe pod <pod_name> -n <namespace>
- Check Node Resources: Verify if the nodes have reached their resource capacity.
oc adm top nodes
- Verify PVC Status: Ensure that all volumes required by the pod are successfully bound.
oc get pvc -n <namespace>
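If the FailedScheduling events show that no Node can satisfy the requested CPU or memory, and the requests are known to be oversized, they can be lowered on the owning workload. A sketch assuming the Pod is managed by a Deployment named <deployment_name> and that the values shown are only illustrative:
# Lower the resource requests on the owning Deployment (example values)
oc set resources deployment/<deployment_name> -n <namespace> --requests=cpu=250m,memory=256Mi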
2. CrashLoopBackOff¶
This status indicates that the container starts, but the application process keeps exiting with an error, and OpenShift is backing off between restart attempts.
Common Causes¶
- Application Error: Internal code failure, missing environment variables, or database connection timeouts.
- Liveness Probe Failure: The health check parameters are too restrictive or the application takes too long to start, so the kubelet kills the container before it reports healthy.
- Permission Denied: The ServiceAccount does not have the necessary SCC (Security Context Constraints) to run the container as a specific user.
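For the Permission Denied cause, the SCC that was actually applied to the Pod is recorded as an annotation at admission time. A quick check, assuming the Pod was admitted at least once:
# Show which SCC admission assigned to the pod
oc get pod <pod_name> -n <namespace> -o yaml | grep "openshift.io/scc"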
Resolution Steps¶
- Check Previous Logs: View logs from the container instance that crashed to see the error output.
oc logs <pod_name> -n <namespace> --previous
- Inspect Probes: Verify if the failure is caused by a health check killing the pod prematurely.
oc describe pod <pod_name> -n <namespace> | grep -E "Liveness|Readiness"
- Check Exit Codes: Look for specific termination codes in the container status (e.g., 137 for OOMKilled, 1 for generic error, or 127 for command not found).
oc get pod <pod_name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
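If a liveness probe is killing a slow-starting application, the probe timings can be relaxed on the owning workload. A sketch assuming a Deployment named <deployment_name> and illustrative timing values:
# Give the application more time before the liveness probe can restart the container
oc set probe deployment/<deployment_name> -n <namespace> --liveness --initial-delay-seconds=60 --failure-threshold=5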
3. ImagePullBackOff / ErrImagePull¶
The node is unable to retrieve the container image from the specified registry.
Common Causes¶
- Incorrect Image Path: Typo in the image name, repository path, or tag.
- Authentication Failure: The imagePullSecret is missing from the namespace, or the registry credentials have expired or changed.
- Network Isolation: The Node where the Pod is scheduled cannot reach the external or internal registry (Quay, Artifactory, Docker Hub) due to firewall or proxy issues.
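If the cause is a missing or expired pull secret, the credential can be recreated and linked to the ServiceAccount that runs the Pod. A sketch assuming the Pod uses the default ServiceAccount and that <pull_secret_name>, <registry_host>, <user>, and <password> are placeholders for the real values:
# Recreate the registry credential and attach it to the ServiceAccount for image pulls
oc create secret docker-registry <pull_secret_name> -n <namespace> \
  --docker-server=<registry_host> --docker-username=<user> --docker-password=<password>
oc secrets link default <pull_secret_name> --for=pull -n <namespace>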
Resolution Steps¶
- Verify Image String: Confirm the exact image path and tag the Pod is attempting to use.
oc get pod <pod_name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'
- Check Pull Events: Inspect the event log specifically for image pull errors to see the exact response from the registry.
oc get events -n <namespace> | grep -i "pull"
- Manual Verification: Confirm the image exists in the Registry UI, or use the skopeo tool to inspect the remote image without pulling it.
skopeo inspect docker://<image_path>:<tag>
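To rule out network isolation on the specific Node, the pull can be attempted from the Node itself. A sketch assuming debug access to Nodes (this requires elevated privileges):
# Open a debug session on the node and try the pull with the host container runtime
oc debug node/<node_name> -- chroot /host podman pull <image_path>:<tag>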
4. OOMKilled (Out of Memory)¶
The Pod was terminated because it exceeded its defined memory limit or the physical Node ran out of available RAM.
Resolution Steps¶
- Confirm Termination Reason: Check the pod status specifically for the OOMKilled reason in the container's last state.
oc get pod <pod_name> -n <namespace> -o yaml | grep -A 5 "terminated"
- Review Limits: Compare the memory limits defined in the Deployment against the application's actual requirements.
oc describe deployment <deployment_name> -n <namespace>
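To compare the configured limit against real consumption, per-container usage can be read from the metrics stack, and the limit raised if the application genuinely needs more. A sketch assuming metrics are available, the workload is a Deployment, and 1Gi is only an illustrative value:
# Show current memory usage per container
oc adm top pod <pod_name> -n <namespace> --containers
# Raise the memory limit on the owning Deployment if the usage justifies it
oc set resources deployment/<deployment_name> -n <namespace> --limits=memory=1Gi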
> Escalation Criteria: If the status is Pending due to FailedScheduling and there are no obvious resource issues, escalate to L2 Infrastructure as this may indicate a problem with the MachineSet or Node taints.
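When escalating, the current MachineSet state is useful context for the L2 team. A capture sketch, assuming read access to the openshift-machine-api namespace:
# Record MachineSet replica counts to attach to the escalation
oc get machinesets -n openshift-machine-api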