
Troubleshooting Storage and Persistent Volume Synchronization

This guide provides diagnostic procedures for resolving issues related to Persistent Volumes (PVs) and Persistent Volume Claims (PVCs). Storage synchronization issues typically leave Pods stuck in ContainerCreating or cause applications to fail with read-only filesystem errors.


Initial Storage Diagnostics

When a workload fails to mount storage or reports disk-related errors, use these commands to verify the state of the storage subsystem:

# Check the status of all PVCs in a specific namespace
oc get pvc -n <namespace>

# View the details and bound status of a specific Persistent Volume
oc get pv <pv_name>

# Check for storage-related events across the namespace
oc get events -n <namespace> --field-selector type=Warning | grep -E "Storage|Volume|Mount"

1. PVC Stuck in Pending

A Pending status indicates that the OpenShift control plane cannot satisfy the storage request, preventing the volume from being created or bound to the claim.

Common Causes

  • StorageClass Mismatch: The requested storageClassName in the PVC manifest does not exist in the cluster or is not available in the specific availability zone where the nodes reside.
  • Insufficient Capacity: The underlying infrastructure provider (AWS EBS, VMware vSphere, Azure Disk, etc.) has reached its resource quota or the physical hardware is out of space.
  • Incompatible Access Modes: The Pod requires ReadWriteMany (RWX) for multi-node access, but the provisioned StorageClass only supports ReadWriteOnce (RWO).

Resolution Steps

  • Inspect PVC Events: Examine the "Events" section of the PVC description to find the specific error message from the provisioner.
oc describe pvc <pvc_name> -n <namespace>

  • Verify StorageClass Availability: Ensure the requested StorageClass is present and check its provisioner type.

oc get storageclass
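If the claim requests a class or access mode the cluster cannot satisfy, correcting the PVC manifest is usually the fix. A minimal sketch (the `gp3-csi` class and `app-data`/`my-app` names are placeholders; use a class actually listed by oc get storageclass):

```yaml
# Hypothetical PVC manifest. storageClassName must match an existing StorageClass,
# and the access mode must be one the backing provisioner supports.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
  namespace: my-app
spec:
  accessModes:
    - ReadWriteOnce        # request ReadWriteMany only if the backend supports RWX
  resources:
    requests:
      storage: 10Gi
  storageClassName: gp3-csi
```

Note that storageClassName cannot be changed on an existing claim; a PVC stuck in Pending with the wrong class must be deleted and recreated.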

2. Multi-Attach Errors

This error typically affects ReadWriteOnce (RWO) volumes: the Persistent Volume is still attached to one cluster Node, and the storage backend (e.g., AWS EBS, vSphere) refuses to attach it to the different Node where the Pod has been rescheduled.

Common Symptoms

  • Status Stalled: The Pod remains stuck in the ContainerCreating phase indefinitely.
  • Events Log: The oc describe pod output shows a FailedAttachVolume warning with the message: Multi-Attach error for volume "pv-name" Volume is already used by pod...

Resolution Steps

  • Identify the Locking Node: Locate the specific VolumeAttachment object to identify which Node currently holds the lock on the Persistent Volume.
oc get volumeattachment | grep <pv_name>

  • Safe Termination: Ensure the Pod previously using the volume on the "old" Node is completely terminated. If the Node is unresponsive (NotReady), you may need to force-delete the Pod to signal the controller to release the volume.

Force deleting a pod with a bound volume should only be done if you are certain the Node is not actively writing to the disk to avoid data corruption.

oc delete pod <old_pod_name> -n <namespace> --force --grace-period=0
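For Deployments backed by RWO volumes, rolling updates can trigger this error on their own, because the replacement Pod is scheduled before the old Pod has released the volume. One common mitigation is the Recreate strategy; the sketch below uses placeholder names and image:

```yaml
# Deployment fragment (names/image are placeholders). strategy: Recreate terminates
# the old Pod, releasing its RWO volume, before the new Pod is scheduled, which
# avoids the Multi-Attach window entirely at the cost of brief downtime.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  strategy:
    type: Recreate          # default RollingUpdate runs old and new Pods concurrently
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: registry.example.com/app:latest
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: app-data
```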

3. FailedMount / Unmountable Volume

The Pod is successfully assigned to a Node, but the Node's operating system is unable to physically mount the requested filesystem.

Common Causes

  • Missing Utilities: The Node is missing the required client-side binaries to handle the storage protocol (e.g., nfs-utils for NFS or iscsi-initiator-utils for iSCSI).
  • Network Latency / Connectivity: The Node cannot establish a connection to the external storage array (e.g., NetApp, Dell EMC, or AWS) due to firewall rules, incorrect routing, or network saturation.

Resolution Steps

  • Check Node-Level Mounts: Access the Node directly via a debug shell to verify if the mount point exists and if there are kernel-level mounting errors.
oc debug node/<node_name>

# Once inside the debug shell:
chroot /host
df -h | grep /var/lib/kubelet/pods/

  • Verify Secret Access: For authenticated storage (such as CIFS, Azure Files, or encrypted volumes), ensure the required storage secret is present in the namespace and correctly referenced in the PVC or Deployment.

oc get secret -n <namespace>
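For CSI drivers that need credentials at mount time, the secret is usually wired through StorageClass parameters rather than the PVC itself, using the standard csi.storage.k8s.io secret parameter keys. A hedged sketch (driver, class, secret, and namespace names are all placeholders):

```yaml
# Hypothetical StorageClass for an authenticated SMB/CIFS share. The node-stage
# secret keys below are the standard CSI mechanism for passing mount credentials;
# the referenced Secret must exist in the stated namespace or mounting fails.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cifs-auth
provisioner: smb.csi.k8s.io                                 # placeholder CSI driver
parameters:
  csi.storage.k8s.io/node-stage-secret-name: smb-creds      # Secret with mount credentials
  csi.storage.k8s.io/node-stage-secret-namespace: my-app    # namespace containing that Secret
```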

4. Filesystem Synchronization & Quotas

Applications may report errors such as "Disk Full" or "Read-only file system" even when the PVC is successfully bound and the cluster shows the volume as healthy.

Resolution Steps

  • Check Usage from Inside the Pod: Verify the actual disk utilization from the perspective of the application container to rule out internal filesystem exhaustion.
oc exec <pod_name> -n <namespace> -- df -h

  • Verify Permissions: Ensure the Pod's securityContext (UID/GID) matches the ownership and permissions of the mounted volume. Mismatched ownership typically surfaces as "Permission denied" during write attempts, while a volume mounted read-only produces "Read-only file system" errors.

oc get pod <pod_name> -n <namespace> -o yaml | grep -A 3 "securityContext"
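When volume ownership does not match the container user, setting fsGroup in the Pod-level securityContext lets the kubelet adjust group ownership of supported volumes at mount time. A sketch with placeholder UID/GID values (on OpenShift these are normally assigned from the namespace's SCC-allocated ranges):

```yaml
# Hypothetical Pod; the UID/GID values are illustrative and must fall within
# the namespace's allowed range on OpenShift.
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  securityContext:
    fsGroup: 1000910000        # volume files become group-accessible to this GID at mount
  containers:
  - name: app
    image: registry.example.com/app:latest
    securityContext:
      runAsUser: 1000910000    # container process UID; must be able to write the volume
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: app-data
```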

  • Expand Volume: If the StorageClass has allowVolumeExpansion set to true, you can resolve capacity issues by increasing the PVC size dynamically.

oc patch pvc <pvc_name> -n <namespace> -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
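The patch only takes effect if the class opts in to expansion. A minimal StorageClass sketch (the class name and AWS EBS CSI provisioner are examples; substitute your own driver):

```yaml
# Example StorageClass with expansion enabled. Without allowVolumeExpansion: true,
# patching a PVC's storage request is rejected by the API server.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-csi
provisioner: ebs.csi.aws.com     # example CSI provisioner
allowVolumeExpansion: true       # required for dynamic PVC resizing
parameters:
  type: gp3
```

Keep in mind that PVC storage requests can only be increased, never decreased.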

Escalation Criteria: If a Persistent Volume is stuck in a Released state and needs to be manually reclaimed/re-bound, or if the underlying storage backend (vCenter, AWS Console, NetApp) shows hardware-level alerts or connectivity failures, escalate to the L2 Infrastructure/Storage Team.