Operational Procedures (SOPs)
This document outlines the Standard Operating Procedures for routine L1 administrative tasks within the OpenShift Container Platform. Adherence to these workflows ensures security compliance and operational stability.
1. User Management & Access Control
L1 engineers are responsible for verifying access and managing basic permissions.
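Common Tasks
- Verify Current Permissions: List which users and groups can perform a given action within a namespace, then confirm whether the affected user (or one of their groups) appears in the output. This is the first check when troubleshooting "Permission Denied" errors.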
oc adm policy who-can <verb> <resource> -n <namespace>
# Example: Check who can delete pods in the production namespace
oc adm policy who-can delete pods -n production
- Check Group Membership: Identify which groups a user belongs to, as permissions are often inherited via groups.
oc get groups | grep <username>
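The `who-can` command answers "who may do X?". To ask the reverse question for one specific user, `oc auth can-i` can be combined with impersonation. The following is a minimal sketch rather than an established part of this SOP: the function name `user_can` is hypothetical, and it assumes your own account is permitted to impersonate other users with `--as`.

```shell
#!/usr/bin/env bash
# Sketch: ask the API server whether one user may perform one action.
# user_can is a hypothetical helper name; --as requires impersonation rights.
user_can() {
  local user="$1" verb="$2" resource="$3" namespace="$4"
  local answer

  # `oc auth can-i` prints "yes" or "no" (sometimes followed by a reason).
  answer=$(oc auth can-i "$verb" "$resource" --as="$user" -n "$namespace" 2>/dev/null)

  case "$answer" in
    yes*) echo "ALLOWED: $user can $verb $resource in $namespace" ;;
    no*)  echo "DENIED: $user cannot $verb $resource in $namespace" ;;
    # Anything else (empty output, API error, impersonation refused) is not a verdict.
    *)    echo "UNKNOWN: could not evaluate access for $user" ;;
  esac
}
```

Usage: `user_can <username> delete pods production`. An `UNKNOWN` result usually means the check itself failed, so it should not be read as a denial.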
Escalation Policy
Escalation Required: If a user requires a new ClusterRoleBinding (cluster-wide permissions) or needs integration with a new LDAP/Identity Provider, escalate the request to the L2 IAM Team.
2. Secret and ConfigMap Management
Procedures for managing application configurations and sensitive data. Proper handling is critical to ensure application security and uptime.
Secret Rotation (Manual)
When a password or API key needs to be updated, follow these steps to ensure a safe transition.
- Back up the existing secret: Always create a local backup before making manual changes.
oc get secret <secret_name> -n <namespace> -o yaml > secret_backup.yaml
- Update the data: Encode the new value in base64 and apply the patch to the secret.
# Generate the base64 string
echo -n 'new_password' | base64
# Patch the existing secret with the new value
oc patch secret <secret_name> -n <namespace> -p='{"data":{"password":"<base64_value>"}}'
- Trigger App Refresh: Environment variables sourced from secrets are not automatically injected into running processes. You must restart the deployment to pick up the changes.
oc rollout restart deployment/<deployment_name> -n <namespace>
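The rotation steps above can be sketched as a single helper. This is an illustration under stated assumptions, not a sanctioned tool: the function names `encode_secret_value` and `rotate_secret_key` and their argument order are hypothetical, the key name is assumed to contain no characters that need JSON escaping, and the final `oc rollout status` call is an addition that waits for the restart to finish.

```shell
#!/usr/bin/env bash
# Sketch of the manual rotation steps; all function names are hypothetical.

# Base64-encode a value without a trailing newline or line wrapping.
encode_secret_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

rotate_secret_key() {
  local namespace="$1" secret="$2" key="$3" new_value="$4" deployment="$5"
  local backup encoded
  backup="${secret}_backup_$(date +%Y%m%d%H%M%S).yaml"

  # 1. Back up the current secret; stop if the backup cannot be written.
  oc get secret "$secret" -n "$namespace" -o yaml > "$backup" || return 1

  # 2. Encode the new value and patch only the one key.
  encoded=$(encode_secret_value "$new_value")
  oc patch secret "$secret" -n "$namespace" --type merge \
    -p "{\"data\":{\"$key\":\"$encoded\"}}" || return 1

  # 3. Restart the deployment so pods pick up the new value, then wait for it.
  oc rollout restart "deployment/$deployment" -n "$namespace" || return 1
  oc rollout status "deployment/$deployment" -n "$namespace"
}
```

Usage: `rotate_secret_key <namespace> <secret_name> password 'new_password' <deployment_name>`. Passing the value as an argument leaves it in shell history, so reading it from a prompt or a file may be preferable in practice. The backup file contains the old secret data and should be stored and deleted accordingly.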
3. Certificate Monitoring
Monitoring the validity of internal and external certificates is vital to prevent service outages, "Invalid Certificate" browser warnings, and broken internal API communications.
Checking Expiration
- Check Cluster Operator Certificates: Monitor the status of certificates managed by the cluster configuration.
oc get secrets -n openshift-config-managed | grep -i cert
- Verify Route Expiry: For external-facing HTTPS routes, manually verify the expiration date of the provided certificate.
oc get route <route_name> -n <namespace> -o jsonpath='{.spec.tls.certificate}' | openssl x509 -enddate -noout
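Reading the end date by eye works for a single route. For a yes/no answer against a threshold, `openssl x509 -checkend` can be used instead. The sketch below is illustrative only: the function name `cert_status` and the status strings are hypothetical. It reads a PEM certificate on stdin, and an empty input typically means the route has no certificate embedded in its spec (for example, it relies on the router's default certificate).

```shell
#!/usr/bin/env bash
# Sketch: classify a PEM certificate on stdin against an expiry threshold in days.
# cert_status and its output strings are hypothetical names.
cert_status() {
  local days="$1" pem
  pem=$(cat)

  # No certificate in the route spec: nothing to check here.
  if [ -z "$pem" ]; then
    echo "NO_CERT"
    return 0
  fi

  # -checkend N exits 0 only if the certificate is still valid N seconds from now.
  # It also exits non-zero when the input cannot be parsed as a certificate.
  if printf '%s\n' "$pem" | openssl x509 -noout -checkend "$(( days * 86400 ))" >/dev/null 2>&1; then
    echo "OK"
  else
    echo "EXPIRING_OR_UNREADABLE"
  fi
}
```

Usage: `oc get route <route_name> -n <namespace> -o jsonpath='{.spec.tls.certificate}' | cert_status 30` reports whether the route certificate remains valid for at least 30 more days.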
Resolution
- Automatic Rotation: If a Service CA certificate (internal) is approaching expiration, OpenShift is designed to rotate it automatically.
- Failure Protocol: If the automatic rotation fails, or if a manual Route certificate has expired, do not attempt to delete the certificate. Escalate the ticket immediately to the L2 Infrastructure Team.
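Before escalating a suspected rotation failure, it can help to attach the current Service CA expiry date to the ticket. The following read-only sketch assumes the signing certificate is stored in the `signing-key` secret in the `openshift-service-ca` namespace and that your account may read it; the function name `service_ca_enddate` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the notAfter date of the Service CA signing certificate.
# Read-only; assumes the secret signing-key in namespace openshift-service-ca.
service_ca_enddate() {
  oc get secret signing-key -n openshift-service-ca \
    -o jsonpath='{.data.tls\.crt}' \
    | base64 --decode \
    | openssl x509 -noout -enddate
}
```

Usage: run `service_ca_enddate` and copy the `notAfter=...` line into the escalation ticket. If the command is denied, note that in the ticket rather than requesting extra permissions.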
4. Escalation Matrix
When L1 troubleshooting steps (such as log analysis, resource checks, or Cordon/Drain procedures) do not resolve the incident, follow this matrix to transition the ticket to the appropriate team.
Support Tiers
| Level | Role | Trigger Criteria |
|---|---|---|
| L1 | Support Analyst | Initial triage, log gathering, pod restarts, and basic storage/network connectivity checks. |
| L2 | Cluster Admin | Persistent node failures, SDN/Networking issues, or backend storage errors. |
| L3 | SRE / Architect | Cluster-wide performance degradation, API Server instability, or etcd database corruption. |
| Vendor | Red Hat Support | Confirmed product bugs, core OpenShift platform failure, or as directed by L3. |