Operational Procedures (SOPs)

This document outlines the Standard Operating Procedures for routine L1 administrative tasks within the OpenShift Container Platform. Adherence to these workflows ensures security compliance and operational stability.


1. User Management & Access Control

L1 engineers are responsible for verifying access and managing basic permissions.

Common Tasks

  • Verify Current Permissions: List the users and groups that can perform a given action in a namespace, then confirm whether the affected user (or one of their groups) appears in the output. This is the first check when troubleshooting "Permission Denied" (Forbidden) errors.

    oc adm policy who-can <verb> <resource> -n <namespace>

    # Example: Check who can delete pods in the production namespace
    oc adm policy who-can delete pods -n production

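  • Check Group Membership: Identify which groups a user belongs to, as permissions are often inherited via groups. Because grep matches substrings, confirm that the name shown in the USERS column is an exact match for the user.

    oc get groups | grep <username>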
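
To answer the question for one specific user instead of reading the full who-can list, the two checks above can be wrapped in small helpers. This is a minimal sketch, not part of the official procedure. The function names user_can and groups_of are hypothetical. `oc auth can-i --as` works only if your own account is allowed to impersonate users, and depending on the cluster, permissions granted through groups may not be reflected unless the groups are also passed with --as-group.

```shell
# user_can USER VERB RESOURCE NAMESPACE
# Prints "yes" or "no" for a single user by impersonating them.
# Assumption: the caller holds impersonation rights on the cluster.
user_can() {
  oc auth can-i "$2" "$3" --as "$1" -n "$4"
}

# groups_of USER
# Prints the groups whose member list contains USER as an exact match.
# The go-template emits one "group<TAB>user" line per membership, and awk
# keeps only the lines where the user column equals USER.
groups_of() {
  oc get groups -o go-template='{{range .items}}{{$g := .metadata.name}}{{range .users}}{{$g}}{{"\t"}}{{.}}{{"\n"}}{{end}}{{end}}' |
    awk -F '\t' -v user="$1" '$2 == user { print $1 }'
}
```

Example usage after sourcing the functions:

    user_can <username> delete pods production
    groups_of <username>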

Escalation Policy

Escalation Required: If a user requires a new ClusterRoleBinding (cluster-wide permissions) or needs integration with a new LDAP/Identity Provider, escalate the request to the L2 IAM Team.

2. Secret and ConfigMap Management

Procedures for managing application configurations and sensitive data. Proper handling is critical to ensure application security and uptime.

Secret Rotation (Manual)

When a password or API key needs to be updated, follow these steps to ensure a safe transition.

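  1. Back up the existing secret: Always create a local backup before making manual changes. The backup holds the secret values in base64, which is an encoding and not encryption, so store the file securely and delete it once the rotation has been verified.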

    oc get secret <secret_name> -n <namespace> -o yaml > secret_backup.yaml
    
  2. Update the data: Encode the new value in base64 and apply the patch to the secret.

    # Generate the base64 string
    echo -n 'new_password' | base64
    
    # Patch the existing secret with the new value
    # (replace "password" with the key that is being rotated)
    oc patch secret <secret_name> -n <namespace> -p '{"data":{"password":"<base64_value>"}}'
    
  3. Trigger App Refresh: Environment variables sourced from secrets are not automatically injected into running processes. You must restart the deployment to pick up the changes.

    oc rollout restart deployment/<deployment_name> -n <namespace>
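
The three steps above can be chained so that the patch is never applied without a backup in place. This is a minimal sketch under stated assumptions: the function names are hypothetical, the secret key contains no characters that need JSON escaping, and the workload is a Deployment. The backup file is written with owner-only permissions because it contains the old secret values.

```shell
# Base64-encode a value on a single line (some base64 builds wrap long output).
encode_secret_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# build_secret_patch KEY VALUE
# Prints the JSON patch for one key of the secret's data field.
build_secret_patch() {
  printf '{"data":{"%s":"%s"}}' "$1" "$(encode_secret_value "$2")"
}

# rotate_secret NAMESPACE SECRET KEY NEW_VALUE DEPLOYMENT
rotate_secret() {
  ns=$1 secret=$2 key=$3 value=$4 deploy=$5
  backup="${secret}_backup_$(date +%Y%m%d%H%M%S).yaml"

  # Step 1: back up the current secret; stop if the backup fails.
  ( umask 077; oc get secret "$secret" -n "$ns" -o yaml > "$backup" ) || return 1

  # Step 2: patch the key with the newly encoded value.
  oc patch secret "$secret" -n "$ns" -p "$(build_secret_patch "$key" "$value")" || return 1

  # Step 3: restart the deployment and wait for the rollout to finish.
  oc rollout restart "deployment/$deploy" -n "$ns" || return 1
  oc rollout status "deployment/$deploy" -n "$ns"
}
```

Example usage after sourcing the functions:

    rotate_secret <namespace> <secret_name> password 'new_password' <deployment_name>

Passing the new value on the command line leaves it in the shell history, so reading it from a prompt or a protected file is safer where that matters.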
    

3. Certificate Monitoring

Monitoring the validity of internal and external certificates is vital to prevent service outages, "Invalid Certificate" browser warnings, and broken internal API communications.

Checking Expiration

  • Check Cluster Operator Certificates: List the certificate secrets managed by the cluster configuration and confirm that the expected entries are present.
    oc get secrets -n openshift-config-managed | grep -i cert
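  • Verify Route Expiry: For external-facing HTTPS routes, manually verify the expiration date of the provided certificate. This works only for routes that carry their own certificate in spec.tls.certificate; routes served by the default ingress certificate return an empty value.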
    oc get route <route_name> -n <namespace> -o jsonpath='{.spec.tls.certificate}' | openssl x509 -enddate -noout
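
Reading the end date by eye is error-prone when many routes are checked. The following is a minimal sketch of a threshold check built on the -checkend option of openssl x509. The function name cert_expires_within_days and the 30-day threshold in the usage example are illustrative assumptions, not values defined by this SOP.

```shell
# cert_expires_within_days DAYS   (PEM certificate on stdin)
# Returns 0 if the certificate expires within DAYS days (or has already expired),
#         1 if it remains valid beyond that,
#         2 if stdin did not contain a parsable certificate.
cert_expires_within_days() {
  days=$1
  pem=$(cat)

  # An empty or unparsable input means there is nothing to check.
  [ -n "$pem" ] || return 2
  printf '%s\n' "$pem" | openssl x509 -noout >/dev/null 2>&1 || return 2

  # -checkend exits 0 when the certificate is still valid after the given
  # number of seconds, and non-zero when it will have expired by then.
  if printf '%s\n' "$pem" | openssl x509 -noout -checkend "$((days * 86400))" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}
```

Example usage after sourcing the function:

    oc get route <route_name> -n <namespace> -o jsonpath='{.spec.tls.certificate}' | cert_expires_within_days 30 && echo "expires within 30 days"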
    

Resolution

  • Automatic Rotation: If a Service CA certificate (internal) is approaching expiration, OpenShift is designed to rotate it automatically.
  • Failure Protocol: If the automatic rotation fails, or if a manual Route certificate has expired, do not attempt to delete the certificate. Escalate the ticket immediately to the L2 Infrastructure Team.
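
To judge whether the Service CA is actually approaching expiration before escalating, its signing certificate can be inspected read-only. This is a minimal sketch, assuming the signing certificate is stored in the signing-key secret of the openshift-service-ca namespace (the location given in Red Hat's service CA documentation; verify it on your cluster version) and that your account is permitted to read that secret. The function name service_ca_enddate is hypothetical.

```shell
# Prints the expiry of the Service CA signing certificate as "notAfter=<date>".
# Read-only: nothing on the cluster is modified or deleted.
# Assumption: the certificate is in secret "signing-key", key "tls.crt",
# in namespace "openshift-service-ca".
service_ca_enddate() {
  oc get secret signing-key -n openshift-service-ca \
    -o go-template='{{index .data "tls.crt"}}' |
    base64 -d |
    openssl x509 -noout -enddate
}
```

If the command fails or the date has already passed, follow the Failure Protocol above and escalate without changing the secret.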

4. Escalation Matrix

When L1 troubleshooting steps (such as log analysis, resource checks, or Cordon/Drain procedures) do not resolve the incident, follow this matrix to transition the ticket to the appropriate team.

Support Tiers

Level  | Role            | Trigger Criteria
-------|-----------------|-----------------
L1     | Support Analyst | Initial triage, log gathering, pod restarts, and basic storage/network connectivity checks.
L2     | Cluster Admin   | Persistent node failures, SDN/Networking issues, or backend storage errors.
L3     | SRE / Architect | Cluster-wide performance degradation, API Server instability, or Etcd database corruption.
Vendor | Red Hat Support | Confirmed product bugs, core OpenShift platform failure, or as directed by L3.