Troubleshooting OpenShift Container Platform 4.x: Authentication

Prerequisite - retrieve kubeconfig file to communicate with the API server of the cluster

Connect to the cluster as system:admin using the kubeconfig file generated at installation time, as described in the solution OpenShift 4 system:admin kubeconfig file.
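
For reference, once the file is retrieved, pointing oc at it is enough to authenticate as system:admin; a minimal sketch, assuming the default installation directory layout:

Raw

$ export KUBECONFIG=<installation_directory>/auth/kubeconfig
$ oc whoami
system:admin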

Unable to login due to EOF

The EOF error is a symptom of underlying components not working as expected and affecting the behaviour of the authentication cluster operator. The most common causes are CSRs in Pending status or a CNI that is not healthy. The error message from the CLI looks like this:

Raw

$ oc login https://api.<clustername>.<domain>:6443 --insecure-skip-tls-verify=true  -v=10
...
F1106 10:40:33.771570    6911 helpers.go:116] error: EOF

Pending CSR

Follow the prerequisite step to interact with the cluster.
Then, verify whether any CSR in Pending status is present:

Raw

$ oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-244x8   13h     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending

If CSRs in Pending status are present, approve all of them with the following command:

Raw

$ oc adm certificate approve <CSR name>

NOTE: Check again after approval if any new CSR is generated.
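
If many CSRs are Pending, they can all be approved in one pass; a minimal sketch, where the go-template selects only the requests that have no status set yet:

Raw

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve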

Networking issues - router pods are not running

Follow the prerequisite step to interact with the cluster. If no CSRs are waiting to be approved, the issue can be caused by the network components. So, verify the status of the router pods: they must be in Running status without restarts, as in the following example:

Raw

$ oc get pods -n openshift-ingress
NAME                              READY   STATUS    RESTARTS   AGE
router-default-75d55f4fb4-c8dzg   1/1     Running   0          30h
router-default-75d55f4fb4-fd7gz   1/1     Running   0          30h

If their status is different from Running or if they have lots of restarts, describe the pods to look for any explicit error:

Raw

$ oc describe pods <router pod name> -n openshift-ingress

Or look at the pod logs:

Raw

$ oc logs <router pod name> -n openshift-ingress
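
It can also be useful to check whether the ingress cluster operator and the default ingress controller report an explicit error; a short sketch:

Raw

$ oc get co ingress
$ oc describe ingresscontroller default -n openshift-ingress-operator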

Networking issues - SDN not healthy

Follow the prerequisite step to interact with the cluster. Then, verify the status of the SDN pods: they must be in Running status without restarts, as in the following example:

Raw

$ oc get pods -n openshift-sdn
NAME                   READY   STATUS    RESTARTS   AGE
ovs-4sbkl              1/1     Running   0          30h
ovs-5t5pn              1/1     Running   0          30h
ovs-dzrcc              1/1     Running   0          30h
ovs-fbqbs              1/1     Running   0          30h
ovs-gc2gh              1/1     Running   0          30h
ovs-mqbqp              1/1     Running   0          30h
sdn-88p7h              2/2     Running   0          30h
sdn-cjz4p              2/2     Running   0          30h
sdn-controller-5nkxv   1/1     Running   0          30h
sdn-controller-c92cn   1/1     Running   0          30h
sdn-controller-f6hbt   1/1     Running   0          30h
sdn-gskgx              2/2     Running   0          30h
sdn-mzpnx              2/2     Running   0          30h
sdn-q5pmh              2/2     Running   0          30h
sdn-qw6pt              2/2     Running   0          30h

If one or more of them are in CrashLoopBackOff status, first verify whether the solution OVS and SDN Pods in CrashLoopbackOff after upgrade to OpenShift 4.6 applies. Otherwise, follow the solution Troubleshooting OpenShift Container Platform 4.x: openshift-sdn to address the issue.
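
The network cluster operator summarizes the overall CNI health, so its conditions are worth a look as well; a short sketch:

Raw

$ oc get co network
$ oc describe co network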

Authentication Cluster Operator

Follow the prerequisite step to interact with the cluster. If the previous steps did not fix the issue, verify the status of the authentication pods: they must be in Running status without restarts:

Raw

$ oc get pods -n openshift-authentication
NAME                              READY   STATUS    RESTARTS   AGE
oauth-openshift-bf6f48656-4r6mz   1/1     Running   0          9m4s
oauth-openshift-bf6f48656-gztld   1/1     Running   0          8m58s

If their status is different from Running or if they have lots of restarts, describe the pods to look for any explicit error:

Raw

$ oc describe pods <oauth pod name> -n openshift-authentication

Or look at the pod logs:

Raw

$ oc logs <oauth pod name> -n openshift-authentication

Also verify the status of the authentication route; it must be present, as shown in this example:

Raw

$ oc get route -n openshift-authentication
NAME                                       HOST/PORT                                     PATH   SERVICES          PORT   TERMINATION            WILDCARD
route.route.openshift.io/oauth-openshift   oauth-openshift.apps.<clustername>.<domain>          oauth-openshift   6443   passthrough/Redirect   None
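
If the route is present, the OAuth endpoint can also be probed directly; a minimal sketch, assuming the standard /healthz endpoint served by oauth-openshift:

Raw

$ curl -k https://oauth-openshift.apps.<clustername>.<domain>/healthz
ok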

If the route is missing and the steps described in the Pending CSR or Networking issues sections were already done, look at the authentication cluster operator for any explicit error:

Raw

$ oc describe co authentication

$ oc logs -n openshift-authentication-operator $(oc get po -o name -n openshift-authentication-operator)
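
To isolate just the failure message, the operator's Degraded condition can be filtered with jsonpath; a short sketch:

Raw

$ oc get co authentication -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'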

Unable to login due to expired certificate

If the ingress certificate has expired, it is no longer possible to access the cluster via the OpenShift Container Platform web console or the OpenShift CLI (oc), and an error similar to the following is returned:

Raw

$ oc login -u kubeadmin https://api.cluster.example.com:6443
error: x509: certificate has expired or is not yet valid: current time 2021-10-21T08:33:38+01:00 is after 2021-09-20T19:48:38Z
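
The expiry can be confirmed from outside the cluster by inspecting the certificate presented by the ingress router; a minimal sketch with openssl, using the console route hostname as an example target:

Raw

$ echo | openssl s_client -connect console-openshift-console.apps.<clustername>.<domain>:443 -servername console-openshift-console.apps.<clustername>.<domain> 2>/dev/null | openssl x509 -noout -dates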

To fix this issue, first follow the Prerequisite - retrieve kubeconfig file to communicate with the API server of the cluster to gain cluster administrator access.

Then, determine how the ingress controller is configured. If the following command returns empty output:

Raw

$ oc get ingresscontroller -n openshift-ingress-operator -o jsonpath='{.items[].spec.defaultCertificate}{"\n"}'

This means that the default ingress certificate is in use and the section Default ingress certificate below should be followed.

Otherwise, if the output contains the name of a secret, like this:

Raw

$ oc get ingresscontroller -n openshift-ingress-operator -o jsonpath='{.items[].spec.defaultCertificate}{"\n"}'
{"name":"custom-cert"}

This means that a custom ingress certificate is in use and the section Custom ingress certificate below should be followed.

Default ingress certificate

The default ingress certificate is not automatically rotated because it is expected to be replaced after cluster installation, as stated in the product documentation. To rotate it manually, it is sufficient to follow this solution: How to redeploy the default ingress certificate in OCP 4.
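
To verify the result, the validity window of the in-cluster certificate can be checked; a sketch, assuming the default secret name router-certs-default in the openshift-ingress namespace:

Raw

$ oc -n openshift-ingress get secret router-certs-default -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate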

Custom ingress certificate

In the case of a custom certificate, the certificate must be replaced with a new one by following the documentation: Replacing the default ingress certificate.
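
Since the ingress controller already references the secret (custom-cert in the example above), recreating that secret with the renewed certificate is the core of the procedure; a hedged sketch, with tls.crt and tls.key as hypothetical file names for the new certificate and key:

Raw

$ oc -n openshift-ingress delete secret custom-cert
$ oc -n openshift-ingress create secret tls custom-cert --cert=tls.crt --key=tls.key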
