Kubernetes Ingress Certificate Expiry: Don't Let Your Cluster Go Dark

You’ve built a robust application, deployed it to Kubernetes, and exposed it to the world through an Ingress. Traffic is flowing, users are happy, and everything is humming along. Then, suddenly, it's not. Users report NET::ERR_CERT_DATE_INVALID errors, and your service is effectively offline. The culprit? An expired TLS certificate on your Kubernetes Ingress.

This scenario is all too common, even for seasoned SREs and platform engineers. In the dynamic world of Kubernetes, where services come and go, and automation reigns supreme, it’s easy to assume certificates are just another component that "takes care of itself." But as anyone who's faced an unexpected outage knows, assumptions are the enemy of uptime. This article will dive into why Kubernetes ingress certificate expiry is a unique challenge, explore common pitfalls, and discuss how you can proactively prevent these disruptive events.

The Kubernetes Ingress Landscape

At its core, a Kubernetes Ingress is an API object that manages external access to services in a cluster, typically HTTP and HTTPS. The Ingress itself only declares the rules; an Ingress controller implements them, acting as a layer 7 load balancer that provides routing, traffic management, and, crucially, TLS termination.

When you configure an Ingress for HTTPS, you typically specify a tls section, which references a Kubernetes Secret containing your TLS certificate and private key. This Secret holds the tls.crt (certificate) and tls.key (private key) fields, Base64-encoded.

Consider a simple Ingress definition:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  tls:
  - hosts:
    - myapp.example.com
    secretName: my-app-tls-secret # This secret holds the certificate
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-service
            port:
              number: 80
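
Because each Ingress names its Secret under spec.tls, the Kubernetes API can tell you which Secrets actually back your HTTPS endpoints. The following is a minimal sketch, assuming kubectl is already pointed at the right cluster; the function name list_ingress_tls_secrets is made up for illustration and is not part of kubectl.

```bash
# Sketch: print the unique namespace/Secret pairs referenced by Ingress TLS blocks.
# list_ingress_tls_secrets is an illustrative helper name, not a kubectl feature.
list_ingress_tls_secrets() {
  # kubectl emits one line per Ingress: "<namespace><TAB><secret> <secret> ..."
  kubectl get ingress --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.spec.tls[*].secretName}{"\n"}{end}' |
    awk -F'\t' '
      # Skip Ingresses without TLS Secrets, then print one namespace/secret pair per name.
      $2 != "" {
        n = split($2, names, " ")
        for (i = 1; i <= n; i++) print $1 "/" names[i]
      }' |
    sort -u
}
```

Running list_ingress_tls_secrets prints entries such as default/my-app-tls-secret, which gives you a starting inventory of the certificates that matter for external traffic.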

The my-app-tls-secret is where your certificate lives. This secret can be populated in several ways:

  • Manually: You generate a certificate (e.g., with OpenSSL or a commercial CA), create a Kubernetes Secret from it, and reference it in your Ingress.
  • Automated with cert-manager: This is a widely used approach for managing certificates in Kubernetes. cert-manager automates the issuance and renewal of TLS certificates from issuers such as Let's Encrypt, HashiCorp Vault, or private CAs. It creates and updates the Secret for you.
  • Cloud Provider Integrations: Cloud providers like AWS (with the AWS Load Balancer Controller, formerly the ALB Ingress Controller) or GCP (with GKE Ingress) can integrate directly with their certificate management services (ACM, Certificate Manager) to provision and renew certificates, often abstracting away the Kubernetes Secret entirely.
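
For the manual path, the Secret is usually created from a PEM certificate and key on disk. Here is a small sketch, assuming you already have those two files; create_tls_secret is a hypothetical wrapper name, while kubectl create secret tls with --cert and --key is the standard command it calls.

```bash
# Sketch: create a kubernetes.io/tls Secret from local PEM files.
# create_tls_secret is an illustrative wrapper, not a kubectl built-in.
create_tls_secret() {
  local name="$1"
  local namespace="$2"
  local cert_file="$3"
  local key_file="$4"

  # Refuse to continue if either PEM file is missing or unreadable.
  if [ ! -r "$cert_file" ] || [ ! -r "$key_file" ]; then
    echo "certificate or key file is not readable" >&2
    return 1
  fi

  kubectl create secret tls "$name" \
    --namespace "$namespace" \
    --cert "$cert_file" \
    --key "$key_file"
}
```

For example, create_tls_secret my-app-tls-secret my-namespace tls.crt tls.key. Nothing in this path renews the certificate, so the expiry date becomes your responsibility.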

Regardless of how the certificate gets into the cluster or onto the load balancer, it will expire. And when it does, your users are locked out.

Why Cert Expiry is a Unique Challenge in Kubernetes

You might think, "I use cert-manager, so I'm safe!" Or, "My cloud provider handles it." While these tools significantly reduce the manual burden, they don't eliminate the need for proactive monitoring. Here’s why:

  1. Distributed Nature: A typical Kubernetes cluster might have dozens, even hundreds, of Ingress resources across multiple namespaces. If you manage multiple clusters, this problem scales linearly. Keeping track of every single certificate and its expiry date manually is a Sisyphean task.
  2. cert-manager Can Fail: cert-manager is excellent, but it's not infallible. Renewal failures can occur due to:
    • DNS propagation issues: Especially with wildcard certificates or complex DNS setups.
    • ACME challenge failures: Network issues, firewall rules, or misconfigured Ingress rules can prevent the ACME server from validating ownership.
    • API rate limits: Hitting Let's Encrypt rate limits during a mass renewal event.
    • Quota or permissions issues: cert-manager might lack permissions to update DNS records or create/update Secrets.
    • Bug in cert-manager or underlying Kubernetes components: Rare, but it happens.
     When cert-manager fails, it often leaves events and conditions on the Certificate resource, but you need to actively monitor those events or conditions to catch the problem before it's too late.
  3. Manual Certificates Are Often Overlooked: Not every certificate is managed by cert-manager. You might have:
    • Certificates for internal tools that aren't publicly accessible but still use HTTPS.
    • Certificates for services that use self-signed CAs or internal PKI.
    • Certificates imported from legacy systems.
     These "one-off" certificates are prime candidates for unexpected expiry because they lack the automation of cert-manager.
  4. Ephemeral Workloads & Dynamic Secrets: Standard host-based monitoring might not easily discover certificates embedded within Kubernetes Secrets. The certificates are not files on a host's filesystem in the traditional sense; they're data within the Kubernetes API.
  5. Ingress Controller Specifics: While the underlying problem is universal, different Ingress controllers (NGINX, Traefik, GKE Ingress, AWS Load Balancer Controller, Istio Gateways) might handle TLS secrets slightly differently. Some read certificates directly from Secrets, while others provision cloud load balancers that hold the certificate outside the cluster. A comprehensive monitoring solution needs to account for this diversity.
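
Point 2 notes that cert-manager records failures as conditions on the Certificate resource. One way to surface them is to list every Certificate whose Ready condition is not True. This is a sketch, assuming cert-manager's CRDs are installed; list_unready_certificates is an illustrative name, and the output format is arbitrary.

```bash
# Sketch: report cert-manager Certificates whose Ready condition is not "True".
# list_unready_certificates is an illustrative helper name.
list_unready_certificates() {
  # Columns: namespace, name, Ready condition status, status.notAfter
  kubectl get certificates.cert-manager.io --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.notAfter}{"\n"}{end}' |
    awk -F'\t' '
      # A missing Ready condition or notAfter value is reported as "unknown".
      NF && $3 != "True" {
        print $1 "/" $2 " ready=" ($3 == "" ? "unknown" : $3) " notAfter=" ($4 == "" ? "unknown" : $4)
      }'
}
```

Any line this prints is a Certificate worth a closer look with kubectl describe certificate, which shows the events behind the failure. It only covers certificates cert-manager knows about, so manually created Secrets still need a separate check.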

Common Approaches to Monitoring (and their Shortcomings)

Engineers often try to roll their own solutions before realizing the complexity involved. Let's look at some common attempts and why they might fall short at scale.

Manual Checks

The most basic approach involves manually inspecting secrets. You can get the expiry date of a certificate stored in a Kubernetes Secret like this:

# Replace 'my-app-tls-secret' and 'my-namespace' with your values
kubectl get secret my-app-tls-secret -n my-namespace -o jsonpath='{.data.tls\.crt}' | \
  base64 --decode | openssl x509 -noout -enddate

This command will output something like notAfter=Dec 31 12:00:00 2024 GMT. This works for one certificate, but imagine doing this for hundreds across multiple clusters. It's tedious, error-prone, and not scalable.
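
A raw date is hard to alert on, so it helps to turn the notAfter line into a number of days. The sketch below assumes GNU date (the -d flag is not portable to BSD or macOS date); days_until_expiry is a made-up helper name, and the optional second argument exists only to make the calculation repeatable.

```bash
# Sketch: convert an "openssl x509 -enddate" line into whole days remaining.
# Assumes GNU date; days_until_expiry is an illustrative helper name.
days_until_expiry() {
  local end_date="${1#notAfter=}"          # strip the "notAfter=" prefix if present
  local now_epoch="${2:-$(date -u +%s)}"   # optional fixed "now" (epoch seconds)
  local end_epoch

  end_epoch="$(date -u -d "$end_date" +%s)" || return 1

  # Integer division truncates toward zero; an expired certificate gives zero or a negative number.
  echo $(( (end_epoch - now_epoch) / 86400 ))
}
```

You can feed it the output of the command above, for example days_until_expiry "$(kubectl get secret ... | base64 --decode | openssl x509 -noout -enddate)". If you only need a yes/no answer, openssl x509 -checkend with a number of seconds exits non-zero when the certificate expires within that window.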

Scripting with kubectl and openssl

A step up from manual checks is to write a script that iterates through all Secrets of type kubernetes.io/tls in your cluster, extracts each certificate, checks its expiry, and potentially sends an alert.

```bash

#!/bin/bash

THRESHOLD_DAYS=30
NAMESPACE="default" # Or iterate through all namespaces

echo "Checking certificates in namespace: $NAMESPACE"

for SECRET_NAME in $(kubectl get secrets -n "$NAMESPACE" --field-selector type=kubernetes.io/tls -o jsonpath='{.items[*].metadata.name}'); do
  echo "Processing secret: $SECRET_NAME"
  CERT_DATA=$(kubectl get secret "$SECRET_NAME" -n "$NAMESPACE" -o jsonpath='{.data.tls\.