Kubernetes Troubleshooting Guide

Debug Kubernetes like a pro in 2025. Learn systematic approaches to solving pod failures, networking issues, and resource problems with battle-tested solutions.

Updated 2025
20 min read
Intermediate-Advanced

The Systematic Approach to K8s Debugging

Kubernetes troubleshooting in 2025 requires a methodical approach. Instead of randomly running commands, follow this diagnostic flow:

  1. Check pod status: kubectl get pods
  2. Describe the problem resource: kubectl describe pod [name]
  3. Check logs: kubectl logs [pod]
  4. Check events: kubectl get events --sort-by='.lastTimestamp'
  5. Verify resources (CPU, memory, storage)
  6. Test networking if applicable

Common Kubernetes Issues & Solutions

Issue: CrashLoopBackOff

NAME         READY   STATUS             RESTARTS   AGE
my-app-xyz   0/1     CrashLoopBackOff   5          3m

What This Means

Your pod starts, crashes immediately, and Kubernetes keeps trying to restart it. After several failed attempts, K8s increases the delay between restarts (backoff), hence "CrashLoopBackOff".

Diagnostic Steps

1. Check the logs for errors:

kubectl logs my-app-xyz

# If the pod restarted, check previous logs
kubectl logs my-app-xyz --previous

This is the most important step. The logs usually tell you exactly what's wrong.

2. Describe the pod for more details:

kubectl describe pod my-app-xyz

Look at the "Events" section at the bottom. It shows the restart history and reasons.

3. Check the pod's exit code:

In the describe output, look for "Last State" → "Exit Code"

  • Exit Code 0: Process exited successfully (but the pod was expected to keep running)
  • Exit Code 1: Application error (check logs)
  • Exit Code 137: Killed by SIGKILL, i.e. 128 + 9 (usually OOMKilled - out of memory)
  • Exit Code 143: Terminated by SIGTERM, i.e. 128 + 15 (graceful shutdown)
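You can verify the signal arithmetic behind these codes locally: exit codes above 128 are 128 plus the number of the fatal signal, which is the same convention Kubernetes reports in "Last State":

```shell
# Exit code = 128 + signal number for processes killed by a signal.
sh -c 'kill -KILL $$'    # SIGKILL (9) - what the kernel OOM killer sends
echo "SIGKILL exit: $?"  # prints 137

sh -c 'kill -TERM $$'    # SIGTERM (15) - graceful shutdown request
echo "SIGTERM exit: $?"  # prints 143
```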

Common Causes & Fixes

1. Application Error

Your application code has a bug that causes it to crash immediately.

Fix:

Check the application logs. Fix the bug in your code and redeploy.

2. Missing Environment Variables

App crashes because required config (DATABASE_URL, API_KEY) is missing.

Fix:

# Check if env vars are set
kubectl describe pod my-app-xyz | grep -A 10 Environment

# Add missing vars to your deployment
kubectl set env deployment/my-app DATABASE_URL=postgres://...
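Rather than setting plain-text values imperatively, environment variables are usually declared in the deployment spec. A minimal sketch, assuming a secret named db-credentials with a url key (both names are illustrative):

```yaml
# Hypothetical container env block - pulls DATABASE_URL from a Secret
env:
- name: DATABASE_URL
  valueFrom:
    secretKeyRef:
      name: db-credentials   # assumed Secret name
      key: url               # assumed key inside the Secret
```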

3. OOMKilled (Out of Memory)

Pod uses more memory than its limit and gets killed by Kubernetes.

Fix:

# Increase memory limits in your deployment
resources:
  limits:
    memory: "512Mi"  # Increase this
  requests:
    memory: "256Mi"

4. Command/Args Misconfiguration

Wrong command or missing arguments in pod spec.

Fix:

# Verify the command
kubectl get pod my-app-xyz -o jsonpath='{.spec.containers[0].command}'

# Fix in deployment
command: ["node"]
args: ["server.js"]  # Make sure this is correct

Issue: ImagePullBackOff / ErrImagePull

NAME         READY   STATUS             RESTARTS   AGE
my-app-xyz   0/1     ImagePullBackOff   0          2m

What This Means

Kubernetes can't pull your container image from the registry. This could be authentication, network, or image naming issues.

Diagnostic Steps

1. Get detailed error:

kubectl describe pod my-app-xyz | grep -A 10 Events

2. Verify the image name:

kubectl get pod my-app-xyz -o jsonpath='{.spec.containers[0].image}'

Common Causes & Fixes

1. Image Doesn't Exist

Typo in image name, wrong tag, or image was never pushed.

Fix:

# Check if image exists
docker pull myregistry.com/my-app:v1.0.0

# Verify tag in deployment
kubectl set image deployment/my-app my-app=myregistry.com/my-app:v1.0.1

2. Private Registry Authentication

Image is private but no credentials provided.

Fix:

# Create Docker registry secret
kubectl create secret docker-registry regcred \
  --docker-server=myregistry.com \
  --docker-username=myuser \
  --docker-password=mypassword \
  --docker-email=my@email.com

# Reference in deployment
spec:
  imagePullSecrets:
  - name: regcred
  containers:
  - name: my-app
    image: myregistry.com/my-app:v1.0.0

3. Network/Firewall Issues

Cluster can't reach the registry due to network policies or firewall.

Fix:

# Test connectivity from a debug pod
kubectl run curl-test --image=curlimages/curl -it --rm -- sh
# Inside the pod:
curl -I https://myregistry.com

Issue: Pod Stuck in Pending State

NAME         READY   STATUS    RESTARTS   AGE
my-app-xyz   0/1     Pending   0          5m

What This Means

The pod can't be scheduled onto any node. The scheduler couldn't find a suitable node that meets the pod's requirements.

Diagnostic Steps

1. Check why it's pending:

kubectl describe pod my-app-xyz | grep -A 10 Events

Look for messages like "Insufficient cpu", "Insufficient memory", or "No nodes available"

2. Check node resources:

kubectl top nodes
kubectl describe nodes

Common Causes & Fixes

1. Insufficient Resources

No node has enough CPU or memory to run your pod.

Fix:

# Option 1: Reduce pod resource requests
resources:
  requests:
    memory: "128Mi"  # Reduce this
    cpu: "100m"      # Or this

# Option 2: Add more nodes to cluster
# Option 3: Remove unused pods to free resources
kubectl delete pod unused-pod

2. PVC (PersistentVolumeClaim) Not Available

Pod references a PersistentVolumeClaim that doesn't exist, is still Pending, or can't be bound to a suitable volume.

Fix:

# Check PVC status
kubectl get pvc

# If PVC is Pending, check storage class
kubectl get storageclass
kubectl describe pvc my-pvc
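If the PVC is missing entirely, a minimal claim looks like the sketch below. The storageClassName is an assumption - use a class that actually appears in `kubectl get storageclass`:

```yaml
# Minimal PVC sketch - adjust size and storage class to your cluster
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard   # assumed class name; check kubectl get storageclass
  resources:
    requests:
      storage: 1Gi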

3. Node Selector / Affinity Mismatch

Pod has node selector or affinity rules that no node satisfies.

Fix:

# Check pod's node selector
kubectl get pod my-app-xyz -o yaml | grep -A 5 nodeSelector

# Check node labels
kubectl get nodes --show-labels

# Either fix the selector or add labels to nodes
kubectl label nodes node1 disktype=ssd
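For reference, the selector in the pod spec must use exactly the same key/value as the node label (disktype=ssd here is the illustrative label from the command above):

```yaml
# Pod spec fragment - schedules only onto nodes labeled disktype=ssd
spec:
  nodeSelector:
    disktype: ssd
```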

4. Taints and Tolerations

All nodes are tainted and pod doesn't have matching tolerations.

Fix:

# Check node taints
kubectl describe nodes | grep Taints

# Add toleration to pod
spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"

Issue: Service Not Accessible / Connection Refused

Diagnostic Steps

1. Verify pods are running:

kubectl get pods -l app=my-app

2. Check service configuration:

kubectl get svc my-service
kubectl describe svc my-service

Look for "Endpoints" - should show pod IPs. If empty, selector is wrong.

3. Verify port configuration:

# Check if service port matches container port
kubectl get svc my-service -o yaml | grep -A 3 ports
kubectl get pod my-pod -o yaml | grep -A 5 ports

4. Test from inside the cluster:

# Create a debug pod
kubectl run curl-test --image=curlimages/curl -it --rm -- sh

# Test the service
curl http://my-service:8080
# Or test pod directly
curl http://[pod-ip]:8080

Common Fixes

1. Selector Mismatch

# Service selector must match pod labels
# Service:
spec:
  selector:
    app: my-app

# Pod must have:
metadata:
  labels:
    app: my-app

2. Wrong Port

# Service port must route to correct container port
spec:
  ports:
  - port: 80          # External port
    targetPort: 8080  # Must match container port

Issue: ConfigMap or Secret Not Loading

Diagnostic Steps

1. Verify ConfigMap/Secret exists:

kubectl get configmap
kubectl get secret
kubectl describe configmap my-config

2. Check if keys match:

# List keys in ConfigMap
kubectl get configmap my-config -o yaml

# Verify pod references correct keys
kubectl get pod my-pod -o yaml | grep -A 10 envFrom

3. Check the namespace:

# ConfigMaps/Secrets must be in same namespace as pod
kubectl get configmap -n my-namespace
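As a reference point, here is a hedged sketch of a ConfigMap and a pod consuming it via envFrom (all names are illustrative) - the key requirement is that the names and namespaces line up:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config            # must match the configMapRef below
  namespace: my-namespace
data:
  LOG_LEVEL: "info"
---
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  namespace: my-namespace    # same namespace as the ConfigMap
spec:
  containers:
  - name: app
    image: my-app:v1.0.0
    envFrom:
    - configMapRef:
        name: my-config      # every key becomes an env var in the container
```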

Common Issues

Important: If you update a ConfigMap or Secret, running pods don't pick up the change automatically. Environment variables are only read at container startup, and while volume-mounted ConfigMaps are eventually refreshed, most applications don't re-read them. Restart the pods to apply changes:

kubectl rollout restart deployment my-app
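A common pattern to avoid manual restarts is to hash the config into a pod-template annotation, so any config change rolls the pods automatically. A sketch (the annotation key and hash value are illustrative; tools like Helm's sha256sum template function automate this):

```yaml
# Deployment fragment - changing the annotation triggers a rolling update
spec:
  template:
    metadata:
      annotations:
        config-hash: "3f9a1c..."  # recompute from the ConfigMap contents on each change
```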

Essential Debugging Commands for 2025

Pod Debugging

kubectl get pods -A

List all pods in all namespaces

kubectl logs [pod] --previous

Logs from previous container

kubectl exec -it [pod] -- sh

Shell into running container

kubectl debug [pod] -it --image=busybox

Debug with an ephemeral container (beta since K8s 1.23, GA in 1.25)

Resource Debugging

kubectl top nodes

Node resource usage

kubectl top pods

Pod resource usage

kubectl get events --sort-by='.lastTimestamp'

Recent cluster events

kubectl get all -A

All resources in all namespaces

Quick Debugging Checklist

  • Check pod status and restarts
  • Review pod logs (current and previous)
  • Describe the pod for detailed events
  • Verify resource limits and requests
  • Check service endpoints and selectors
  • Verify ConfigMaps and Secrets exist
  • Test connectivity from a debug pod
  • Review recent events in the namespace