Kubernetes Troubleshooting Guide
Debug Kubernetes like a pro in 2025. Learn systematic approaches to solving pod failures, networking issues, and resource problems with battle-tested solutions.
The Systematic Approach to K8s Debugging
Kubernetes troubleshooting in 2025 requires a methodical approach. Instead of randomly running commands, follow this diagnostic flow (a short script bundling these steps appears after the list):
- Check pod status: kubectl get pods
- Describe the problem resource: kubectl describe pod [name]
- Check logs: kubectl logs [pod]
- Check events: kubectl get events --sort-by='.lastTimestamp'
- Verify resources (CPU, memory, storage)
- Test networking if applicable
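To make the flow repeatable, here is a minimal bash sketch that runs the first five steps in order. The script name, argument handling, and namespace default are assumptions for illustration, not standard tooling:

#!/usr/bin/env bash
# triage.sh - first-pass pod diagnostics (hypothetical helper)
# Usage: ./triage.sh <pod-name> [namespace]
set -euo pipefail
POD="$1"
NS="${2:-default}"

kubectl get pod "$POD" -n "$NS"                                        # 1. pod status
kubectl describe pod "$POD" -n "$NS" | tail -n 25                      # 2. describe (events at the bottom)
kubectl logs "$POD" -n "$NS" --tail=50 || true                         # 3. recent logs (tolerate pods that never started)
kubectl get events -n "$NS" --sort-by='.lastTimestamp' | tail -n 10    # 4. recent events
kubectl top pod "$POD" -n "$NS" || echo "metrics-server not installed" # 5. resource usage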
Common Kubernetes Issues & Solutions
Issue: CrashLoopBackOff
NAME         READY   STATUS             RESTARTS   AGE
my-app-xyz   0/1     CrashLoopBackOff   5          3m
What This Means
Your pod starts, crashes, and Kubernetes keeps trying to restart it. Each failed attempt increases the delay before the next restart (exponential backoff, capped at five minutes), hence "CrashLoopBackOff".
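You can watch the loop and the growing backoff delay live; the pod name below comes from the example output above:

# -w streams status updates; watch RESTARTS climb and STATUS flip
# between Running and CrashLoopBackOff
kubectl get pod my-app-xyz -w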
Diagnostic Steps
1. Check the logs for errors:
kubectl logs my-app-xyz

# If the pod restarted, check previous logs
kubectl logs my-app-xyz --previous

This is the most important step. The logs usually tell you exactly what's wrong.
2. Describe the pod for more details:
kubectl describe pod my-app-xyz

Look at the "Events" section at the bottom. It shows the restart history and reasons.
3. Check the pod's exit code:
In the describe output, look for "Last State" → "Exit Code"
- Exit Code 0: Successful (but the pod should stay running)
- Exit Code 1: Application error (check logs)
- Exit Code 137: Pod killed (OOMKilled, out of memory)
- Exit Code 143: Graceful termination (SIGTERM)
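If you would rather pull the exit code directly than scan the describe output, jsonpath works too (assuming a single-container pod):

# Prints the exit code of the first container's last termination
kubectl get pod my-app-xyz -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'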
Common Causes & Fixes
1. Application Error
Your application code has a bug that causes it to crash immediately.
Fix:
Check the application logs. Fix the bug in your code and redeploy.
2. Missing Environment Variables
App crashes because required config (DATABASE_URL, API_KEY) is missing.
Fix:
# Check if env vars are set
kubectl describe pod my-app-xyz | grep -A 10 Environment
# Add missing vars to your deployment
kubectl set env deployment/my-app DATABASE_URL=postgres://...

3. OOMKilled (Out of Memory)
Pod uses more memory than its limit and gets killed by Kubernetes.
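You can confirm the OOM kill before touching limits by reading the last termination reason (again assuming a single-container pod):

# Prints "OOMKilled" if the container exceeded its memory limit
kubectl get pod my-app-xyz -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'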
Fix:
# Increase memory limits in your deployment
resources:
  limits:
    memory: "512Mi"   # Increase this
  requests:
    memory: "256Mi"

4. Command/Args Misconfiguration
Wrong command or missing arguments in pod spec.
Fix:
# Verify the command
kubectl get pod my-app-xyz -o jsonpath='{.spec.containers[0].command}'
# Fix in deployment
command: ["node"]
args: ["server.js"]   # Make sure this is correct

Issue: ImagePullBackOff / ErrImagePull
NAME         READY   STATUS             RESTARTS   AGE
my-app-xyz   0/1     ImagePullBackOff   0          2m
What This Means
Kubernetes can't pull your container image from the registry. This could be authentication, network, or image naming issues.
Diagnostic Steps
1. Get detailed error:
kubectl describe pod my-app-xyz | grep -A 10 Events

2. Verify the image name:
kubectl get pod my-app-xyz -o jsonpath='{.spec.containers[0].image}'

Common Causes & Fixes
1. Image Doesn't Exist
Typo in image name, wrong tag, or image was never pushed.
Fix:
# Check if image exists
docker pull myregistry.com/my-app:v1.0.0
# Verify tag in deployment
kubectl set image deployment/my-app my-app=myregistry.com/my-app:v1.0.1

2. Private Registry Authentication
Image is private but no credentials provided.
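Before creating new credentials, check whether the pod already references a pull secret; the pod and secret names here are from this example:

# Empty output means no imagePullSecrets are set on the pod
kubectl get pod my-app-xyz -o jsonpath='{.spec.imagePullSecrets}'
# If one is referenced, verify the secret actually exists
kubectl get secret regcred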
Fix:
# Create Docker registry secret
kubectl create secret docker-registry regcred \
  --docker-server=myregistry.com \
  --docker-username=myuser \
  --docker-password=mypassword \
  --docker-email=my@email.com
# Reference in deployment
spec:
  imagePullSecrets:
  - name: regcred
  containers:
  - name: my-app
    image: myregistry.com/my-app:v1.0.0

3. Network/Firewall Issues
Cluster can't reach the registry due to network policies or firewall.
Fix:
# Test connectivity from a debug pod
kubectl run curl-test --image=curlimages/curl -it --rm -- sh
# Inside the pod:
curl -I https://myregistry.com

Issue: Pod Stuck in Pending State
NAME         READY   STATUS    RESTARTS   AGE
my-app-xyz   0/1     Pending   0          5m
What This Means
The pod can't be scheduled onto any node. The scheduler couldn't find a suitable node that meets the pod's requirements.
Diagnostic Steps
1. Check why it's pending:
kubectl describe pod my-app-xyz | grep -A 10 Events

Look for messages like "Insufficient cpu", "Insufficient memory", or "No nodes available".
2. Check node resources:
kubectl top nodes
kubectl describe nodes

Common Causes & Fixes
1. Insufficient Resources
No node has enough CPU or memory to run your pod.
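To see how much each node has already promised to pods, compare the "Allocated resources" section against node capacity:

# Per-node summary of CPU/memory requests and limits vs capacity
kubectl describe nodes | grep -A 8 "Allocated resources"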
Fix:
# Option 1: Reduce pod resource requests
resources:
  requests:
    memory: "128Mi"   # Reduce this
    cpu: "100m"       # Or this

# Option 2: Add more nodes to the cluster
# Option 3: Remove unused pods to free resources
kubectl delete pod unused-pod

2. PVC (Persistent Volume) Not Available
Pod requires a volume that doesn't exist or is already bound.
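For comparison, a minimal PVC a pod could bind to; the storageClassName here is an assumption, so substitute one returned by kubectl get storageclass:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: standard   # assumption - use a class that exists in your cluster
  resources:
    requests:
      storage: 1Gi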
Fix:
# Check PVC status
kubectl get pvc
# If PVC is Pending, check storage class
kubectl get storageclass
kubectl describe pvc my-pvc

3. Node Selector / Affinity Mismatch
Pod has node selector or affinity rules that no node satisfies.
Fix:
# Check pod's node selector
kubectl get pod my-app-xyz -o yaml | grep -A 5 nodeSelector
# Check node labels
kubectl get nodes --show-labels
# Either fix the selector or add labels to nodes
kubectl label nodes node1 disktype=ssd

4. Taints and Tolerations
All nodes are tainted and pod doesn't have matching tolerations.
Fix:
# Check node taints
kubectl describe nodes | grep Taints
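# Alternatively, remove the taint instead of tolerating it
# (key/value from this example; the trailing "-" deletes the taint)
kubectl taint nodes node1 key1=value1:NoSchedule-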
# Add toleration to pod
spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"

Issue: Service Not Accessible / Connection Refused
Diagnostic Steps
1. Verify pods are running:
kubectl get pods -l app=my-app

2. Check service configuration:
kubectl get svc my-service
kubectl describe svc my-service

Look for "Endpoints" - it should show pod IPs. If it's empty, the selector is wrong.
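The Endpoints object gives the same answer directly; the service name is from this example:

# An empty ENDPOINTS column means the selector matches no ready pods
kubectl get endpoints my-service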
3. Verify port configuration:
# Check if service port matches container port
kubectl get svc my-service -o yaml | grep -A 3 ports
kubectl get pod my-pod -o yaml | grep -A 5 ports

4. Test from inside the cluster:
# Create a debug pod
kubectl run curl-test --image=curlimages/curl -it --rm -- sh
# Test the service
curl http://my-service:8080
# Or test pod directly
curl http://[pod-ip]:8080

Common Fixes
1. Selector Mismatch
# Service selector must match pod labels
# Service:
spec:
  selector:
    app: my-app

# Pod must have:
metadata:
  labels:
    app: my-app

2. Wrong Port
# Service port must route to correct container port
spec:
  ports:
  - port: 80          # External port
    targetPort: 8080  # Must match container port

Issue: ConfigMap or Secret Not Loading
Diagnostic Steps
1. Verify ConfigMap/Secret exists:
kubectl get configmap
kubectl get secret
kubectl describe configmap my-config

2. Check if keys match:
# List keys in ConfigMap
kubectl get configmap my-config -o yaml
# Verify pod references correct keys
kubectl get pod my-pod -o yaml | grep -A 10 envFrom

3. Check the namespace:
# ConfigMaps/Secrets must be in same namespace as pod
kubectl get configmap -n my-namespace

Common Issues
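A frequent mismatch is a pod referencing the ConfigMap by the wrong name. For comparison, a minimal consuming snippet using the names from the examples above:

spec:
  containers:
  - name: my-app
    envFrom:
    - configMapRef:
        name: my-config   # must match the ConfigMap's metadata.name exactly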
Important: If you update a ConfigMap or Secret, running pods don't automatically pick up the change: environment variables are fixed at container start, and most apps read volume-mounted config only once. Restart the pods to reload:
kubectl rollout restart deployment my-app

Essential Debugging Commands for 2025
Pod Debugging
kubectl get pods -A                        List all pods in all namespaces
kubectl logs [pod] --previous              Logs from previous container
kubectl exec -it [pod] -- sh               Shell into running container
kubectl debug [pod] -it --image=busybox    Debug with ephemeral container (K8s 1.23+)
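kubectl debug deserves a fuller example, since it works even when the target container has no shell; the pod and container names are assumptions:

# Attach an ephemeral busybox container sharing the app container's process namespace
kubectl debug my-app-xyz -it --image=busybox --target=my-app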
Resource Debugging
kubectl top nodes                              Node resource usage
kubectl top pods                               Pod resource usage
kubectl get events --sort-by='.lastTimestamp'  Recent cluster events
kubectl get all -A                             All resources in all namespaces