Debugging 5XX: Using AWS ELB logs on a Kubernetes Deployment
Sree Venkat / 2025-02-22
Intro
I recently had a FastAPI application deployed on Kubernetes, with traffic routed from an AWS Application Load Balancer to the pods through a NodePort service. The problem: the client interacting with this application kept reporting HTTP 502 response codes on Sentry, while the FastAPI application's own logs showed no trace of any failure.
Where is the missing 502?
So there was evidence of HTTP 502s, yet nothing in the application logs and nothing from Sentry on the FastAPI side. Interestingly, the monitoring tab of the AWS load balancer showed an ELB 5XX trend that did not match the target 5XX trend, which indicated that the 502s were not originating from the target.
Load balancer access logs came to the rescue: the DevOps engineer I was working with had already enabled access logging on the load balancer.
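If access logging is not already switched on, it can be enabled as a load balancer attribute that points at an S3 bucket. Here is a minimal sketch using boto3; the load balancer ARN, bucket name, and prefix are placeholders, not the actual resources from this incident.

```python
import boto3

# Hypothetical identifiers -- replace with your own load balancer ARN and S3 bucket.
LB_ARN = "arn:aws:elasticloadbalancing:us-east-2:123456789012:loadbalancer/app/my-loadbalancer/50dc6c495c0c9188"
LOG_BUCKET = "my-alb-access-logs"

elbv2 = boto3.client("elbv2")

# ALB access logging is controlled through load balancer attributes.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn=LB_ARN,
    Attributes=[
        {"Key": "access_logs.s3.enabled", "Value": "true"},
        {"Key": "access_logs.s3.bucket", "Value": LOG_BUCKET},
        {"Key": "access_logs.s3.prefix", "Value": "my-loadbalancer"},
    ],
)
```

Note that the S3 bucket also needs a policy allowing the regional Elastic Load Balancing account to write to it, which is omitted here. With logging enabled, entries like the one below start landing in the bucket.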
```
https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 502 - 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337281-1d84f3d73c47ec4e58577259" "www.example.com" "arn:aws:acm:us-east-2:123456789012:certificate/12345678-1234-1234-1234-123456789012" 1 2018-07-02T22:22:48.364000Z "authenticate,forward" "-" "-" "10.0.0.1:80" "200" "-" "-" TID_123456
```
In the log entry above, right before the HTTP method, you will find a pair of space-separated status fields: `502 -`. The first is the ELB status code and the second is the target status code; a `-` for the target status code means the load balancer never got a response from the target. This served as evidence that the request did not even reach the target. It was happening because the NodePort was silently dropping some requests while the application was busy processing another request.
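A quick way to confirm the pattern across a whole set of access logs is to scan for entries where the ELB status code is a 5XX but the target status code is `-`. Below is a rough Python sketch along those lines; the log file path is a placeholder, and the field positions follow the documented ALB access log format.

```python
import shlex

# Hypothetical path to a downloaded (and gunzipped) ALB access log file.
LOG_FILE = "alb-access.log"

# In the ALB access log format, field 9 is elb_status_code and field 10 is
# target_status_code (1-indexed); as 0-indexed list positions that is 8 and 9.
ELB_STATUS, TARGET_STATUS = 8, 9

with open(LOG_FILE) as fh:
    for line in fh:
        fields = shlex.split(line)      # shlex handles the quoted fields
        if len(fields) <= TARGET_STATUS:
            continue                    # skip malformed or truncated lines
        elb_code, target_code = fields[ELB_STATUS], fields[TARGET_STATUS]
        # A 5XX from the ELB with no target status means the target never responded.
        if elb_code.startswith("5") and target_code == "-":
            print(line.rstrip())
```

Any line this prints is a request the load balancer answered on its own, without the target ever responding.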
Why NodePort Alone Wasn't Sufficient
When using a NodePort service directly behind the AWS load balancer, we ran into an architectural limitation. NodePort traffic is forwarded at the kernel level by iptables rules (programmed by kube-proxy), with no user-space proxy in between. Here's what happens:
- The load balancer sends a request to the NodePort.
- iptables immediately attempts to forward the connection to the target pod.
- If the target pod is busy or its connection queue is full:
  - the connection is dropped immediately,
  - no retry mechanism exists at this layer, and
  - the load balancer sees a connection failure.
- This is what surfaced as the 502 errors we observed.
Unlike an application-layer proxy such as Nginx, a NodePort has no request-queuing capability. It operates at the network layer (Layer 4) and makes an immediate forward-or-drop decision based on the current state of the target pod's network stack.
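The effect is easy to reproduce outside Kubernetes. The sketch below is a purely local illustration, not the author's setup: a listener with a tiny accept backlog that never accepts connections stands in for a busy pod, and an impatient client stands in for the load balancer. Once the backlog fills, new connection attempts simply fail; nothing queues or retries them.

```python
import socket

# A listener that never calls accept(), standing in for a pod that is busy
# serving another request. backlog=1 keeps the kernel accept queue tiny.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()

failures = 0
for _ in range(20):
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(0.5)      # an impatient caller, much like the load balancer
    try:
        client.connect((host, port))
    except OSError:             # timeout or reset: the connection was never serviced
        failures += 1
    finally:
        client.close()

print(f"{failures} of 20 connection attempts failed once the backlog filled up")
server.close()
```

The load balancer surfaces exactly this kind of failed connection attempt to the client as a 502.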
Mitigation
The errors were occurring because the NodePort path immediately dropped requests whenever the target application was busy processing other requests. To address this, we implemented two key changes.
First, we added Nginx as a reverse proxy between the load balancer and the application service. This helped because:
- Nginx's event-driven architecture allows it to queue and hold incoming requests
- Its connection pooling capabilities help manage concurrent connections more efficiently
- It provides better control over request buffering and timeouts
Second, we tuned the configuration parameters:
- Adjusted worker connections and keepalive settings in Nginx
- Set appropriate timeouts at both the load balancer and Nginx levels
- Configured proper backlog queue sizes for the NodePort
This setup proved more resilient because Nginx's event loop can hold and queue requests instead of dropping them the way the NodePort's direct iptables forwarding did. The NodePort behavior of immediately attempting to forward a request and dropping it when the pod is busy was replaced with Nginx's more sophisticated request handling.
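To sanity-check a change like this, a crude concurrency probe before and after is often enough: fire a burst of parallel requests at the public endpoint and tally the status codes. A minimal sketch, with a placeholder URL and burst size rather than the real service:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request

# Hypothetical endpoint and burst size -- adjust for your own service.
URL = "https://www.example.com/health"
BURST = 50

def probe(_):
    """Hit the endpoint once and return the HTTP status (or a failure label)."""
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code              # non-2xx responses, e.g. 502
    except OSError:
        return "connection error"

with ThreadPoolExecutor(max_workers=BURST) as pool:
    results = Counter(pool.map(probe, range(BURST)))

print(results)
```

If the mitigation is working, the tally should no longer contain 502s even under a burst that previously triggered them.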