Autoscaling workers on queue depth, not CPU

21 November 2023

We had a set of background workers that read messages off a queue and processed them. The queue is Google Pub/Sub. The workers run on Kubernetes, and they were autoscaled by an HPA that watched CPU. On paper that sounds reasonable. In practice it scaled at the wrong times, and sometimes not at all.

The symptom was a growing backlog. Messages would pile up, the lag would climb into the tens of minutes, and the workers would just sit there at low CPU not scaling up. By the time anyone noticed, we were way behind.

A quick word on HPA

The Horizontal Pod Autoscaler watches a metric and adds or removes pods to keep that metric near a target. The common setup is CPU. You say “keep average CPU around 60%”, and if CPU goes above that the HPA adds pods, if it drops below it removes them.

CPU works well for request-serving services. More traffic means more CPU, more CPU means more pods, and the new pods take real load off the old ones. The signal and the work line up.

Why CPU is the wrong signal here

Queue-draining work breaks that link.

A worker pulling from Pub/Sub spends a lot of its time waiting. It waits on the network to pull a message, waits on a database call, waits on some downstream API. While it waits it uses almost no CPU. So you can have a huge backlog and workers that are busy but not CPU-busy. The HPA looks at CPU, sees it sitting at 20%, and concludes everything is fine. It does not scale up. Meanwhile the backlog keeps growing.

The reverse happens too. A burst of cheap messages can spike CPU for a moment and trigger scale-up even though there is barely any backlog, so you scale on noise.

The core issue is that CPU does not measure the thing we actually care about. We care about how far behind we are. CPU is a poor proxy for that.

Scaling on backlog instead

What we actually want is to scale on queue depth: the number of messages that have been published but not yet acknowledged. That tells us how much work is waiting. If it climbs, we are falling behind and need more workers. If it sits near zero, we have enough.

Pub/Sub exposes this. The metric is the number of undelivered (unacked) messages on a subscription. Kubernetes can scale on it as an external metric, meaning a metric that lives outside the cluster. You run an adapter that reads the value from Cloud Monitoring and feeds it to the HPA.

The HPA config then targets a per-pod backlog instead of CPU. The shape is something like this:

metrics:
  - type: External
    external:
      metric:
        name: pubsub.googleapis.com|subscription|num_undelivered_messages
        selector:
          matchLabels:
            resource.labels.subscription_id: my-subscription
      target:
        type: AverageValue
        averageValue: "100"

The way to read averageValue: 100 is “aim for about 100 undelivered messages per pod.” If there are 1000 messages waiting, the HPA wants roughly 10 pods. If there are 50, one pod is plenty. As the backlog grows, pod count grows with it, and the new pods pull messages off the same subscription so the backlog actually comes down.

You pick the target number based on how fast one pod drains messages and how much lag you can tolerate. A smaller number scales up more aggressively and keeps lag low but uses more pods. We tuned it by watching the lag during a normal busy period and adjusting.

What changed

After the switch, scale-up happened when the backlog grew, not when CPU happened to twitch. During a spike the workers fanned out, drained the queue, and scaled back down. The lag stopped being a thing we got paged about.

The lesson I took from it: autoscale on the metric that describes the work, not the metric that is easiest to grab. For a worker draining a queue, that metric is the queue depth.