Realistic needs for scale
Async frameworks are very popular right now. They promise greater scale but come with some difficulties in programming, so there is a trade-off: they make you work a little harder to make a function call at all, and, when your program has a mistake, you get less benefit from stack traces than you are used to.
An example of this trade-off can be seen on the Tokio home page, where they emphasize that a Tokio-based server can support 100k requests/second:
Building on top of Rust, Tokio provides a multi-threaded, work-stealing scheduler. Applications can process hundreds of thousands of requests per second with minimal overhead.
In my experience, the real needs for an API server are astronomically lower than this. It is common to want 100 requests/second at most, and—just as important for server reliability—10-20 concurrent requests at most. At those levels, concurrency is not where your problems are going to lie, and you are free to make things easier on the developer rather than improve scale.
I know these numbers from working on a lot of servers and from helping out on incident response. There are certain orders of magnitude that you see over and over as you look at graphs of the hundreds of servers a higher-end web site will have.
I think these common rates may be due to a few factors, at least one of which will apply to the majority of servers.
Incoming requests. Think about the requests coming in from users: pause a moment and consider what a user is actually doing and how that translates into server requests. If they are using a browser-based UI and occasionally clicking buttons that trigger a call to the server, then a typical rate from a single user is maybe one server request every 10 seconds, or 0.1 requests/second. If you have a server that handles 100 requests/second, then each server can support 1000 simultaneously active users, and you can add servers to support more.
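The back-of-envelope math above can be written out. The numbers are the article's examples; the function name is mine, purely for illustration:

```rust
/// How many active users one server can support, given its request budget
/// and how often a single user's UI actually hits the server.
/// (Function and parameter names are illustrative, not from any library.)
fn users_per_server(server_rps: u64, secs_between_user_requests: u64) -> u64 {
    // One user making one request every 10 seconds is 0.1 requests/second,
    // so a 100 req/s server supports 100 / 0.1 = 100 * 10 = 1000 users.
    server_rps * secs_between_user_requests
}

fn main() {
    assert_eq!(users_per_server(100, 10), 1000);
}
```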
Outgoing requests. While some API servers are compute-bound, the more common case is that most of the latency in an API server is spent waiting on calls to databases or other external services. Databases in particular do not replicate as well and are frequently the failure point in an incident, when some query runs slowly and other requests start piling up faster than they can be served. Databases are specialized and built for their role, but you still don’t want more than maybe 100-1000 concurrent requests on a database at a time. If you have 100 servers funneling calls to one database, which is not uncommon at scale, then you want each server to issue only ten or fewer requests at a time to that database.
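One way to enforce a "ten or fewer in-flight queries per server" budget is a counting semaphore wrapped around the database client. Rust's std has no semaphore type, so this sketch builds a minimal one from a mutex and condition variable; all names are mine and the sleep stands in for a real query:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore to cap in-flight database calls.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Semaphore { permits: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap(); // block until a permit is free
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

fn main() {
    let sem = Arc::new(Semaphore::new(10)); // per-server budget of 10
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    // 100 request threads all want the database at once...
    let handles: Vec<_> = (0..100)
        .map(|_| {
            let (sem, in_flight, peak) = (sem.clone(), in_flight.clone(), peak.clone());
            thread::spawn(move || {
                sem.acquire();
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(1)); // stand-in for the query
                in_flight.fetch_sub(1, Ordering::SeqCst);
                sem.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    // ...but the database never sees more than 10 at a time.
    assert!(peak.load(Ordering::SeqCst) <= 10);
}
```

The same pattern works regardless of whether the rest of the server is async or thread-per-request; the budget lives at the database-client boundary.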
Memory. If your server isn’t blocking on database calls or service-to-service calls, then it’s either really fast (< 10ms) or is doing something computational such as evaluating rules or computing derived metrics. If it’s doing a heavier computation, it often needs a significant amount of memory. It’s not the most common thing to have an issue with, but for servers that just plain do a lot of computation, you frequently want to keep them at 10-20 concurrent requests, and certainly under 100 concurrent requests, just so that they don’t run out of memory and catastrophically fail all of their request load.
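The memory ceiling is also simple division. The numbers below are illustrative, not from the article, but they show how quickly a heavy per-request footprint lands you in the 10-20 concurrent-request range:

```rust
/// Rough concurrency cap implied by a memory budget: how many requests
/// fit before the heap is exhausted. (Names and numbers are illustrative.)
fn max_concurrent_requests(heap_budget_mb: u64, per_request_mb: u64) -> u64 {
    heap_budget_mb / per_request_mb
}

fn main() {
    // A 4 GB heap with ~200 MB per heavy request caps you near 20,
    // which matches the 10-20 concurrent-request ballpark above.
    assert_eq!(max_concurrent_requests(4096, 200), 20);
}
```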
For a given server to want 1000 requests/second or more, it has to be an exception to all three of the above rules. Let’s consider what that looks like:
It’s either not driven by user activity, or you are handling a massive number of users on a single server for some reason. This can happen for Nginx or other proxy servers where you might have a smaller number of servers just because you can. It could happen for an Etcd server just because the consensus protocols work better when you have a small number of nodes (< 10).
It isn’t limited by its external requests. It doesn’t use a database, and if it’s waiting on other servers at all, there are enough of them to easily handle anything the server may throw at them. Nginx is an example again; it doesn’t directly block on a database call, and the servers it proxies to often have 10x to 100x as many instances as your Nginx servers.
It doesn’t use much memory per call. Nginx is somewhat of an example here, because it can limit its buffer sizes, forcing the server it is proxying to to simply send data more slowly if an end user cannot keep up.
All in all, when I implement an API server, I often start with a thread per request rather than an async framework, if I get to choose. Scale only matters up to a point, but development productivity almost always matters.
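For a sense of how little ceremony thread-per-request requires, here is a sketch using only Rust's standard library. The echo handler stands in for real request handling; this is a minimal illustration, not a production server:

```rust
use std::io::{Read, Write};
use std::net::{Shutdown, TcpListener, TcpStream};
use std::thread;

/// Echo back whatever the client sent; stands in for real request handling.
fn handle(mut stream: TcpStream) {
    let mut buf = Vec::new();
    if stream.read_to_end(&mut buf).is_ok() {
        let _ = stream.write_all(&buf);
    }
}

/// Thread-per-request: accept a connection, hand it a fresh OS thread.
/// Plain blocking calls, plain stack traces.
fn serve(listener: TcpListener) {
    for stream in listener.incoming().flatten() {
        thread::spawn(move || handle(stream));
    }
}

fn main() {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    thread::spawn(move || serve(listener));

    // Round-trip one request as a client.
    let mut client = TcpStream::connect(addr).unwrap();
    client.write_all(b"ping").unwrap();
    client.shutdown(Shutdown::Write).unwrap(); // signal EOF so the handler's read returns
    let mut reply = String::new();
    client.read_to_string(&mut reply).unwrap();
    assert_eq!(reply, "ping");
}
```

At 10-20 concurrent requests, the per-thread stack cost is negligible, and every handler is ordinary blocking code you can read top to bottom.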

