People tend to worry about performance way too early in an application's lifecycle. Without being able to run proper measurements, all we can do is speculate and theorize. That said, certain performance optimizations are really hard to perform late in development, so we can't be completely careless either. Performance begins with architectural common sense, and it develops into measurements, intricate optimizations, and carefully crafted test cases.

A keyword that commonly pops up in discussions about performance is scalability. It's so common, in fact, that it has almost become synonymous with performance. We like scale, we plan for scale, we design for scale, we want our apps to be able to support the hundreds of millions of users we're going to get. Scalability, however, while important, is only one small piece of an application's performance.

Scalability

Scalability doesn't mean how many users your application can scale up to. It is simply a measure of how adding hardware resources affects your application's performance.
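One way to put a number on this definition is to compare the relative gain in throughput with the relative gain in resources. The sketch below is only an illustration: the function name `scaling_factor` and the sample figures are made up, and throughput is assumed to be the performance measure of interest.

```python
def scaling_factor(throughput_before: float, throughput_after: float,
                   resources_before: float, resources_after: float) -> float:
    """Relative throughput gain divided by relative resource gain.

    1.0 means performance grew in step with the hardware (linear scaling);
    a value below 1.0 means diminishing returns.
    """
    throughput_gain = throughput_after / throughput_before
    resource_gain = resources_after / resources_before
    return throughput_gain / resource_gain


# Hypothetical numbers: going from 2 to 4 CPUs lifts throughput from 100 to 170 TPS.
print(scaling_factor(100, 170, 2, 4))  # 0.85
```

A result of 0.85 would mean the application got 85% of the benefit that perfectly linear scaling would have delivered.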

Vertical scalability means increasing the resources of your server (e.g. more CPUs, more memory). This should be easy to wrap your head around: it's basically a more powerful machine.

Horizontal scalability means adding more servers. This can be a bit trickier to understand than vertical scalability. How can an application run on multiple servers at the same time, you might ask. In most cases multiple servers are running copies of the same application, and the incoming request load is evenly distributed between these servers. The thing that does the distributing part is called a load balancer.
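To make the idea of distributing requests concrete, here is a minimal sketch of a round-robin load balancer. Round-robin is just one possible distribution strategy, chosen here for simplicity; the class name `RoundRobinBalancer` and the server names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands each incoming request to the next server in a fixed rotation."""

    def __init__(self, servers):
        # cycle() repeats the server list endlessly: a, b, c, a, b, c, ...
        self._servers = cycle(servers)

    def route(self, request):
        server = next(self._servers)
        return server, request


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
assignments = [balancer.route(f"req-{i}")[0] for i in range(6)]
print(assignments)
# ['app-1', 'app-2', 'app-3', 'app-1', 'app-2', 'app-3']
```

Each server ends up with the same number of requests. Real load balancers usually also take server health and current load into account.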

Let's discuss this in more depth in an upcoming article.

Response Time

This is the time it takes for the server to process a request. The request could be initiated by a client (e.g. a browser, a desktop app, or any other application with a user interface), or it could come from another server.
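As a rough illustration, response time can be captured by wrapping a request handler with a timer. The helper `timed` and the handler `handle_request` below are hypothetical stand-ins, not part of any framework.

```python
import time


def timed(handler, *args):
    """Call handler(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = handler(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


def handle_request(n):
    # Stand-in for real request processing.
    return sum(range(n))


result, seconds = timed(handle_request, 100_000)
print(f"response time: {seconds * 1000:.2f} ms")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock intended for measuring short durations.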

Load

Load is the measure of stress on a system, usually expressed as the number of users connected to it. Load mostly serves as context for another measurement we perform, like Response Time. For example, we could measure the Response Time of an API endpoint with 50 users and again with 5000 users. Load generally influences these measurements greatly, and the degree to which a measurement changes as load varies is called load sensitivity.
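One simple way to express load sensitivity as a number is the change in response time per additional user between two load levels. This is an assumed formulation for illustration only, and the measurements in the example are invented.

```python
def load_sensitivity(low_users, low_ms, high_users, high_ms):
    """Extra milliseconds of response time per additional connected user."""
    return (high_ms - low_ms) / (high_users - low_users)


# Hypothetical measurements: 120 ms at 50 users, 615 ms at 5000 users.
print(load_sensitivity(50, 120.0, 5000, 615.0))  # 0.1
```

A value close to zero would indicate a system whose response time barely changes as more users connect.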

Responsiveness

Responsiveness indicates how quickly the request is acknowledged by the system. If responsiveness is poor, users will get frustrated, even with great response times. I am sure you have experienced a case where you clicked a button that should trigger some complex action and weren't sure if anything happened or not, only to find out later that your request had been processed successfully. That feeling was the result of poor responsiveness.
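The difference between responsiveness and response time can be sketched by acknowledging a request immediately and doing the slow work in the background. The functions `submit` and `slow_job` are hypothetical, and a plain thread is used here only to keep the example small.

```python
import threading
import time


def slow_job(results, job_id):
    time.sleep(0.2)  # stand-in for a complex action
    results[job_id] = "done"


def submit(results, job_id):
    """Start the work in the background and acknowledge right away."""
    worker = threading.Thread(target=slow_job, args=(results, job_id))
    worker.start()
    return worker, f"job {job_id} accepted"


results = {}
start = time.perf_counter()
worker, ack = submit(results, 1)
ack_time = time.perf_counter() - start    # responsiveness: time until acknowledgement
worker.join()
total_time = time.perf_counter() - start  # response time: time until the work is finished

print(ack)
print(f"acknowledged after {ack_time * 1000:.1f} ms, finished after {total_time * 1000:.1f} ms")
```

The work takes just as long either way, but the user gets feedback almost instantly instead of staring at a button that seems to do nothing.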

Latency

Latency is usually a problem with remote systems. It measures the minimum time required to get a response from a server, even when no processing is required and no data is transferred. Latency generally increases with distance, and there's not much you can do to improve it from an application design perspective, other than minimise the number of server requests.
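To see why fewer requests help, consider a simplified model in which requests are sent one after another and every request pays the full latency once. The function name and all numbers below are assumptions chosen for illustration.

```python
def total_wait_ms(requests, latency_ms, items, processing_ms_per_item):
    """Total waiting time when requests are made sequentially."""
    return requests * latency_ms + items * processing_ms_per_item


items = 20
latency_ms = 80

# One request per item versus a single batched request for all items.
chatty = total_wait_ms(requests=items, latency_ms=latency_ms, items=items, processing_ms_per_item=2)
batched = total_wait_ms(requests=1, latency_ms=latency_ms, items=items, processing_ms_per_item=2)

print(chatty, batched)  # 1640 120
```

The processing cost is identical in both cases; the entire difference comes from paying the latency 20 times instead of once.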

Throughput

Throughput is how much work a system can do in a given timeframe. It is usually measured in TPS (transactions per second).
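A minimal sketch of measuring throughput: run a transaction repeatedly for a fixed window and divide the number of completed transactions by the elapsed time. `measure_tps` and `fake_transaction` are hypothetical names, and a real measurement would go against the actual system under realistic load.

```python
import time


def measure_tps(transaction, duration_s=0.2):
    """Run `transaction` repeatedly for about `duration_s` seconds; return transactions per second."""
    completed = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        transaction()
        completed += 1
    elapsed = time.perf_counter() - start
    return completed / elapsed


def fake_transaction():
    # Stand-in for a unit of real work.
    sum(range(1_000))


tps = measure_tps(fake_transaction)
print(f"{tps:.0f} TPS")
```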

Efficiency

The efficiency of a system is simply its performance divided by the resources it uses. Getting 50 TPS on one CPU is more efficient than getting 100 TPS on four, because that is 50 TPS per CPU versus 25.
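The comparison above works out as follows, assuming throughput as the performance measure and CPU count as the resource. The function name `efficiency` is made up for this example.

```python
def efficiency(tps: float, cpus: int) -> float:
    """Throughput delivered per CPU."""
    return tps / cpus


print(efficiency(50, 1))   # 50.0
print(efficiency(100, 4))  # 25.0
```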

Capacity

Capacity is the maximum throughput or load that a system can handle before breaking or significantly dipping in performance.
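Given a series of load-test measurements, capacity could be estimated as the highest load at which response time stays within an acceptable limit. The sketch below assumes such a limit exists and uses invented sample data; the function name `capacity` is hypothetical.

```python
def capacity(measurements, max_response_ms):
    """Return the highest load whose response time is still within the limit.

    `measurements` is a list of (users, response_ms) pairs.
    """
    supported = 0
    for users, response_ms in sorted(measurements):
        if response_ms > max_response_ms:
            break  # performance has dipped past the acceptable limit
        supported = users
    return supported


# Hypothetical load-test results: (connected users, response time in ms).
samples = [(50, 120), (500, 140), (2000, 210), (5000, 950), (8000, 4000)]
print(capacity(samples, max_response_ms=500))  # 2000
```

What counts as "significantly dipping" is a judgment call; here it is modelled as a fixed response-time threshold.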

Some of these are easier to measure and monitor than others, but a little bit of common sense coupled with an understanding of these terms goes a long way.

For those specific cases where performance should be meticulously planned and monitored, let's discuss the topic of Performance Driven Development in an upcoming article.
