Key Performance Indicators (KPIs) are essentially gold mines of information for understanding where bottlenecks are occurring within your application’s deployment.
KPIs illustrate a performance story by filling in resource usage details. You will want measurable KPIs embedded throughout the entire deployment, end to end.
KPIs are measurable counters that trend either directly or inversely with the workload. For example, a hardware KPI is the CPU usage of a web server: the metered CPU usage increases as more users log in and actively use the application.
Certain KPIs give you the front end performance of an application. These are valuable in determining overall scalability.
Embedded or monitored KPIs give you the details needed to identify bottlenecks and the resources which are limiting scalability.
No two sets of KPIs are alike. No two apps are built on the same code with the same configurations and the same resources, so the hunt for KPIs commences with every new application.
Front End KPIs
All of the basic load testing solutions out there have the typical front end KPIs: TPS (transactions per second), response time, and user load.
What do Front End KPIs reveal?
Transactions within a load script are isolated with a tag, meaning that the transaction is surrounded by a start and a stop marker and named appropriately (e.g. Login). TPS reveals how many of these transactions were executed per one-second interval. Each business transaction can consist of multiple requests, so one transaction can result in several (or a fraction of) hits per second. This means that TPS and hits per second tell a similar story, but hits per second is the more granular KPI.
The response time of transactions is determined by the time span from the first byte of the request sent to the last byte of the response received.
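To make these two definitions concrete, here is a minimal Python sketch that derives TPS and average response time from a list of timed transactions. The record layout (name, start, end in seconds) and the sample values are assumptions for illustration; your load tool will export its own format.

```python
from collections import defaultdict

# Hypothetical transaction records: (name, start_seconds, end_seconds).
SAMPLES = [
    ("Login", 0.10, 0.40),
    ("Login", 0.50, 0.90),
    ("Search", 0.20, 1.30),
    ("Login", 1.10, 1.60),
]


def tps_per_second(records):
    """Count the transactions that completed in each one-second interval."""
    counts = defaultdict(int)
    for _name, _start, end in records:
        counts[int(end)] += 1  # bucket by the second in which the stop was recorded
    return dict(counts)


def mean_response_time(records, name):
    """Average (end - start) for one named transaction."""
    durations = [end - start for n, start, end in records if n == name]
    return sum(durations) / len(durations)


print(tps_per_second(SAMPLES))  # {0: 2, 1: 2}
```

Bucketing by completion time is just one choice; some tools bucket by start time instead, so check which convention yours uses before comparing numbers.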
In a perfectly scalable application, as the workload increases (in other words, more users), the TPS increases linearly and the response time of all scripted transactions remains roughly constant.
But no application has resources which are infinite.
So, after a bottleneck has been encountered (key word is after), the following behavior ensues:
The TPS decreases and the response time increases.
With these three basic Front End KPIs (TPS, response time, and user load), you can deliver results as to how much workload the application can handle. You can even report at what specific workload the application begins to experience degraded response times and at what workload the application becomes completely unresponsive.
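As a rough illustration of that kind of report, the sketch below scans per-load-step results for the first user load where response time degrades and the first where TPS starts to fall. The data, the 1.5x tolerance, and the function names are all assumptions made for this example, not values from any particular tool.

```python
# Hypothetical results, one row per load step: (users, tps, avg_response_seconds).
STEPS = [
    (50, 25.0, 0.40),
    (100, 50.0, 0.41),
    (150, 74.0, 0.43),
    (200, 80.0, 0.95),
    (250, 62.0, 2.80),
]


def first_degraded_load(steps, tolerance=1.5):
    """Return the first user load whose response time exceeds tolerance x the baseline."""
    baseline = steps[0][2]  # response time at the lightest load step
    for users, _tps, resp in steps:
        if resp > baseline * tolerance:
            return users
    return None


def first_tps_drop(steps):
    """Return the first user load at which TPS falls below the previous step."""
    for (_, prev_tps, _), (users, tps, _) in zip(steps, steps[1:]):
        if tps < prev_tps:
            return users
    return None


print(first_degraded_load(STEPS))  # 200
print(first_tps_drop(STEPS))       # 250
```

The tolerance is a judgment call; pick one that matches what your stakeholders consider a degraded response.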
Now, let’s start adding more KPIs to your test harness, which will tell the entire story about the application’s scalability limitations.
KPI Wealth: Gold Mines…
I call the next set of KPIs gold mines because they are worth the price of gold AND their weight in gold… These are the KPIs which will expose the root cause of scalability limitations.
An infrastructure architectural diagram is a requirement; if it doesn’t already exist, please create it. You will need an architectural diagram of the entire infrastructure: bare metal machines, VMs, web servers, application servers, messaging systems, databases, cluster layouts, load balancers, etc. Without this diagram, you are blind to the deployment. Every server, hardware or software, from the front end to the back end needs to be on this diagram. The visual diagram will serve as the drawing board for understanding business transaction flows, determining the monitoring requirements and visualizing possible scalability limitations.
Be the Ball!
Next, align business transactions with this architectural diagram. In other words, understand how each business transaction executes in terms of servers in the deployment. To do this, I like to imagine a ball bouncing down and back up the deployment stack. A request comes into the load balancer, bounces to a web server, bounces to an app server, and so on.
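One lightweight way to capture the result of this exercise is a plain mapping from each business transaction to the servers it bounces through. The sketch below is only an illustration; the transaction names and tier names are made up, and your diagram dictates the real ones.

```python
# Hypothetical mapping of business transactions to the servers they pass through.
FLOWS = {
    "Login": ["load_balancer", "web", "app", "db"],
    "ViewStaticHelp": ["load_balancer", "web"],
    "SubmitOrder": ["load_balancer", "web", "app", "messaging", "db"],
}


def servers_to_monitor(flows):
    """Every server touched by at least one transaction, in first-seen order."""
    seen = []
    for path in flows.values():
        for server in path:
            if server not in seen:
                seen.append(server)
    return seen


def transactions_touching(flows, server):
    """Names of the transactions whose flow includes the given server."""
    return sorted(name for name, path in flows.items() if server in path)


print(servers_to_monitor(FLOWS))
# ['load_balancer', 'web', 'app', 'db', 'messaging']
print(transactions_touching(FLOWS, "db"))
# ['Login', 'SubmitOrder']
```

The first function gives you the monitoring checklist; the second tells you which transactions to look at when one server becomes suspect.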
KPI Wealth: Gold Mines…Infrastructure Monitoring
Every server that the data travels through during its transaction flow requires a monitor. A server can be either a hardware server or a software server. A Java application server, for example, is a software server.
There are two very illuminating KPIs for every “server”: hit rates and free resources. Let’s first discuss hit rate. The hit rate will trend with the workload. As the workload increases, so does the hit rate.
Here are examples of hit rate KPIs:
OS (all 3 machines): TCP connection rate
Web server: requests per second
Messaging: enqueue/dequeue count
DB: queries per second
Remember every deployment is unique so you will need to decide what qualifies as a good hit rate per server and hook up the required monitoring.
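Many servers expose these values as ever-increasing totals rather than as rates. Assuming you can sample such a cumulative counter at known timestamps, a hit rate can be derived as in this minimal sketch (the sample values are invented for illustration):

```python
def rates_from_counter(samples):
    """Turn (timestamp_seconds, cumulative_count) samples into per-second rates.

    Intervals where the counter went backwards (for example after a restart)
    or where time did not advance are skipped rather than guessed at.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 > t0 and c1 >= c0:
            rates.append((t1, (c1 - c0) / (t1 - t0)))
    return rates


# Hypothetical "total requests served" readings taken every 10 seconds.
readings = [(0, 1000), (10, 1500), (20, 2300), (30, 100)]
print(rates_from_counter(readings))  # [(10, 50.0), (20, 80.0)]
```

The last reading simulates a counter reset, which is why no rate is reported for it.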
The next set of KPIs are free resources. I tend to use free resources instead of used resources because free resource metrics trend inversely with the workload, which makes bottlenecks visually easier to identify on a graph. However, sometimes a free counter is not available for a resource. That’s OK; use the used metric instead.
Also, if a target resource has queuing strategies, be sure to add these queuing metrics because they will indicate the exhaustion of a free resource.
Using a typical web deployment again:
OS: CPU average idle
Web server: waiting requests
App server: free worker threads
Messaging: enqueue/dequeue wait time
DB: free connections in thread pool
In environments with clusters of the same server, add these monitors to every node of the cluster.
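Once a free-resource counter is collected per node, spotting the exhaustion point is straightforward. The sketch below assumes you have (user load, free count) samples for each node; the node names, values, and the `floor` parameter are hypothetical.

```python
# Hypothetical samples: (users, free worker threads) per cluster node.
FREE_THREADS = {
    "app-node-1": [(50, 180), (100, 120), (150, 40), (200, 0)],
    "app-node-2": [(50, 185), (100, 130), (150, 75), (200, 20)],
}


def exhaustion_points(samples_by_node, floor=0):
    """For each node, the first user load at which the free counter reaches the floor."""
    result = {}
    for node, samples in samples_by_node.items():
        result[node] = next(
            (users for users, free in samples if free <= floor), None
        )
    return result


print(exhaustion_points(FREE_THREADS))
# {'app-node-1': 200, 'app-node-2': None}
```

Here one node runs dry while the other does not, which is exactly the kind of imbalance you would miss by monitoring only a single node of the cluster.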
KPI Wealth: Gold Mines…Engineering Transactions
Let’s return to those automated user load scripts and build in some engineering transactions for analysis.
Most of your transactions up until now are probably user workflows which represent a real production workload. So now include scripts that have transactions which target the tiers of your deployment. Every deployment is unique. Here again are some examples:
Web tier: A transaction which GETs a static, non-cached file
App tier: A transaction which executes a method and creates objects but does not go to the DB tier
DB tier: A transaction which requires a query from the DB
Since these are all front end scripted transactions, some transactions will hit multiple tiers, but knowing which transactions have certain tiers excluded will allow for easier analysis. Really take your time and tease out tier-based transactions, as these will save you a ton of time in investigations later on. If you are unsure which transactions hit which tiers, ask the development or supporting infrastructure team.
Important: make each of these engineering transactions its own script so that you can graph its TPS and response time values independently of all the other business transactions.
Remember to add pause times into these engineering scripts to space out the intervals of execution. For example, a think time of 5 seconds placed before the engineering transaction (only 1 transaction in the script) will cause the transaction to execute every 5 seconds. This creates the sampling rate for gathering these metrics.
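In most load tools the think time is a built-in setting, but the idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: `transaction` stands for whatever single engineering transaction your script performs, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time


def run_engineering_script(transaction, iterations, think_time=5.0,
                           sleep=time.sleep, clock=time.perf_counter):
    """Run one transaction repeatedly, pausing think_time seconds before each run.

    Returns the measured duration of every execution, in seconds.
    """
    timings = []
    for _ in range(iterations):
        sleep(think_time)            # the pause sets the sampling interval
        start = clock()
        transaction()                # hypothetical callable, e.g. one HTTP GET
        timings.append(clock() - start)
    return timings


# Example (would take about 15 seconds with the default think time):
# timings = run_engineering_script(lambda: None, iterations=3)
```

Note that the effective interval is the think time plus the transaction's own response time, so under heavy load the samples will spread out a little.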
KPI Wealth: Gold Mines…Get Creative!
Here’s an area to show your creativity, and it will pay off in understanding the scalability of your application. I hope these ideas inspire you to use a creative approach in your hunt for KPIs.
I’ve come across applications which require instrumented KPIs. For example, a user initiates a transaction but the browser stays busy (typically polling or using AJAX) communicating with the back-end application. The “start” is detected and recorded, but the “stop” or even the “progress” isn’t so obvious or detectable by the tool. In that case, an instrumented code-level KPI can write a unique string into the server responses indicating the progress or the completion of a step.
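On the load script side, detecting such a string can be as simple as a pattern match on each polled response. The marker format below (`KPI_PROGRESS:<step>:<percent>` and `KPI_DONE:<step>`) is purely an assumption for this sketch; agree on the real format with your developers.

```python
import re

# Assumed marker format, e.g. "KPI_PROGRESS:report:40" or "KPI_DONE:report".
MARKER = re.compile(r"KPI_(PROGRESS|DONE):(\w+)(?::(\d+))?")


def parse_marker(body):
    """Return (state, step, percent_or_None) if a marker is present, else None."""
    match = MARKER.search(body)
    if match is None:
        return None
    state, step, pct = match.groups()
    return (state, step, int(pct) if pct is not None else None)


def polls_until_done(responses, step):
    """1-based index of the first response that marks the step complete, or None."""
    for index, body in enumerate(responses, start=1):
        marker = parse_marker(body)
        if marker is not None and marker[0] == "DONE" and marker[1] == step:
            return index
    return None


polled = ["<p>working</p>", "<p>KPI_PROGRESS:report:40</p>", "<p>KPI_DONE:report</p>"]
print(polls_until_done(polled, "report"))  # 3
```

With the completion detected, the script can record the "stop" of the transaction at the moment the matching response arrives.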
Missing KPI pieces: Perhaps not all of the moving parts were caught during the review of the architectural diagram. Spin up a fast ramping load test (we don’t care about the results) and see what processes and OS activities spin up. If you notice an external process and have no idea what it is doing… Ask! It could be an illuminating KPI candidate.
Often, the most custom and relevant KPIs come from the in-house owners of the application: the IT team supporting the environment. Engage individually with each member of the group. Open up the lines of communication and take a bedside approach here. Pose this same question to each group member: “If I needed to monitor something in your tier of the environment which gives a clear representation of how ‘busy’ the system is and another KPI which indicates a resource ‘depletion’ … what would they be?” Their answers are priceless. Work smarter, not harder. Add these KPIs to your test harness. Sometimes a unique KPI can’t be polled or graphed; maybe the app throws an exception under a certain condition. Kindly ask development to instrument the code to direct these types of errors, along with a timestamp, to a designated file (instead of something like standard out). This will make your job easier when correlating performance issues, because the “hit rate” of these errors can then be graphed.
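Turning such a timestamped error file into a graphable “hit rate” takes very little code. This sketch assumes each line starts with an ISO 8601 timestamp followed by a space; the exception name and log lines are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Assumed line format: "<ISO timestamp> <exception name> <message>".
LOG_LINES = [
    "2024-01-01T10:00:05 PoolExhaustedException no free connections",
    "2024-01-01T10:00:40 PoolExhaustedException no free connections",
    "2024-01-01T10:01:10 PoolExhaustedException no free connections",
]


def errors_per_minute(lines):
    """Count error lines per minute, keyed by 'YYYY-MM-DD HH:MM'."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        stamp = line.split(" ", 1)[0]
        minute = datetime.fromisoformat(stamp).strftime("%Y-%m-%d %H:%M")
        counts[minute] += 1
    return dict(counts)


print(errors_per_minute(LOG_LINES))
# {'2024-01-01 10:00': 2, '2024-01-01 10:01': 1}
```

Plot these counts on the same time axis as your user load and the correlation, if there is one, usually jumps out.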
During one of my performance projects, I had virtually no KPIs. The requests were sent to a “sink hole” and I had no idea how to quantify the throughput or performance. I had to get creative. The company had some OS monitoring hooked up and some queuing monitors in place, so I studied all the counters being monitored and spotted some trends which related to the workload. I then integrated these into the performance test harness. So when you are hunting for KPIs, take a look at the metrics already monitored in production if the app is already deployed.
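If there are many counters to study, a simple correlation against the workload can help shortlist candidates before you eyeball the graphs. This is only a sketch of one possible screening approach (a Pearson correlation), not the method I used on that project, and the counter names and values are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def rank_counters(workload, counters):
    """Sort counters by how strongly they trend with (or against) the workload."""
    scored = [(name, pearson(workload, values)) for name, values in counters.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical samples taken at four user-load levels.
users = [10, 20, 30, 40]
monitored = {
    "queue_depth": [5, 11, 14, 21],
    "cpu_idle_pct": [90, 70, 50, 30],
    "uptime_flag": [1, 1, 1, 1],
}
for name, r in rank_counters(users, monitored):
    print(name, round(r, 2))
```

A strongly positive value suggests a hit-rate style KPI, a strongly negative one suggests a free-resource style KPI, and values near zero are probably not worth graphing.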
Next, prove your KPIs’ worth!