What are you testing when you test performance?
I was working on a project recently that had a set of tests titled "Benchmark Tests." "Oh, nice," I thought, "someone was thinking about performance when they wrote this."
Then I ran the tests and realized that they were only checking that the web server's average response time stayed under 5 seconds. An eon in web server time!
The person who wrote this test was thinking of performance when they wrote it, but they weren't thinking about what they were testing. Is 5 seconds really an acceptable average response time?
The intent behind these tests was good, but the benchmark didn't do much other than confirm that some code executed.
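To make that concrete, here is roughly what such an assertion looks like. This is a hypothetical reconstruction, not the project's actual code: the endpoint, request count, and test framework are all stand-ins.

```ruby
require "minitest/autorun"
require "net/http"

class BenchmarkTest < Minitest::Test
  def test_average_response_time
    # Time a batch of requests against a locally running server.
    durations = 100.times.map do
      start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      Net::HTTP.get(URI("http://localhost:9292/"))
      Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    end

    # A 5-second average is so lenient that this effectively only proves
    # the server answered at all.
    assert_operator durations.sum / durations.size, :<, 5.0
  end
end
```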
With that in mind, here are a few principles I use when writing end-to-end and performance tests.
Test What Your Code Does
Tests age out of use. Code changes, and tests may start returning false positives. Take a moment to read the test and ask yourself if it actually tests the code you are looking at. What does this code do? What are its inputs and outputs? Is the test using the correct inputs and outputs? Is the test exercising the code in question or some other behaviour?
In the benchmark tests I mentioned above, the tests relied on a single-threaded, single-worker Puma process to handle hundreds of thousands of web requests in order to test a Ruby gem that executed during a web request. The benchmark against the code without the gem was inaccurate because it was really measuring overall web performance, which includes the server running the tests, Puma itself, and the test runner's ability to send multiple requests.
A better test here is one that checks that the output of the library arrives in a timely manner and does not block HTTP responses. I updated this test to run multiple Puma workers with two threads each. Then I sent fewer concurrent requests and set an expectation that the gem executes the correct number of times.
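For reference, the worker and thread settings above boil down to a couple of lines of Puma configuration (the port and file path here are illustrative):

```ruby
# config/puma.rb used by the test suite
workers 2      # separate worker processes, so one busy worker can't stall every request
threads 2, 2   # minimum and maximum threads per worker
port 9292
```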
In the end I changed this test from a performance test to an end-to-end integration test that ensures the data sent to the server matches the output from the gem.
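A sketch of what that end-to-end expectation can look like, assuming a hypothetical `reported_events` helper that reads back whatever the gem sent to the collection endpoint:

```ruby
require "minitest/autorun"
require "net/http"

class GemIntegrationTest < Minitest::Test
  REQUEST_COUNT = 50

  def test_gem_reports_once_per_request
    REQUEST_COUNT.times { Net::HTTP.get(URI("http://localhost:9292/")) }

    # `reported_events` is a hypothetical helper that returns the data the
    # gem sent to the server during the run.
    events = reported_events
    assert_equal REQUEST_COUNT, events.size
  end
end
```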
Don't Test the Computer Your Code Runs On
Performance testing can be tricky. There are so many variables outside of the code you write that can impact performance. In the Ruby Gem I was working on, the tests passed with a 95th percentile response time of under 15 milliseconds on my laptop. But on a GitHub Actions runner, the P95 went up to 25 milliseconds. You can't expect the exact same conditions each time.
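As an aside, the percentile figures above are easy to reproduce from a list of timings. Here's a minimal sketch using the nearest-rank method (the project's exact calculation is an assumption on my part):

```ruby
# Nearest-rank percentile over a list of response times in seconds.
def percentile(samples, pct)
  sorted = samples.sort
  index = ((pct / 100.0) * sorted.size).ceil - 1
  sorted[index.clamp(0, sorted.size - 1)]
end

response_times = [0.012, 0.009, 0.031, 0.008, 0.015] # illustrative data
puts "P95: #{(percentile(response_times, 95) * 1000).round(2)} ms"
```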
Similarly, I found that the test with the Ruby gem frequently ran faster than the test without it, which is almost pure chance.
A more thorough benchmark would exercise the entrypoint of the Ruby gem directly rather than putting a Puma server in front of it. It would then run many iterations in a near-identical environment and compare the results against a previous, known result that sets the bar for an acceptable response time.
This way the test would compare against concrete results and reduce the chance that the machine it runs on skews the outcome.
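A sketch of what that could look like, assuming a hypothetical `MyGem.process` entrypoint and a previously recorded baseline; the names and numbers are stand-ins, not the project's real API:

```ruby
require "benchmark"

ITERATIONS = 100_000
BASELINE_SECONDS = 0.85 # previously recorded result for this iteration count
payload = { "event" => "request.finished", "duration_ms" => 12 }

elapsed = Benchmark.realtime do
  ITERATIONS.times { MyGem.process(payload) }
end

# Allow some headroom so normal machine noise doesn't fail the build.
if elapsed > BASELINE_SECONDS * 1.2
  abort format("Possible regression: %.2fs vs. baseline %.2fs", elapsed, BASELINE_SECONDS)
end
```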