Flaky tests are tests that produce inconsistent and non-deterministic results, sometimes passing and sometimes failing. They can undermine the reliability of testing processes and complicate software development by masking real issues and wasting time.
Flaky tests are particularly difficult to debug and fix because of their non-determinism. We cannot simply go through the usual development cycle of test → modify → re-test. This is because we cannot reliably reproduce the error on each re-test, and thus, cannot know whether any single modification has corrected it.
Rather than going all in on one tactic for flaky tests, I prefer to have a grab-bag of techniques at my disposal. I'll pick one or more tactics from this grab-bag, based on the situation and context.
In this article I'll share my grab-bag of techniques. These are tactics I've used myself or seen used by others with success.
First we try to reproduce and diagnose the flakiness.
Then, once we have a (hopefully firm) notion of the cause, we can apply solutions or mitigations.
The examples are in Jest and Playwright, as that is what I use in most of my work environments, but similar principles likely apply to other tools.
But before diving into tactics, let's take a brief step back and look at first reproducing the flakiness.
Fundamental to addressing any kind of software bug is reproducing it.
But how do you reproduce a flaky test? As discussed above, flaky tests are difficult to reproduce consistently because their failure is non-deterministic: sometimes they fail, sometimes not.
There are a couple of options here:
We cannot reproduce the failure on a single run but we might have a chance on multiple runs.
Assuming Jest and a test script, we can use a command like this to repeatedly run a test:
for i in {1..20}; do
  # Use braces rather than a subshell so that `break` exits the loop on the first failure
  npm run test -- '{test_file}' --no-watch || { echo "Failed after $i attempts"; break; }
done
(Replace {test_file} with your test path and filename).
Some test frameworks provide this re-running capability out-of-the-box. Here's how to do it with the Playwright CLI:
npx playwright test {test_file} --repeat-each=20 --fail-on-flaky-tests
To increase the failure rate for reproduction purposes, we can simulate failure conditions.
For example:
These failure conditions will give us more failures, which may result in faster diagnosis of the cause.
Some technologies for enacting these simulations include:
Using VirtualBox or similar, we can configure limited resources. VirtualBox allows limiting CPU count and processing cap.

Using Docker or similar, we can configure limited resources. Docker allows this via CPU arguments.
For example, we could create a Dockerfile for our app:
FROM node:22.12.0-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
And use it to run our tests with constrained CPU, like this (assuming the image was built with the tag tests):
docker run --cpus 0.5 tests npm run test
In Jest, we can try one or more of:
- --no-cache
- --runInBand
- --maxWorkers
Alternately/additionally, we can try to gather information about the failures we have had so far.
For example, if a test is being flaky in CI, we can gather log output from the CI environment and examine it to search for clues.

Maybe we can learn something about the cause of the failures by observing phenomena related to the failures.
| 👀 Phenomena | 🚨 Implications |
|---|---|
| Test fails at a certain time of day | Issue with date/time logic |
| Test fails only when modified | Issue with caching |
| Test fails when the CI server is being heavily utilised | Issue with test being vulnerable to resource availability |
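To spot a time-of-day pattern like the first row, we could bucket past failures by hour. This is only a sketch: the TestRun shape and the sample data are made up, and a real CI system would export its history in its own format.

```typescript
// Hypothetical shape of a CI run record; adapt to whatever your CI exports.
interface TestRun {
  startedAt: string; // ISO 8601 timestamp
  passed: boolean;
}

// Count failures per UTC hour (0-23) so that clusters stand out.
function failuresByHour(runs: TestRun[]): Map<number, number> {
  const counts = new Map<number, number>();
  for (const run of runs) {
    if (run.passed) continue;
    const hour = new Date(run.startedAt).getUTCHours();
    counts.set(hour, (counts.get(hour) ?? 0) + 1);
  }
  return counts;
}

// Made-up sample data: both failures happen just before midnight UTC.
const sampleRuns: TestRun[] = [
  { startedAt: "2025-01-15T23:58:00Z", passed: false },
  { startedAt: "2025-01-16T09:10:00Z", passed: true },
  { startedAt: "2025-01-16T23:59:30Z", passed: false },
  { startedAt: "2025-01-17T14:00:00Z", passed: true },
];

console.log(failuresByHour(sampleRuns));
```

A cluster around one hour (here, hour 23) would hint at date/time logic, such as a test that crosses a day boundary.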
Once we are able to reproduce flakiness, we can move to diagnosis, to uncover the root cause.
Similar to diagnosing regular bugs, we can diagnose flaky tests by making small changes and measuring the results. With flaky test rates, rather than a single pass/fail, we measure the overall pass/fail rate. A significantly lower percentage of failures can be correlated with a code change to help uncover the cause of the failure.
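As a rough illustration of measuring rates rather than single results, here is a small helper that turns a list of run outcomes into a failure rate. The outcome arrays are invented for the example; in practice they would come from repeated runs before and after a change.

```typescript
// Each entry is the outcome of one run of the same test: true = pass, false = fail.
function failureRate(outcomes: boolean[]): number {
  if (outcomes.length === 0) return 0;
  const failures = outcomes.filter((passed) => !passed).length;
  return failures / outcomes.length;
}

// Made-up outcomes for 10 runs before and after a candidate fix.
const outcomesBefore = [true, false, true, true, false, true, false, true, true, true];
const outcomesAfter = [true, true, true, true, true, true, true, false, true, true];

console.log(`before: ${(failureRate(outcomesBefore) * 100).toFixed(0)}% failed`);
console.log(`after: ${(failureRate(outcomesAfter) * 100).toFixed(0)}% failed`);
```

With only a handful of runs the numbers are noisy, so a drop like this is a hint rather than proof. More runs give a more trustworthy comparison.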
Some of the usual diagnostic techniques can be applied:
- git bisect or similar to compare results between recent versions of the branch
This section covers possible solutions to flaky tests.
Some solutions might become evident from examining output after reproducing flakiness. In other cases, it might be worth experimenting with various solutions in a "try-and-see" approach.
Problem: Operations occur before the DOM has completed loading. For example, the test tries to click a button inside a dialog before the dialog has loaded.
Solution: Wait until elements have been rendered before performing operations that depend on them.
❌
const deleteButton = screen.getByText("Delete");
await userEvent.click(deleteButton);
const confirmButton = screen.getByRole("button", { name: "Confirm" });
await userEvent.click(confirmButton);
✅
const deleteButton = screen.getByText("Delete");
await userEvent.click(deleteButton);
await screen.findByRole("dialog");
const confirmButton = await screen.findByRole("button", { name: "Confirm" })
await userEvent.click(confirmButton);
Problem: Test operations are being done before Promises on which they rely have been completed. For example, an async API call is made, but the test runs an operation that depends on the API result before the promise has completed.
Solution: Wait for calls to have been made, using, say, a "completed" flag.
❌
jest.spyOn(accountsApi, "getAccounts").mockResolvedValue([{
accountId: 1234,
isPrimary: true,
}]);
jest.spyOn(accountDetailsApi, "getAccountDetails").mockResolvedValue({
accountId: 1234,
balance: 10,
accountName: "Jack's Primary Account",
});
render(<PrimaryAccountDetails />);
const heading = screen.getByRole("heading", { level: 2 });
expect(heading).toHaveTextContent("Jack's Primary Account");
✅
jest.spyOn(accountsApi, "getAccounts").mockResolvedValue([{
accountId: 1234,
isPrimary: true,
}]);
jest.spyOn(accountDetailsApi, "getAccountDetails").mockResolvedValue({
accountId: 1234,
balance: 10,
accountName: "Jack's Primary Account",
});
render(<PrimaryAccountDetails />);
await waitFor(() => {
expect(accountsApi.getAccounts).toHaveBeenCalled();
});
await waitFor(() => {
expect(accountDetailsApi.getAccountDetails).toHaveBeenCalled();
});
const heading = screen.getByRole("heading", { level: 2 });
expect(heading).toHaveTextContent("Jack's Primary Account");
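To make the idea of "waiting" more concrete, here is a simplified polling helper in the spirit of waitFor. It is an illustration only, not Testing Library's actual implementation; the name waitUntil and its options are made up for this sketch.

```typescript
// Retries `check` until it stops throwing or the timeout elapses.
async function waitUntil<T>(
  check: () => T,
  { timeoutMs = 1000, intervalMs = 50 } = {},
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      return check();
    } catch (error) {
      // Give up and surface the last error once the deadline has passed.
      if (Date.now() >= deadline) throw error;
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
}

// Usage: state that becomes ready asynchronously (stands in for a resolved API call).
let accountName: string | undefined;
setTimeout(() => {
  accountName = "Jack's Primary Account";
}, 100);

const ready = waitUntil(() => {
  if (accountName === undefined) throw new Error("not loaded yet");
  return accountName;
});
ready.then((loadedName) => console.log(loadedName));
```

The point is that the assertion is retried until the asynchronous work has settled, instead of running once at an arbitrary moment.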
Problem: A test is very long and, thus, times out before being completed.
Solution: Reduce test size.
Even if we prefer longer integration-style tests, as recommended by Kent C. Dodds in "Write fewer, longer tests", there can still be ways to reduce the size of our tests while preserving their scope.
For example:
Problem: Test files are large. Processing each large file ties up system resources (especially processor usage), causing other tests to time out.
Solution: Reduce test file size.
One way is to split up test files by function or component.
- orders-pending.test.tsx
- orders-delivered.test.tsx
- orders-cancelled.test.tsx
- orders-previous.test.tsx
Or we could use a numbering or lettering system.
- small-test-01.test.tsx
- small-test-02.test.tsx
- small-test-03.test.tsx
- small-test-a.test.tsx
- small-test-b.test.tsx
- small-test-c.test.tsx
Problem: Test runners such as Jest may be greedy and fail to balance resource usage between tests when running many tests in parallel on resource-constrained environments such as CI.
Solution: Check the CPU configuration of the CI if possible. Try to reduce the number of simultaneously running tests by configuring the test runner.
Contrary to our human intuition ("more is better"), it may be better to reduce the number of simultaneously running tests. This is because test runners can be greedy and consume as many resources as possible at any given time (CPU, memory, etc). Running multiple tests at once can cause resource usage to become imbalanced, as tests compete with each other for resources.
The solution may be to reduce the maximum number of workers. Jest allows this to be configured via the maxWorkers setting. If maxWorkers exceeds the number of CPU cores on the machine, the processor is forced to split execution time between multiple threads, so tests may take longer than expected and time out. This problem may occur only in CI environments, where CPU resources tend to be more limited, which makes it tricky to identify. Try reducing the maxWorkers setting, or removing it altogether. (In single-run mode, Jest defaults to the number of available cores minus one for the main thread, which is usually the safest bet.)
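If we do want to set the value explicitly, one possible rule of thumb is sketched below. The function name and the reservedCores parameter are assumptions for illustration, and the right numbers depend on the CI machine.

```typescript
// Pick a worker count that leaves some cores free and never drops below 1.
// `reservedCores` is an assumption to tune for your CI machines.
function chooseMaxWorkers(availableCores: number, reservedCores = 1): number {
  return Math.max(1, Math.floor(availableCores) - reservedCores);
}

console.log(chooseMaxWorkers(2)); // 1
console.log(chooseMaxWorkers(8)); // 7
```

The result could be passed to Jest as --maxWorkers=<n>. Jest also accepts a percentage such as --maxWorkers=50%. Note that inside containers the core count reported to Node may be the host's rather than the container's limit, so it is worth checking the CI's actual CPU allocation.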
Problem: Tests leave behind "uncollected garbage", such as memory usage, threads, promises, etc. This slows down the test suite as a whole, making some tests flaky.
Solution: Cleaning up after each test reduces resource demand on the test runner, which reduces the occurrence of flakiness.
Note that the flaky test itself might not be the culprit here, but rather, some or all other test(s) as a whole generating garbage. This uncollected garbage might only be noticeable when all tests are run together in CI, not when running an individual test on its own. This can make the "uncollected garbage" issue tricky to detect. It might only be detectable by trial-and-error – say, observing resource usage on the whole test suite over repeated runs.
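One way to make cleanup harder to forget is to register a disposer at the moment a resource is created. The CleanupStack class below is a hypothetical helper written for this sketch, not part of Jest or any library.

```typescript
type Disposer = () => void;

// Collects cleanup callbacks during a test and runs them in reverse order afterwards.
class CleanupStack {
  private disposers: Disposer[] = [];

  get size(): number {
    return this.disposers.length;
  }

  add(disposer: Disposer): void {
    this.disposers.push(disposer);
  }

  runAll(): void {
    let dispose = this.disposers.pop();
    while (dispose !== undefined) {
      dispose();
      dispose = this.disposers.pop();
    }
  }
}

const cleanup = new CleanupStack();

// Example resource: an interval that would otherwise keep running after the test.
const handle = setInterval(() => {}, 1000);
cleanup.add(() => clearInterval(handle));

// In a real suite this call would live in an afterEach hook.
cleanup.runAll();
```

Running disposers in reverse order mirrors how resources usually depend on each other: the thing created last is torn down first.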
Most test frameworks allow timeouts to be configured.
Increasing the timeout allows tests to run longer without failing, which may solve flakiness.
Downside: Increasing timeouts too much or globally might allow performance issues to creep into the application. Timeouts should be increased only on flaky tests if possible, and we should find ways to ensure those features continue to perform adequately for end-users.
Problem: The application we're testing is itself buggy or just slow. If the application itself is slow, then the automated tests that exercise it will probably also be slow, leading to flakiness.
Solution: Find slow points in the application under test and optimise their performance.
To find slow points, we can add timer statements to different parts of the test or application.
console.time("Fetch user details");
const userDetails = await fetchUserDetails();
console.timeLog("Fetch user details");
We can also try rigorous manual testing, combined with performance tooling, such as the Chrome DevTools Performance tab.
Techniques to improve performance can then be applied – see: improving performance in React apps.
So maybe we've tried all the above and nothing has worked. In that case, we can consider mitigation – approaches that reduce the impact of the problem without solving it entirely.
These might be used temporarily as an emergency resort or permanently if considered a reasonable compromise.
Problem: Some features of our application may be resource-intensive, causing flakiness, while not offering much value in an automated testing context.
Mitigation: Disable resource-intensive features for test environment only.
Certain application features may be inherently resource intensive and not needed to verify correctness for a given automated test.
Common examples:
These features can be disabled only for test execution, via, say, feature flags.
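A minimal sketch of such a flag check might look like this. The flag names and the list of features disabled in tests are hypothetical; substitute whatever is expensive in your own application.

```typescript
// Hypothetical flag names for illustration.
type FeatureFlag = "animations" | "analytics" | "liveChat";

// Features assumed to be costly and irrelevant to correctness in automated tests.
const disabledInTests: ReadonlySet<FeatureFlag> = new Set<FeatureFlag>([
  "animations",
  "analytics",
]);

// `environment` would typically come from something like NODE_ENV.
function isFeatureEnabled(flag: FeatureFlag, environment: string): boolean {
  if (environment === "test") return !disabledInTests.has(flag);
  return true;
}

console.log(isFeatureEnabled("animations", "test")); // false
console.log(isFeatureEnabled("animations", "production")); // true
console.log(isFeatureEnabled("liveChat", "test")); // true
```

Jest sets NODE_ENV to "test" when it is not already set, so passing process.env.NODE_ENV as the environment is one simple way to wire this up.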
Problem: Flakiness produced by resource-constrained environments is not worth the cost savings of the resource constraints.
Mitigation: Increase resources to get better value for investment, such as higher developer productivity during a critical period.
Depending on the cause, test flakiness might be drastically reduced in the short-term by simply beefing up resources on the test runners. Depending on the organisation, business context, timeframe, etc. this might be an optimal approach.
For example, suppose a legacy system is scheduled to be decommissioned in a few weeks, with a newer, totally re-written version already performing well in canary testing and ready to be rolled out next week. If the legacy system has a lot of flaky tests, blocking developers from deploying changes during that short space of a few weeks, it might make sense to increase resources just to unblock developers. Engineer time is more valuable and costly than brute resource usage.
Or suppose the business context is seasonally sensitive, such as an online retailer experiencing very high demand during holiday periods. During this period there is a high velocity of new feature releases, requiring a large number of automated tests of varying quality to run smoothly. Here, trading off resource cost for feature velocity might be worthwhile, at least during the peak period.
Problem: Flaky tests block the whole pipeline, interfering with delivery velocity.
Mitigation: Separate flaky tests from non-flaky tests, to ensure that they run correctly or at least do not disrupt other tests.
Many test frameworks allow tags to be applied to tests, allowing those tests to be grouped and treated as a unit, for separate execution, separate configuration, or some other kind of separation.
Flaky tests, once identified, can be grouped in this way for special treatment.
In Playwright, test tags can be included in the test name:
test('test full report @flaky', async ({ page }) => {
// ...
});
In Jest, a similar effect can be achieved by passing a carefully written regex to the testNamePattern config setting:
package.json:
"scripts": {
...,
"test": "jest --testNamePattern='^(?!.*@flaky).*$'",
"test-flaky": "jest --testNamePattern='^(.*@flaky).*$'",
...,
},
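Regexes with lookaheads are easy to get wrong, so it can help to check them against a few test names before relying on them. The test names below are invented for the example.

```typescript
// The two patterns from the scripts above, as JavaScript regular expressions.
const nonFlakyPattern = /^(?!.*@flaky).*$/;
const flakyPattern = /^(.*@flaky).*$/;

const testNames = [
  "orders renders pending orders",
  "orders test full report @flaky",
];

for (const testName of testNames) {
  console.log(testName, {
    nonFlaky: nonFlakyPattern.test(testName),
    flaky: flakyPattern.test(testName),
  });
}
```

The first name matches only the non-flaky pattern and the second only the flaky one, so the two scripts split the suite without overlap.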
Once separated, flaky tests might be treated in various ways:
Full end-to-end browser tests are known to be more flaky than traditional unit or unit-style integration tests. This is due to the performance overhead of loading a whole browser, loading the whole application at once, triggering interactions with whole DOM elements and waiting for feedback.
We could instead shift some of these tests to integration-style unit tests. Described in Kent C. Dodds's famous article Static vs Unit vs Integration vs E2E Testing for Frontend Apps, these tests can cover entire user flows (such as logging in) while mocking the calls that could otherwise cause flakiness, such as server-side API calls.
Another option, for tests that target intermittent but approximately deterministic behaviour, is to use fuzzy logic to verify that behaviour. For example, suppose we need to exercise some behaviour that operates on the current date and time, but for some reason cannot control the current date and time by mocking. If the test assertion does not need to have millisecond-level precision, perhaps we could instead assert against a range considered correct.
❌
expect(timeAfterClickPause.getTime())
.toEqual(new Date(2026, 3, 1, 13, 1, 1).getTime());
✅
expect(timeAfterClickPause.getFullYear()).toEqual(2026);
expect(timeAfterClickPause.getMonth()).toEqual(3);
expect(timeAfterClickPause.getDate()).toEqual(1);
expect(timeAfterClickPause.getHours()).toEqual(13);
expect(timeAfterClickPause.getMinutes()).toEqual(1);
expect(timeAfterClickPause.getSeconds()).toEqual(1);
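An alternative way to express a range is an explicit tolerance in milliseconds. The helper below is a sketch with a made-up name; the one-second tolerance is an arbitrary choice for the example.

```typescript
// True when `actual` is within `toleranceMs` milliseconds of `expected`.
function isWithinTolerance(actual: Date, expected: Date, toleranceMs: number): boolean {
  return Math.abs(actual.getTime() - expected.getTime()) <= toleranceMs;
}

const expectedTime = new Date(2026, 3, 1, 13, 1, 1);
const slightlyLate = new Date(expectedTime.getTime() + 350);
const tooLate = new Date(expectedTime.getTime() + 5000);

console.log(isWithinTolerance(slightlyLate, expectedTime, 1000)); // true
console.log(isWithinTolerance(tooLate, expectedTime, 1000)); // false
```

In a Jest test this could be asserted with expect(isWithinTolerance(...)).toBe(true), or with the built-in toBeGreaterThanOrEqual and toBeLessThanOrEqual matchers on the timestamps.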
If our automated test is trying to exercise something that is inherently prone to intermittent failure, with no acceptable margin of error, perhaps we need a different method of verification altogether.
For example, there is probably no good way to write an automated test for generating the next Bitcoin hash on the official fork. (Until/unless we get quantum computing in the cloud, in which case, any crypto-based business model might be in jeopardy!) For this case, we would probably need to wait until we have a large and engaged enough user base and then apply observability.
Various methods of verification that might fit the scenario:
git bisect or similar).
Flaky tests undermine testing processes, developer morale and ultimately product reliability. So it's important to address them. Unfortunately, fixing flaky tests can be more difficult than fixing consistently failing tests, due to their non-determinism.
Difficulties reproducing flaky tests can be addressed by:
Flaky tests can be dealt with by: