
Analytics & Metrics

Gaffer provides analytics to help you understand your test suite’s health and identify problems before they slow your team down. Our metrics are designed around one principle: helping developers stay productive.

As test suites grow, they often become a source of friction rather than confidence. Tests that randomly pass or fail waste developer time. Slow feedback loops block PRs. And without visibility into trends, problems compound silently until the whole suite becomes a burden.

Gaffer’s analytics give you the visibility to catch these issues early and keep your test suite working for you, not against you.

Health Score

The Health Score is a single number (0-100) that summarizes your test suite’s overall condition. It’s designed to answer the question: “How much can I trust my tests right now?”

The score is calculated from three factors:

  • Pass Rate (60%) - Higher pass rates contribute positively
  • Flaky Test Percentage (30%) - Fewer flaky tests means a higher score
  • Trend Direction (10%) - Improving trends boost the score; declining trends reduce it

Pass rate carries the most weight because it’s the most direct signal of whether your tests are doing their job. A failing test suite needs attention regardless of other factors. Flaky tests still have significant impact at 30% because they erode trust and waste developer time even when the overall pass rate looks healthy.
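Gaffer doesn’t publish the exact formula beyond the weights above, but a minimal sketch of the weighting (assuming each factor is scaled linearly; `health_score` is an illustrative helper, not Gaffer’s API) might look like:

```python
def health_score(pass_rate, flaky_fraction, trend):
    """Illustrative health score from the three weighted factors.

    pass_rate      -- fraction of executed tests that passed, 0.0-1.0
    flaky_fraction -- fraction of tests flagged as flaky, 0.0-1.0
    trend          -- -1 (declining), 0 (stable), +1 (improving)
    """
    score = (
        60 * pass_rate                # Pass Rate (60%)
        + 30 * (1 - flaky_fraction)   # Flaky Test Percentage (30%)
        + 10 * (trend + 1) / 2        # Trend Direction (10%)
    )
    return round(score)

# A suite with a 90% pass rate, 10% flaky tests, and an improving trend:
print(health_score(0.9, 0.1, 1))  # 91
```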

The score maps to five bands:

  • 90-100 - Excellent. Your test suite is reliable and trustworthy.
  • 75-89 - Healthy. Minor issues to address, but tests are useful.
  • 50-74 - Needs Attention. Flaky tests or failures are impacting productivity.
  • 25-49 - At Risk. Significant issues undermining test value.
  • 0-24 - Critical. Tests may be causing more harm than good.
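The banding above amounts to a simple threshold lookup; `health_label` here is an illustrative name:

```python
def health_label(score):
    """Map a 0-100 health score to its band, per the ranges above."""
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Healthy"
    if score >= 50:
        return "Needs Attention"
    if score >= 25:
        return "At Risk"
    return "Critical"

print(health_label(82))  # Healthy
```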

Flaky Test Detection

A test is marked as “flaky” when it flip-flops between passing and failing without any code changes. These tests are particularly harmful because they:

  • Force developers to re-run CI pipelines, wasting time and compute
  • Erode trust in the test suite (“it’s probably just flaky, merge anyway”)
  • Hide real failures in noise

We detect flakiness using a flip rate algorithm. For each test, we track the sequence of pass/fail results across runs. A “flip” occurs when a test changes from pass to fail (or vice versa) between consecutive runs.

The flip rate is calculated as:

flip_rate = number_of_flips / (total_runs - 1)

For example, if a test has results [pass, fail, pass, pass, fail] over 5 runs, that’s 3 flips in 4 transitions = 75% flip rate.

By default, tests with a flip rate of 10% or higher are flagged as flaky. You can adjust this threshold in Settings > Analytics based on your team’s tolerance.

To avoid false positives, we require at least 5 runs before flagging a test as flaky. A single failure in 2 runs might be a real bug; a pattern across 5+ runs is a signal.
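The flip rate rule translates directly into code. This sketch uses the default 10% threshold and 5-run minimum described above; `flip_rate` and `is_flaky` are illustrative names, not Gaffer’s API:

```python
def flip_rate(results):
    """results: pass/fail booleans in run order (True = pass)."""
    if len(results) < 2:
        return 0.0
    # A flip is any change between consecutive runs.
    flips = sum(a != b for a, b in zip(results, results[1:]))
    return flips / (len(results) - 1)

def is_flaky(results, threshold=0.10, min_runs=5):
    # Require at least 5 runs before flagging, to avoid false positives.
    return len(results) >= min_runs and flip_rate(results) >= threshold

runs = [True, False, True, True, False]  # pass, fail, pass, pass, fail
print(flip_rate(runs))  # 0.75 -- 3 flips over 4 transitions
print(is_flaky(runs))   # True
```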

There’s ongoing debate in the testing community about what “flaky” really means. Is it a race condition? A test environment issue? A genuine intermittent bug in production code?

Our perspective: it doesn’t matter for developer productivity. Whether a test fails randomly due to timing issues, external dependencies, or cosmic rays, the impact is the same - developers waste time investigating, re-running, and eventually ignoring it.

Gaffer flags these tests so you can decide what to do: fix them, quarantine them, or delete them. The goal is to keep your test suite providing reliable signal, not noise.

Pass Rate

Pass rate is the percentage of tests that passed, calculated as:

pass_rate = passed_tests / (passed_tests + failed_tests)

Skipped tests are excluded from this calculation since they don’t represent actual test execution.
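The formula and the skipped-test exclusion can be sketched as follows (`pass_rate` here is an illustrative helper):

```python
def pass_rate(passed, failed, skipped=0):
    """Fraction of executed tests that passed.

    Skipped tests are excluded: only pass/fail results count as
    actual test execution.
    """
    executed = passed + failed
    return passed / executed if executed else 0.0

# 190 passed, 10 failed, 25 skipped -> skips don't dilute the rate:
print(pass_rate(190, 10, skipped=25))  # 0.95
```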

We track pass rate over 30 days and show the trend direction (improving, stable, or declining) to help you spot regressions early.

Configuration

You can configure analytics behavior in Settings > Analytics:

  • Flaky Threshold - Adjust the flip rate percentage that triggers flaky detection (default: 10%)
  • Manual Recompute - Trigger an immediate analytics refresh instead of waiting for the next scheduled computation

Analytics are pre-computed every 4 hours to ensure fast dashboard loading. When you upload new test reports, the data will be reflected in the next computation cycle.

If you need immediate results (e.g., after uploading a batch of historical reports), use the “Compute now” button in Settings > Analytics.

Health Alerts

Gaffer can notify your team when your test suite’s health changes. Alerts are evaluated every 4 hours alongside the regular analytics computation.

When your health score drops across a label boundary (e.g., from “Healthy” to “Needs Attention”), Gaffer sends a notification with the previous and current scores. This catches regressions before they compound.

When new tests are detected as flaky for the first time, Gaffer sends a notification listing the newly flaky tests. This lets your team investigate while the cause is still fresh.

To avoid notification fatigue, each alert type has a 24-hour cooldown per project. If your health score degrades multiple times within 24 hours, you’ll only receive the first notification.
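A per-project, per-alert-type cooldown like this can be sketched as below; `should_notify` and the in-memory `_last_sent` map are hypothetical, for illustration only:

```python
import datetime as dt

COOLDOWN = dt.timedelta(hours=24)
_last_sent = {}  # (project_id, alert_type) -> time of last notification

def should_notify(project_id, alert_type, now):
    """Suppress repeats of the same alert type within the cooldown window."""
    key = (project_id, alert_type)
    last = _last_sent.get(key)
    if last is not None and now - last < COOLDOWN:
        return False
    _last_sent[key] = now
    return True

t0 = dt.datetime(2024, 1, 1, 8, 0)
print(should_notify("proj-1", "health_drop", t0))                          # True
print(should_notify("proj-1", "health_drop", t0 + dt.timedelta(hours=4)))  # False
print(should_notify("proj-1", "health_drop", t0 + dt.timedelta(hours=25))) # True
```

Because the cooldown is keyed per project and per alert type, a flaky-test alert can still fire during a health-score cooldown, and other projects are unaffected.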

Health alerts are sent to your configured notification destinations. Gaffer supports Slack and Webhooks for health alerts. Configure destinations in Settings > Notifications.

Analytics work best when you have consistent test data flowing in. If you haven’t already: