
Why You Can't Average System Health


The $3 Billion Metric

Wells Fargo's fake-accounts scandal came to light in 2016 and ultimately cost the bank $3 billion in fines and settlements for a fraud scheme that was, at its core, a failure of single-metric thinking. Leadership set one goal: increase the "cross-sell ratio" - the number of products per customer. The mantra was "eight is great."

The metric looked fantastic. It went up consistently. Executives celebrated.

Meanwhile, employees were opening millions of fraudulent accounts to hit their targets.

The single number was green. The reality underneath was catastrophically red.

This is the watermelon effect - and it's not limited to banking scandals. It happens everywhere we try to compress multi-dimensional complexity into a single number.

Key Takeaways

  • Single scalar metrics inevitably hide important information when systems have multiple independent dimensions
  • Industry frameworks (DORA, SPACE, Flow) already recognize this - they track vectors, not averages
  • The "watermelon effect" happens when metrics look green on the outside but are red inside
  • Mathematical proof shows this isn't a measurement problem - it's a fundamental impossibility
  • Interactive dashboard below lets you explore these trade-offs yourself

Interactive Dashboard

Use the dashboard below to explore this phenomenon yourself. Adjust sliders, run operations, and watch how the scalar average (yellow dashed line) hides real improvements visible in the vector (colored area).

How to Use

  1. Adjust the sliders to set different complexity levels across four dimensions
  2. Click operations to improve individual dimensions
  3. Watch the scalar average (yellow dashed line) vs. the actual vector (colored area)
  4. Notice when improvements disappear in the scalar view

Try this: Start with balanced values (all 0.5). Now improve just the algorithmic dimension to 0.9. The vector clearly shows this improvement. But if the other dimensions stay low, the scalar average barely budges.
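
To make the arithmetic concrete, here is a minimal Python sketch of what that experiment computes (the dashboard itself is JavaScript; the numbers and dimension names below simply mirror the sliders):

```python
# Four complexity dimensions, all starting balanced at 0.5.
baseline = {"algorithmic": 0.5, "information": 0.5, "dynamical": 0.5, "geometric": 0.5}

# Improve only the algorithmic dimension to 0.9.
improved = dict(baseline, algorithmic=0.9)

def scalar_average(state):
    """Collapse the four-dimensional vector into one 'health score'."""
    return sum(state.values()) / len(state)

print(scalar_average(baseline))  # 0.5
print(scalar_average(improved))  # 0.6 - a 0.4 gain in one dimension surfaces as only 0.1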

Pre-loaded Scenarios

Click a scenario to load it into the dashboard

The Seduction of Single Numbers

We love single metrics. They're clean. Simple. Easy to track on a dashboard. You can plot them on a line chart, set targets, and measure progress.

But here's the uncomfortable truth: when a system has multiple independent dimensions of performance, averaging them into a single number doesn't just lose information. It fundamentally misrepresents the system's state.

Consider a software team. You could track:

  • Deployment frequency: How often do we ship?
  • Lead time for changes: How fast do commits reach production?
  • Change failure rate: What percentage of deployments cause incidents?
  • Mean time to recovery: How quickly do we fix problems?

These are the DORA metrics, backed by years of research from the DevOps Research and Assessment (DORA) team, now part of Google. That research found something crucial: high-performing teams excel at all four metrics. But you can't average them.

Why not? Because they represent fundamentally different trade-offs. You can deploy more frequently (deployment frequency improves) while accepting more incidents (change failure rate worsens). Or you can reduce failures (change failure rate improves) by slowing down deployments (deployment frequency worsens).

If you average these into a single "DevOps health score," both strategies look identical. The scalar hides the critical choice you're making.
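
Here is a hedged illustration of that point in Python. The numbers are invented, and the sketch assumes each DORA metric has already been normalized to a 0-1 scale where higher is better (raw DORA metrics have different units and directions):

```python
# Hypothetical teams with opposite strategies; values are illustrative only.
team_ship_fast = {
    "deployment_frequency": 0.9,
    "lead_time": 0.9,
    "change_failure_rate": 0.3,   # many deploys cause incidents
    "time_to_recovery": 0.5,
}
team_ship_safe = {
    "deployment_frequency": 0.3,
    "lead_time": 0.5,
    "change_failure_rate": 0.9,   # very few failed deploys
    "time_to_recovery": 0.9,
}

def devops_health_score(metrics):
    """A naive 'DevOps health score': the plain average of the normalized metrics."""
    return sum(metrics.values()) / len(metrics)

# Both teams collapse to the same scalar, hiding opposite trade-offs.
print(devops_health_score(team_ship_fast))  # 0.65
print(devops_health_score(team_ship_safe))  # 0.65
```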

Industry Already Knows This

The software industry has learned this lesson repeatedly.

DORA metrics (Google, 2014-present): Four separate metrics, explicitly not averaged. Research shows teams need to track all four independently to understand their actual performance.

SPACE framework (GitHub/Microsoft, 2021): Created because single metrics like "lines of code" or "velocity" were misleading. Five dimensions: Satisfaction, Performance, Activity, Communication, Efficiency. The paper explicitly states: "Developer productivity cannot be measured by a single metric."

Flow metrics (Mik Kersten, 2018): Tracks flow distribution, velocity, time, load, and efficiency across value streams. Connects to business results through four separate outcomes: value, cost, quality, happiness.

Notice the pattern? These frameworks all moved from scalars to vectors. Not because researchers were being pedantic, but because single numbers kept leading organizations astray.

The Watermelon Effect in the Wild

The term "watermelon metrics" comes from IT service management. SLA compliance looked green - 95% of tickets closed within target time! But when you talked to actual users, satisfaction was terrible. Red on the inside.

How did this happen? Simple: the metric measured time-to-close, not problem resolution. Support teams closed tickets to hit targets, whether the problem was solved or not.

This pattern repeats everywhere:

Code coverage: Teams hit 100% coverage while writing meaningless tests. Every line of code is "tested," but critical edge cases are missed. Martin Fowler warns: "I would be suspicious of anything like 100% - it would smell of someone writing tests to make the coverage numbers happy."

University rankings: Schools optimize for metrics (student-faculty ratio, admission rates) while educational quality doesn't necessarily improve. Gaming the ranking becomes more important than the underlying mission.

Performance reviews: Employees hit numerical targets while team culture deteriorates, knowledge sharing stops, or strategic initiatives get ignored.

The common thread? A scalar metric that looks good while the underlying reality is problematic. Green outside, red inside.

The Mathematical Reality

Here's where it gets interesting. This isn't just a measurement problem or a management failure. There's a mathematical proof that single scalars fundamentally cannot work.

The impossibility theorem (Sudoma, 2025) shows that no scalar function can simultaneously satisfy five reasonable requirements:

  1. Additivity: Independent improvements should add up
  2. Monotonicity: Improvements should never decrease the metric
  3. Strict Monotonicity: Real improvements should visibly increase the metric
  4. Task-Universality: Should work without constant recalibration
  5. Relabeling Invariance: Shouldn't depend on arbitrary naming choices
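
One way these requirements might be written down is sketched below. The notation is illustrative, not necessarily the paper's: e_i denotes the unit vector along dimension i, and c' ≥ c means no dimension got worse.

```latex
% Sketch of the five requirements for a scalar S over a complexity vector
% c = (c_1, ..., c_n); illustrative notation, not necessarily the paper's.
\begin{align*}
  &\text{Additivity:}          && S(c + \delta_i e_i + \delta_j e_j) - S(c)
        = \bigl[S(c + \delta_i e_i) - S(c)\bigr] + \bigl[S(c + \delta_j e_j) - S(c)\bigr], \quad i \neq j\\
  &\text{Monotonicity:}        && c' \geq c \;\Rightarrow\; S(c') \geq S(c)\\
  &\text{Strict monotonicity:} && c' \geq c,\ c' \neq c \;\Rightarrow\; S(c') > S(c)\\
  &\text{Task-universality:}   && \text{one fixed } S \text{ for every task, with no per-task recalibration}\\
  &\text{Relabeling invariance:} && S(c_{\pi(1)}, \ldots, c_{\pi(n)}) = S(c_1, \ldots, c_n) \text{ for every permutation } \pi
\end{align*}
```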

Why is this impossible? Consider a system with four independent complexity dimensions:

  • Algorithmic: How sophisticated are the algorithms?
  • Information: How much do components communicate?
  • Dynamical: How chaotic is the temporal behavior?
  • Geometric: How rich is the structural topology?

The proof constructs a 5-step cycle where each step improves one dimension. If you track these as a vector, you clearly see the improvements. But any scalar average either:

  • Fails to register the improvements (violates monotonicity), or
  • Creates contradictions where the end state is both better and identical to the start (logical impossibility)

The mathematical formalization shows that when you average the vector into a scalar, you lose a quantity Σδ representing the "signal loss" - real improvements that become invisible.
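
The Python sketch below is not the paper's five-step construction, just a numerical illustration of the signal-loss idea with invented values: each step contains a genuine improvement in one dimension, yet the scalar average reports only a fraction of it.

```python
import numpy as np

# Each step genuinely improves one dimension while a trade-off slightly
# degrades another - a common real-world pattern (not the paper's cycle).
steps = np.array([
    [+0.30, -0.10,  0.00,  0.00],   # better algorithms, a bit more coupling
    [ 0.00, +0.30, -0.10,  0.00],   # cleaner information flow, slightly wilder dynamics
    [ 0.00,  0.00, +0.30, -0.10],   # tamer dynamics, slightly messier topology
    [-0.10,  0.00,  0.00, +0.30],   # richer structure, minor algorithmic cost
])

state = np.full(4, 0.5)
for delta in steps:
    state = state + delta

dimension_level_gain = steps.clip(min=0).sum()   # total genuine improvement, in the spirit of the sum of deltas
scalar_change = state.mean() - 0.5               # what the averaged "health score" reports

print(dimension_level_gain)  # 1.2 of real, dimension-level improvement
print(scalar_change)         # 0.2 - most of the improvement signal never reaches the scalar
```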

This isn't a theoretical curiosity. It's why averaging your DORA metrics gives you a number that looks fine while your team is actually struggling.

The Pattern Extends Everywhere

Once you see this pattern, you see it everywhere:

Healthcare: Patient satisfaction scores can be high while clinical outcomes are poor, or vice versa. You need both metrics independently.

Climate science: Global average temperature is crucial, but hides regional extremes. A world 2°C warmer on average contains regions that are 5°C warmer - catastrophically different.

Education: Standardized test scores miss creativity, critical thinking, collaboration skills. Schools that optimize for the scalar metric produce students weak in unmeasured dimensions.

Machine learning: Model accuracy can be high while fairness is low. Averaging them creates systems that appear fine while perpetuating discrimination.

Economic growth: GDP per capita hides inequality. A nation can have high average wealth while the median citizen struggles. The scalar masks the distribution.

In each case, the single number looks informative. But it hides trade-offs that matter.

When Scalars Work (And When They Don't)

This doesn't mean all metrics are useless. Single numbers work when:

  1. The system truly has one dominant dimension: If you're only measuring widget production rate, a single number is fine
  2. Dimensions are perfectly correlated: If improving A always improves B, you can track just one
  3. You're measuring within a narrow, well-defined context: Total revenue for a specific product line
  4. Short-term tactics, not long-term strategy: Quick health checks can use scalars, but deeper analysis needs vectors

Scalars fail when:

  1. Dimensions are independent or inversely correlated: DORA metrics, SPACE dimensions
  2. Trade-offs exist: Speed vs. quality, cost vs. performance
  3. Multiple stakeholders have different priorities: What's "good" depends on perspective
  4. Gaming incentives exist: Wells Fargo's cross-selling, test coverage optimization
  5. Long-term decisions need multiple criteria: Hiring, architecture, research direction

The rule of thumb: If experts in your field use multi-metric frameworks instead of single scores, there's probably a good mathematical reason.

The Pareto Frontier Alternative

So what do you do instead? Track the vector.

In the dashboard above, you can see multiple system states with the same scalar average but radically different characteristics. Some are fast but fragile. Others are slow but robust. Still others are balanced but mediocre.

These states lie on what mathematicians call a Pareto frontier. A state is Pareto-optimal if you can't improve one dimension without degrading another. These are the meaningful trade-offs.

When you look at the scalar average, these states appear identical. When you look at the vector, you see the actual choices:

  • State A: High performance, low safety (startup mode)
  • State B: High safety, low performance (enterprise mode)
  • State C: Balanced (sustainable growth)

Each is optimal for different contexts. The scalar can't tell you which one you have - or which one you need.
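
A Pareto-dominance check makes this concrete. The sketch below uses invented values for the three states (the dimension names are placeholders), all of which share the same average yet none of which dominates another:

```python
def dominates(a, b):
    """True if state a is at least as good on every dimension and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

# Hypothetical states (performance, safety, cost-efficiency, flexibility), higher is better.
states = {
    "A (startup mode)":    (0.9, 0.3, 0.6, 0.6),
    "B (enterprise mode)": (0.3, 0.9, 0.6, 0.6),
    "C (balanced)":        (0.6, 0.6, 0.6, 0.6),
}

# All three collapse to the same scalar...
for name, s in states.items():
    print(name, sum(s) / len(s))                 # 0.6 for every state

# ...yet none dominates another: they are distinct points on the Pareto frontier.
for name_a, a in states.items():
    for name_b, b in states.items():
        if name_a != name_b and dominates(a, b):
            print(name_a, "dominates", name_b)   # prints nothing
```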

What This Means For Your Dashboards

Practical implications:

1. Keep your vectors visible: Don't just report the average. Show all dimensions. Use radar charts, heatmaps, or parallel coordinates.

2. Define your trade-offs explicitly: When you can't improve everything, which dimensions matter most right now? Make that decision transparent.

3. Use scalars for red flags, not optimization: A single "health score" can trigger alerts ("something is wrong"), but investigating requires looking at the components (see the sketch after this list).

4. Resist the urge to simplify prematurely: Yes, stakeholders want one number. But giving them a misleading number is worse than giving them four accurate ones.

5. Watch for the watermelon effect: When your scalar looks good but people are unhappy, check the components. Something is probably deteriorating.

6. Change metrics as context changes: What you optimize for in a crisis differs from what you optimize for in stability. Vectors let you see this; scalars hide it.
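
As a sketch of points 1 and 3, the snippet below (invented values, a hypothetical alert threshold) raises a red flag from the worst component rather than optimizing the average:

```python
dimensions = {
    "deployment_frequency": 0.8,
    "lead_time": 0.8,
    "change_failure_rate": 0.2,   # quietly deteriorating
    "time_to_recovery": 0.8,
}

average = sum(dimensions.values()) / len(dimensions)              # 0.65 - looks fine
worst_name, worst_value = min(dimensions.items(), key=lambda kv: kv[1])

# Scalar only as a coarse red flag; the components tell you where to look.
if worst_value < 0.4:
    print(f"ALERT: {worst_name} = {worst_value} (average {average:.2f} is hiding it)")
```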

The Path Forward

The mathematical proof is unambiguous: single scalars cannot capture multi-dimensional complexity without losing critical information. This isn't a limitation of current metrics or measurement techniques. It's a fundamental impossibility theorem, like trying to trisect an angle with compass and straightedge.

But unlike geometric impossibilities, this one has real-world consequences. Organizations make million-dollar decisions based on metrics. People's careers hinge on performance scores. Public policy depends on economic indicators.

When we compress complexity into scalars, we don't just lose nuance. We create blind spots. We enable gaming. We hide trade-offs that matter.

The good news? We have alternatives. Vector dashboards. Multi-criteria decision frameworks. Pareto analysis. These tools let us see the actual state space of our systems.

The interactive dashboard above isn't just a demonstration. It's a microcosm of every complex system you measure. The four dimensions could be customer satisfaction, employee engagement, profitability, and sustainability. They could be speed, quality, cost, and flexibility. They could be any set of independent criteria that matter.

The mathematics is the same. The trade-offs are real. And the single number, no matter how carefully constructed, cannot tell you what you need to know.

So the next time someone asks you to "boil it down to one number," you can show them this proof - and explain why the question itself is fundamentally unanswerable.


Further Reading

Primary Research:

  • Sudoma, O. (2025). "Scalar Impossibility in Multi-Pillar Complexity Measures." Available on Zenodo: 10.5281/zenodo.17562623

Industry Frameworks:

  • DORA / Accelerate State of DevOps research (Forsgren, Humble, Kim; 2014-present)
  • "The SPACE of Developer Productivity" (Forsgren et al., 2021)
  • The Flow Framework, introduced in Project to Product (Kersten, 2018)
Mathematical Background:

  • Pareto Efficiency and Multi-Objective Optimization
  • Information Theory and Dimensionality Reduction
  • Arrow's Impossibility Theorem (related impossibility in social choice)

About This Research

This blog post is based on formal mathematical research establishing the impossibility of reducing multi-dimensional complexity to scalar metrics without information loss. The work formalizes what industry practitioners have learned empirically: you need vectors, not averages, to understand complex systems.

The interactive dashboard demonstrates the core theorem with real-world scenarios from software engineering, research management, and machine learning. All mathematical claims are proven rigorously in the linked paper.

Interactive implementation: The dashboard uses Plotly.js for visualization and vanilla JavaScript for state management. Source code available on GitHub. Fully accessible (WCAG AA compliant) and mobile-responsive.

About the author: Oksana Sudoma is a researcher exploring mathematical foundations of complexity measurement, with applications to software systems, physics, and data science. This work bridges theoretical computer science, information theory, and practical engineering.


Oksana Sudoma

Independent Researcher
