February 9, 2026

Build Log #4: Teaching an ML Model What 'Normal' Looks Like in a Data Center

PowerPoll Team

Your CRAC was telling you it was failing for three weeks. Nobody was listening because the number on the dashboard was still green. We built a system that listens.

The Slow Kill

Here's a scenario that every DC ops person will recognize, because every DC ops person has lived it at least once.

Your CRAC return air temp sits at 72°F. That's normal. That's been normal for months. Your monitoring dashboard shows a nice green indicator. Everything's fine.

Week one: 72.3°F. Still green. Still "normal."

Week two: 73.1°F. Green. Nobody looks at it.

Week three: 74.2°F. Still technically below the 78°F warning threshold. Dashboard is green. The NOC tech glances at it during his shift and sees green. Moves on.

Then it's the hottest day of July. Outside air temp hits 108°F. Your rooftop condensers are working harder than they've worked all year. That CRAC — the one that's been slowly degrading for three weeks because a compressor is losing refrigerant charge — finally can't keep up. Return air temp shoots through 78°F, through 80°F, through 82°F. Your monitoring alerts fire. Your phone rings. It's three customers calling at once because their servers are thermal throttling and their applications are crawling.

The CRAC was telling you it was failing. For three weeks. Nobody heard it because the alert threshold was set at 78°F and the number was still "green." The failure wasn't the CRAC — it was the monitoring philosophy.

Why Threshold-Based Monitoring Is Necessary But Not Sufficient

Let's be clear: we're not saying threshold alerts are useless. You absolutely need "alert when temp > 80°F" or "alert when PDU load > 80% of breaker rating." Those are safety nets. They catch catastrophes. They prevent fires, both literal and figurative.

But thresholds only catch the cliff. They don't see the slope. By the time you hit a threshold, you're already in trouble. The question is: can you see the trouble coming?

Traditional monitoring gives you two states: fine and not fine. Green and red. What you actually need is a third state: "technically fine but drifting in a direction that isn't." That's the gap we built the ML pipeline to fill.

What We Built: z-Scores, IQR, and 28 Million Data Points

Here's where we might surprise you: the math isn't complicated. We're not running deep neural networks or transformer models on your CRAC data. We're using z-score anomaly detection with an IQR (interquartile range) fallback. If you took a statistics class in college, you've seen this before. It's freshman-level math.

A z-score tells you how far a data point is from the mean, measured in standard deviations. If your CRAC normally runs at 72°F with a standard deviation of 0.5°F, and today's reading is 74°F, that's a z-score of 4.0 — meaning it's four standard deviations from normal. That's statistically unusual. Flag it.
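The arithmetic from that example fits in a few lines. A minimal sketch (the history values below are made up, and the 3σ cutoff is a common statistical convention, not necessarily the threshold PowerPoll ships with):

```python
import statistics

def z_score(value, history):
    """How many standard deviations `value` sits from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return (value - mean) / stdev

# CRAC return air temps hovering around 72°F (illustrative data)
history = [71.5, 72.0, 72.5, 71.8, 72.2, 72.0, 71.9, 72.1]
reading = 74.0

z = z_score(reading, history)
if abs(z) > 3.0:  # common "statistically unusual" cutoff
    print(f"anomaly: z = {z:.1f}")
```

The same reading that looks "green" against a fixed 78°F threshold lights up immediately against the device's own history.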

The IQR fallback handles the cases where z-scores get unreliable — when the data isn't normally distributed, which happens more often than you'd think in DC environments. Power consumption data, for example, often has bimodal distributions (daytime peaks, nighttime valleys). IQR doesn't assume normal distribution, so it catches anomalies that z-scores might miss in skewed data.
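For comparison, here is a Tukey-fence version of the same idea on an illustrative bimodal power series. The quartile-by-index shortcut and the 1.5× multiplier are textbook defaults, not PowerPoll internals:

```python
def iqr_bounds(samples, k=1.5):
    """Tukey fences: flag points outside [Q1 - k*IQR, Q3 + k*IQR].
    No normality assumption, so it stays usable on bimodal power data
    (daytime peaks, nighttime valleys)."""
    s = sorted(samples)
    n = len(s)
    q1 = s[n // 4]          # crude quartiles; fine for a sketch
    q3 = s[(3 * n) // 4]
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# bimodal kW readings: ~40 kW at night, ~90 kW during the day (made up)
kw = [38, 40, 41, 39, 42, 88, 90, 91, 89, 92, 90, 87]
lo, hi = iqr_bounds(kw)

# neither mode is flagged, but a genuine spike is
anomalies = [x for x in kw + [200] if x < lo or x > hi]
```

Note that the fences comfortably contain both the day and night clusters, which is exactly why IQR tolerates the bimodal shapes that break a naive z-score.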

The model trains on your facility's data. Not a generic baseline. Not "industry average PUE." Your data, your normal. If your PUE consistently runs at 1.4, that's your baseline. If it drifts to 1.5, that's a 7% efficiency drop and we flag it — even though 1.5 is technically a great PUE by industry standards. We don't care about industry standards. We care about your standards.
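The baseline-drift check itself reduces to one division. A sketch (the 5% alert threshold here is hypothetical):

```python
def relative_drift(current, baseline):
    """Fractional change from the facility's own learned baseline."""
    return (current - baseline) / baseline

baseline_pue = 1.4          # this facility's normal, not an industry average
drift = relative_drift(1.5, baseline_pue)

if drift > 0.05:            # hypothetical drift threshold
    print(f"efficiency drop vs. your baseline: {drift:.1%}")
```

A 1.5 PUE trips this check even though it would look excellent on any industry benchmark, which is the whole point of facility-specific baselines.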

Training requires about 14 days of data to establish a reliable baseline. We've run the model against datasets as large as 28 million data points (a 200+ device facility with 6 months of history at 5-minute polling intervals). Training on that volume takes about 40 seconds. Scoring new data against the model is effectively real-time.

The Correlation Engine: Where It Gets Interesting

Anomaly detection on individual metrics is useful but limited. A CRAC temp anomaly is one data point. Is it a failing compressor? A clogged filter? A sudden increase in IT load in that row? A hot day? You don't know from a single metric.

That's where the correlation engine comes in. We cross-correlate power, cooling, and environmental data with configurable lag windows, and this is where the system starts telling you things that no human operator could realistically track.

In real deployments, these cross-domain correlations surface relationships that no human operator could realistically track across 200+ metrics simultaneously. Not because operators aren't smart enough, but because there aren't enough hours in the day to stare at 200 trend lines and notice which ones are moving together. The math does it in seconds.
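One way to sketch the lag-window idea: slide one series against the other and look for the lag where the Pearson correlation peaks. Everything here (the synthetic data, the brute-force search, the 6-sample window) is illustrative, not the production engine:

```python
import numpy as np

def lagged_corr(x, y, max_lag):
    """Correlate y against x shifted by each lag (in samples).
    A peak at lag k suggests changes in x lead changes in y by k
    polling intervals."""
    best_lag, best_r = 0, 0.0
    for lag in range(max_lag + 1):
        if lag == 0:
            r = np.corrcoef(x, y)[0, 1]
        else:
            r = np.corrcoef(x[:-lag], y[lag:])[0, 1]
        if abs(r) > abs(best_r):
            best_lag, best_r = lag, r
    return best_lag, best_r

# synthetic 5-minute samples: rack power ramps up, and CRAC return air
# temperature follows three intervals (15 minutes) later
rng = np.random.default_rng(0)
power = np.linspace(40, 60, 100) + rng.normal(0, 0.5, 100)
temp = 65 + 0.1 * np.roll(power, 3)   # temp tracks power, delayed 3 samples

lag, r = lagged_corr(power, temp, max_lag=6)
print(f"best lag: {lag} samples ({lag * 5} min), r = {r:.3f}")
```

The search recovers the 15-minute delay from the data alone; with real feeds the same peak tells you which upstream change is driving a downstream symptom.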

The Efficiency Bounty Report

This is our favorite feature, and it came out of a simple realization: ops teams and finance teams speak different languages.

When you tell an ops team "CRAC-3 return air temp anomaly detected, z-score 3.2," they know what to do. Check the compressor. Check the filters. Check the condenser coils. They're already walking to the mechanical room.

When you tell a CFO "CRAC-3 return air temp anomaly detected, z-score 3.2," they stare at you blankly. What does that mean? Is it urgent? How much does it cost?

The Efficiency Bounty report translates anomalies into dollars. It aggregates everything the ML pipeline finds — anomalous CRAC behavior, ghost devices from discovery scans, cooling units serving empty rows, UPSes on bypass drawing unnecessary power, PDU circuits that are energized but unloaded — and converts it into a single number.

"Your facility is wasting approximately $4,800/month on ghost servers, cooling inefficiencies, and unbilled power overages. Here's the itemized list."

That number gets the CFO's attention. That number gets budget approved. "CRAC-3 anomaly" doesn't survive the first slide of the budget meeting. "$57,600 in annual waste" gets you a maintenance contract and a cooling audit before the meeting ends.
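The aggregation itself is just a sum over line items. A toy version (these dollar figures are invented to match the example totals above; the real report derives them from measured kWh and billing data):

```python
# hypothetical monthly waste line items surfaced by the anomaly pipeline
findings = {
    "ghost servers (powered, no workload)": 1_900,
    "cooling serving empty rows":           1_200,
    "UPS on bypass drawing idle load":        850,
    "unbilled customer power overages":       850,
}

monthly = sum(findings.values())
print(f"Estimated waste: ${monthly:,}/month (${monthly * 12:,}/year)")
for item, usd in sorted(findings.items(), key=lambda kv: -kv[1]):
    print(f"  {item}: ${usd:,}/month")
```

The itemized list matters as much as the headline number: each line maps back to a specific device or circuit, so the ops team knows exactly where to start.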

We break the report into categories mirroring those findings: ghost devices, cooling inefficiencies, power distribution waste, and unbilled overages.

The Hard Part (It's Not the Math)

Here's what took us the longest, and it's the thing that every "ML-powered" product conveniently glosses over in their marketing: the data pipeline is 90% of the work. The math is 10%.

z-scores and IQR are straightforward. Any data scientist can implement them in an afternoon. The hard part is getting clean, normalized, time-aligned data from a dozen different device types with different polling intervals, different units, different data formats, and different failure modes.
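To make the alignment problem concrete, here's a small sketch, assuming pandas, of normalizing units and resampling two feeds with different polling intervals onto a common 5-minute grid (device names and readings are made up):

```python
import pandas as pd

# hypothetical raw feeds: a PDU polled every minute (watts) and a CRAC
# polled every five minutes (°F) -- different clocks, different units
idx1 = pd.date_range("2026-02-09 00:00", periods=10, freq="1min")
pdu_w = pd.Series(
    [4800, 4810, 4790, 4805, 4820, 4815, 4800, 4795, 4810, 4805],
    index=idx1,
)

idx5 = pd.date_range("2026-02-09 00:00", periods=2, freq="5min")
crac_f = pd.Series([72.0, 72.4], index=idx5)

# normalize units (W -> kW, °F -> °C) and align onto a 5-minute grid
aligned = pd.DataFrame({
    "pdu_kw": (pdu_w / 1000).resample("5min").mean(),
    "crac_c": ((crac_f - 32) * 5 / 9).resample("5min").ffill(),
})
```

Two devices, and it already takes deliberate choices (averaging vs. forward-filling, unit conversion before or after resampling). Multiply by a dozen device types and you see where the 90% goes.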

The specifics are mundane: polling gaps, counter resets, unit inconsistencies, and devices that report the same metric in different formats at different intervals.

None of this is glamorous. None of it makes for good marketing copy. But it's the difference between an ML system that works in a demo and one that works in production, month after month, on messy real-world data from equipment that was installed before the iPhone existed.

What We Learned

Three lessons from building this pipeline:

First: simple models beat complex models when your data is noisy. We tried LSTM networks early on. They worked great on clean test data and terribly on production data with gaps, counter resets, and unit inconsistencies. z-scores and IQR are robust to noise in a way that deep learning models aren't, unless you invest heavily in data preprocessing — at which point the preprocessing is doing the heavy lifting, not the model.

Second: facility-specific baselines are non-negotiable. We briefly considered building a "universal" baseline using aggregate data from multiple facilities. Terrible idea. Every facility has different equipment, different cooling architectures, different climates, different customer mixes. What's anomalous in a Phoenix facility (PUE spike in summer) is normal operation. What's normal in a Minneapolis facility (CRAC compressor off for 4 months because the economizer handles the load) would be a critical alarm in Miami. The model has to learn your facility, or it's useless.

Third: the output has to be money, not metrics. We've watched enough dashboards get ignored to know that ops teams don't need more charts. They need actionable signals with business context. "CRAC-3 anomaly" gets noted and forgotten. "$1,200/month in cooling waste, here are the three units to check" gets fixed. Every DCIM shows you pretty charts. We wanted to show you money — money you're wasting, money you're not billing, money you're leaving on the floor. That's what the ML is for.

Want to see what we've built?

PowerPoll is live and monitoring real facilities. Take a look.

→ powerpoll.ai/dashboard