Build Log #4: Teaching an ML Model What 'Normal' Looks Like in a Data Center
Your CRAC was telling you it was failing for three weeks. Nobody was listening because the number on the dashboard was still green. We built a system that listens.
The Slow Kill
Here's a scenario that every DC ops person will recognize, because every DC ops person has lived it at least once.
Your CRAC return air temp sits at 72°F. That's normal. That's been normal for months. Your monitoring dashboard shows a nice green indicator. Everything's fine.
Week one: 72.3°F. Still green. Still "normal."
Week two: 73.1°F. Green. Nobody looks at it.
Week three: 74.2°F. Still technically below the 78°F warning threshold. Dashboard is green. The NOC tech glances at it during his shift and sees green. Moves on.
Then it's the hottest day of July. Outside air temp hits 108°F. Your rooftop condensers are working harder than they've worked all year. That CRAC — the one that's been slowly degrading for three weeks because a compressor is losing refrigerant charge — finally can't keep up. Return air temp shoots through 78°F, through 80°F, through 82°F. Your monitoring alerts fire. Your phone rings. It's three customers calling at once because their servers are thermal throttling and their applications are crawling.
The CRAC was telling you it was failing. For three weeks. Nobody heard it because the alert threshold was set at 78°F and the number was still "green." The failure wasn't the CRAC — it was the monitoring philosophy.
Why Threshold-Based Monitoring Is Necessary But Not Sufficient
Let's be clear: we're not saying threshold alerts are useless. You absolutely need "alert when temp > 80°F" or "alert when PDU load > 80% of breaker rating." Those are safety nets. They catch catastrophes. They prevent fires, both literal and figurative.
But thresholds only catch the cliff. They don't see the slope. By the time you hit a threshold, you're already in trouble. The question is: can you see the trouble coming?
Traditional monitoring gives you two states: fine and not fine. Green and red. What you actually need is a third state: "technically fine but drifting in a direction that isn't." That's the gap we built the ML pipeline to fill.
What We Built: z-Scores, IQR, and 28 Million Data Points
Here's where we might surprise you: the math isn't complicated. We're not running deep neural networks or transformer models on your CRAC data. We're using z-score anomaly detection with an IQR (interquartile range) fallback. If you took a statistics class in college, you've seen this before. It's freshman-level math.
A z-score tells you how far a data point is from the mean, measured in standard deviations. If your CRAC normally runs at 72°F with a standard deviation of 0.5°F, and today's reading is 74°F, that's a z-score of 4.0 — meaning it's four standard deviations from normal. That's statistically unusual. Flag it.
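That calculation really is a few lines. Here's a minimal sketch — the five-sample baseline is a toy constructed to have exactly a 0.5°F standard deviation, not real polling data:

```python
import statistics

def zscore(baseline, reading):
    """Score a new reading against a facility-specific baseline.

    `baseline` is the facility's own history (a list of floats).
    The window length here is illustrative, not a product setting.
    """
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return (reading - mean) / stdev

# Toy CRAC history: mean 72.0°F, sample stdev exactly 0.5°F
baseline = [72.5, 71.5, 72.5, 71.5, 72.0]
print(zscore(baseline, 74.0))  # → 4.0, four standard deviations from normal
```

In production the baseline is a rolling window of your facility's readings, so "normal" tracks seasonal drift instead of a number someone typed into a config file.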
The IQR fallback handles the cases where z-scores get unreliable — when the data isn't normally distributed, which happens more often than you'd think in DC environments. Power consumption data, for example, often has bimodal distributions (daytime peaks, nighttime valleys). IQR doesn't assume normal distribution, so it catches anomalies that z-scores might miss in skewed data.
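The fallback is equally unglamorous: textbook Tukey fences. A sketch, using the standard 1.5× multiplier (the common default, not necessarily what we ship) — note how the spike inflates the standard deviation enough to hide from the z-score, while the quartiles barely move:

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Tukey fences: flag anything outside [Q1 - k*IQR, Q3 + k*IQR].

    Quartiles don't assume a normal distribution, so skew and
    outliers don't corrupt the fences the way they corrupt a stdev.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Skewed PDU power data (kW): the 7.5 kW spike drags the stdev up so far
# that its own z-score stays under 3 — but it lands well outside the fences.
power_kw = [3.0, 3.1, 3.0, 3.2, 2.9, 3.1, 3.0, 3.3, 3.1, 7.5]
lo, hi = iqr_bounds(power_kw)
anomalies = [x for x in power_kw if x < lo or x > hi]  # → [7.5]
```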
The model trains on your facility's data. Not a generic baseline. Not "industry average PUE." Your data, your normal. If your PUE consistently runs at 1.4, that's your baseline. If it drifts to 1.5, that's a 7% efficiency drop and we flag it — even though 1.5 is technically a great PUE by industry standards. We don't care about industry standards. We care about your standards.
Training requires about 14 days of data to establish a reliable baseline. We've run the model against datasets as large as 28 million data points (a 200+ device facility with 6 months of history at 5-minute polling intervals). Training on that volume takes about 40 seconds. Scoring new data against the model is effectively real-time.
The Correlation Engine: Where It Gets Interesting
Anomaly detection on individual metrics is useful but limited. A CRAC temp anomaly is one data point. Is it a failing compressor? A clogged filter? A sudden increase in IT load in that row? A hot day? You don't know from a single metric.
That's where the correlation engine comes in. We cross-correlate power, cooling, and environmental data with configurable lag windows, and this is where the system starts telling you things that no human operator could realistically track.
Some examples from real deployments:
- CRAC compressor cycling anomaly + PDU load increase on the same row, 15 minutes earlier. Translation: a customer deployed new equipment, the thermal load increased, and the CRAC is working harder. Not a failure — but if the CRAC was already near capacity, this might push it over on a hot day. Actionable intelligence: check if the new deployment needs a cooling adjustment.
- Utility power fluctuation + UPS battery discharge, 30 seconds later. Translation: your utility feed sagged and the UPS covered it. If this is happening multiple times per week, your utility feed is unstable and you need to talk to the power company before a full outage. Most operators don't notice these micro-events because the UPS handles them silently.
- Gradual PUE increase + no change in IT load. Translation: your cooling efficiency is degrading. Could be dirty coils on a condenser, a CRAC that's lost refrigerant, or a failed economizer damper. The IT load didn't change, so it's not a demand problem — it's an infrastructure problem. Catch it now, fix it cheap. Catch it in three months, fix it expensive.
- PDU branch circuit imbalance drifting over time. Translation: as customers add and remove equipment, the load across your A and B feeds is becoming unbalanced. You're getting closer to tripping a breaker on the heavy side while the light side has headroom. Rebalance before something trips at 2 AM.
These cross-domain correlations are something no human operator can track across 200+ metrics simultaneously. Not because they're not smart enough — because there aren't enough hours in the day to stare at 200 trend lines and notice which ones are moving together. The math does it in seconds.
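The core mechanic is just correlation computed at a range of time offsets. A toy illustration — the real engine detrends, normalizes, and works over configurable windows, but the shape of the idea fits in one function:

```python
import statistics

def lagged_correlation(a, b, max_lag):
    """Find the lag (in samples) at which series `a` best predicts `b`.

    A positive lag means `a` leads `b` — e.g. a PDU load rise that
    shows up in CRAC behavior some minutes later. Illustrative only.
    """
    def pearson(x, y):
        mx, my = statistics.mean(x), statistics.mean(y)
        num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        den = (sum((xi - mx) ** 2 for xi in x)
               * sum((yi - my) ** 2 for yi in y)) ** 0.5
        return num / den if den else 0.0

    return max(
        ((lag, pearson(a[:len(a) - lag], b[lag:])) for lag in range(max_lag + 1)),
        key=lambda t: abs(t[1]),
    )

# PDU load steps up at sample 2; CRAC duty cycle follows 3 samples later
pdu  = [4, 4, 8, 8, 8, 8, 8, 8, 8, 8]
crac = [40, 40, 40, 40, 40, 70, 70, 70, 70, 70]
lag, r = lagged_correlation(pdu, crac, max_lag=5)  # → lag 3, correlation 1.0
```

Run that across every pair of metrics, keep the strong correlations, and you get the "new deployment at 2:15, CRAC working harder at 2:30" stories above without anyone staring at trend lines.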
The Efficiency Bounty Report
This is our favorite feature, and it came out of a simple realization: ops teams and finance teams speak different languages.
When you tell an ops team "CRAC-3 return air temp anomaly detected, z-score 3.2," they know what to do. Check the compressor. Check the filters. Check the condenser coils. They're already walking to the mechanical room.
When you tell a CFO "CRAC-3 return air temp anomaly detected, z-score 3.2," they stare at you blankly. What does that mean? Is it urgent? How much does it cost?
The Efficiency Bounty report translates anomalies into dollars. It aggregates everything the ML pipeline finds — anomalous CRAC behavior, ghost devices from discovery scans, cooling units serving empty rows, UPSes on bypass drawing unnecessary power, PDU circuits that are energized but unloaded — and converts it into a single number.
"Your facility is wasting approximately $4,800/month on ghost servers, cooling inefficiencies, and unbilled power overages. Here's the itemized list."
That number gets the CFO's attention. That number gets budget approved. "CRAC-3 anomaly" doesn't survive the first slide of the budget meeting. "$57,600 in annual waste" gets you a maintenance contract and a cooling audit before the meeting ends.
We break the report into categories:
- Ghost devices: Equipment drawing power but serving no useful purpose. Dollar value based on metered utility cost.
- Cooling waste: CRACs overcooling empty or underutilized rows. Dollar value based on estimated excess compressor runtime at local kWh rates.
- Unbilled overages: Customers exceeding committed power that isn't being captured in billing. Dollar value based on contract overage rates.
- Predictive maintenance: Equipment showing degradation patterns that will likely require repair. Dollar value based on estimated emergency repair cost vs. scheduled maintenance cost. Fixing a compressor on a planned maintenance window: $2,000. Emergency CRAC replacement on a Saturday night: $15,000 plus the customer credits.
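Structurally, the report is a simple roll-up: itemized findings, each tagged with a category and an estimated monthly dollar figure, aggregated into the one number the CFO reads. A minimal sketch — field names, category tags, and figures are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    category: str      # e.g. "ghost_device", "cooling_waste", "unbilled_overage"
    detail: str
    monthly_usd: float

def bounty_report(findings):
    """Roll itemized findings up into per-category and total dollars."""
    by_category = {}
    for f in findings:
        by_category[f.category] = by_category.get(f.category, 0.0) + f.monthly_usd
    monthly = sum(by_category.values())
    return by_category, monthly, monthly * 12  # itemized, monthly, annualized

findings = [
    Finding("ghost_device", "Decommissioned chassis still drawing power", 350.0),
    Finding("cooling_waste", "CRAC-7 overcooling an underutilized row", 1200.0),
    Finding("unbilled_overage", "Cage 14 exceeding committed power", 900.0),
]
by_cat, monthly, annual = bounty_report(findings)  # → $2,450/mo, $29,400/yr
```

The hard part isn't the arithmetic — it's the per-category dollar estimates feeding into `monthly_usd`, which come from metered utility cost, local kWh rates, and contract overage rates as described above.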
The Hard Part (It's Not the Math)
Here's what took us the longest, and it's the thing that every "ML-powered" product conveniently glosses over in their marketing: the data pipeline is 90% of the work. The math is 10%.
z-scores and IQR are straightforward. Any data scientist can implement them in an afternoon. The hard part is getting clean, normalized, time-aligned data from a dozen different device types with different polling intervals, different units, different data formats, and different failure modes.
Some specifics:
- Polling intervals: APC PDUs report every 5 minutes. Your Liebert CRAC might report every 60 seconds. The environmental sensors update every 30 seconds. Before you can correlate these, you need to normalize them to a common time base. We resample everything to 5-minute windows with configurable aggregation (mean, max, min, or last-value depending on the metric type).
- Units: Some PDUs report power in watts. Some in kilowatts. Some in amps (which you need to multiply by voltage to get watts, and the voltage reading might be on a different OID). Some CRACs report temperature in Fahrenheit. Some in Celsius. Some in tenths of a degree (so "722" means 72.2°F). We built a unit normalization layer that handles all of this, and we still find edge cases.
- Counter resets: Energy meters (kWh counters) occasionally reset to zero — when a PDU reboots, when firmware updates, when the counter rolls over at 32-bit max. A naive pipeline sees a drop from 45,000 kWh to 0 kWh and interprets it as negative consumption, which cascades into anomaly alerts. We detect counter resets and handle them gracefully.
- Missing data: SNMP polls time out. Devices go offline for maintenance. Network blips cause gaps. If you don't handle missing data explicitly, your z-score baseline gets corrupted by nulls and your anomaly detection becomes unreliable. We use forward-fill for short gaps (< 15 minutes) and exclude longer gaps from baseline calculation.
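Three of those bullets fit in a few lines each. The sketch below is illustrative, not the production pipeline — the function names and the `"tenths_f"` unit tag are invented for the example, and the real code handles more edge cases:

```python
def normalize_temp(raw, unit):
    """Map vendor temperature encodings to °F (unit tags are illustrative)."""
    if unit == "tenths_f":
        return raw / 10.0          # "722" means 72.2°F
    if unit == "celsius":
        return raw * 9 / 5 + 32
    return float(raw)

def kwh_deltas(readings):
    """Turn a cumulative kWh counter into per-interval consumption,
    tolerating resets (reboot, firmware update, 32-bit rollover).

    A drop in the counter is treated as a reset: the delta for that
    interval is the counter's new value (energy since the reset),
    never a huge negative number.
    """
    return [cur - prev if cur >= prev else cur
            for prev, cur in zip(readings, readings[1:])]

def forward_fill(series, max_gap=3):
    """Fill gaps (runs of None) up to `max_gap` samples — 15 minutes at
    5-minute polling — with the last good value; longer gaps stay None
    so they can be excluded from baseline calculation."""
    out = list(series)
    i = 0
    while i < len(out):
        if out[i] is None and i > 0 and out[i - 1] is not None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1
            if j - i <= max_gap:               # short gap: carry forward
                out[i:j] = [out[i - 1]] * (j - i)
            i = j                              # long gap: leave the Nones
        else:
            i += 1
    return out

normalize_temp(722, "tenths_f")                    # → 72.2
kwh_deltas([44990.0, 44995.0, 45000.0, 2.0, 7.0])  # → [5.0, 5.0, 2.0, 5.0]
forward_fill([72.0, None, None, 73.0])             # → [72.0, 72.0, 72.0, 73.0]
```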
None of this is glamorous. None of it makes for good marketing copy. But it's the difference between an ML system that works in a demo and one that works in production, month after month, on messy real-world data from equipment that was installed before the iPhone existed.
What We Learned
Three lessons from building this pipeline:
First: simple models beat complex models when your data is noisy. We tried LSTM networks early on. They worked great on clean test data and terribly on production data with gaps, counter resets, and unit inconsistencies. z-scores and IQR are robust to noise in a way that deep learning models aren't, unless you invest heavily in data preprocessing — at which point the preprocessing is doing the heavy lifting, not the model.
Second: facility-specific baselines are non-negotiable. We briefly considered building a "universal" baseline using aggregate data from multiple facilities. Terrible idea. Every facility has different equipment, different cooling architectures, different climates, different customer mixes. A summer PUE spike that would look anomalous against a pooled baseline is normal operation in Phoenix. What's normal in a Minneapolis facility (a CRAC compressor off for four months because the economizer handles the load) would be a critical alarm in Miami. The model has to learn your facility, or it's useless.
Third: the output has to be money, not metrics. We've watched enough dashboards get ignored to know that ops teams don't need more charts. They need actionable signals with business context. "CRAC-3 anomaly" gets noted and forgotten. "$1,200/month in cooling waste, here are the three units to check" gets fixed. Every DCIM shows you pretty charts. We wanted to show you money — money you're wasting, money you're not billing, money you're leaving on the floor. That's what the ML is for.
Want to see what we've built?
PowerPoll is live and monitoring real facilities. Take a look.
→ powerpoll.ai/dashboard