Build Log #3: Auto-Discovering 41 Devices in 90 Seconds (The SNMP Journey)
Every data center is a zoo. We built a system that walks in, identifies every animal, and starts tracking them — in about the time it takes to make a cup of coffee.
The Zoo Problem
Walk into any mid-market colo that's been operating for more than five years. Tell us what you find. We already know.
You've got APC AP8841 Switched Rack PDUs from 2015 that still work great but run firmware from the Obama administration. A row of shiny new ServerTech PRO2 units that your biggest customer insisted on speccing into their buildout. A couple of Raritan PX3 units — wait, no, one's a PX3 and one's a CX2. Someone ordered the wrong model three years ago and nobody returned it because the project was already behind schedule.
Down the hall, there's a mechanical room with four Liebert DS CRACs. Two are from the original buildout. One was a warranty replacement that's a slightly different model. The fourth was bought used off a decommissioned facility in Phoenix because you needed cooling capacity faster than Vertiv's lead time.
And then there's the UPS closet. Nobody goes in there unless something beeps. There are two Eaton 9PXM units and one old Liebert NX that was supposed to be decommissioned during the last refresh but somehow still has active loads on it. The management card firmware is four major versions behind because the last person who knew the login credentials left the company.
This is every data center. Every single one. And this is the environment that your monitoring tool needs to make sense of on day one.
Why sysObjectID Is the Rosetta Stone
When you do an SNMP walk on any network-managed device, one of the first things you can query is 1.3.6.1.2.1.1.2.0 — the sysObjectID. This is the device's fingerprint. It tells you exactly what make and model is sitting at that IP address, unambiguously, every time.
Here's what some common sysObjectIDs look like:
| sysObjectID | Device |
|---|---|
| 1.3.6.1.4.1.318.1.3.4.5 | APC Switched Rack PDU |
| 1.3.6.1.4.1.318.1.3.4.6 | APC Metered Rack PDU |
| 1.3.6.1.4.1.318.1.3.27 | APC Smart-UPS |
| 1.3.6.1.4.1.476.1.42 | Liebert DS/CW CRAC |
| 1.3.6.1.4.1.534.1 | Eaton UPS |
| 1.3.6.1.4.1.13742.6 | Raritan PX3 PDU |
We built a profile library around these fingerprints. Each profile maps a sysObjectID to a device type and tells our system exactly which OIDs to query for the metrics that matter — power draw per outlet, total load, branch circuit amperage, temperature sensors, humidity sensors, inlet voltage, power factor, energy consumption.
Right now we have 12 device profiles covering approximately 90% of what you'll find in a mid-market colo. APC (switched, metered, and inline PDUs), ServerTech (PRO2 and CDU series), Raritan (PX2, PX3, CX2), Liebert (DS and CW CRACs), Eaton (9PX and 9PXM UPS), and a generic MIB-2 fallback for anything we haven't profiled yet.
The generic fallback is important. If we hit a device we don't have a specific profile for, we still grab standard MIB-2 data — sysDescr, sysName, sysUpTime, interface stats. You at least know the device exists and what it claims to be. We flag it as "unprofiled" and add it to our backlog. Every new facility we scan expands the profile library.
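The lookup logic is simple enough to sketch. This is a minimal illustration of the idea, not PowerPoll's actual library — the profile entries and metric OIDs here are placeholders (the MIB-2 OIDs are the standard ones):

```python
# A minimal sketch of sysObjectID-based profile matching with a
# generic MIB-2 fallback. Profile contents are illustrative only.

GENERIC_MIB2 = {
    "name": "Generic MIB-2 device",
    "metrics": {
        "sysDescr":  "1.3.6.1.2.1.1.1.0",
        "sysUpTime": "1.3.6.1.2.1.1.3.0",
        "sysName":   "1.3.6.1.2.1.1.5.0",
    },
}

PROFILES = {
    # sysObjectID prefix -> device profile (metric OIDs omitted here)
    "1.3.6.1.4.1.318.1.3.4.5": {"name": "APC Switched Rack PDU", "metrics": {}},
    "1.3.6.1.4.1.534.1":       {"name": "Eaton UPS",             "metrics": {}},
}

def match_profile(sys_object_id: str) -> dict:
    """Longest-prefix match; unknown devices fall back to generic MIB-2."""
    best_prefix = ""
    for prefix in PROFILES:
        if ((sys_object_id == prefix or sys_object_id.startswith(prefix + "."))
                and len(prefix) > len(best_prefix)):
            best_prefix = prefix
    return PROFILES[best_prefix] if best_prefix else GENERIC_MIB2
```

Longest-prefix matching matters because some vendors register a whole sub-tree (Eaton's 534.1 covers many UPS models) while others register per-model OIDs.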
SNMPv3 From Day One (The Security Argument We Almost Lost)
Full disclosure: we almost shipped with SNMPv2c support as the default. It's easier to implement, easier to configure, and — let's be real — it's what 70% of DC infrastructure still runs. Community strings are everywhere. "public" and "private" are basically the default passwords of the SNMP world.
Then one of us imagined the conversation with a prospect's CISO:
"So your monitoring tool sends credentials in plaintext across the management network?" Meeting over.
We built SNMPv3 with authPriv as the default. SHA authentication, AES-128 encryption. Yes, it's more configuration upfront — you need a username, auth password, and privacy password instead of just a community string. But this is your management network. These are the devices that control power to your customers' servers. If someone compromises SNMP access to your PDUs, they can turn off outlets remotely. That's not a theoretical risk. That's a "we read the CVE and lost sleep" risk.
We do support SNMPv2c for legacy devices that don't support v3 — and there are a lot of them out there. But we flag it in the UI with a yellow warning badge, and the discovery wizard defaults to v3 credentials first. We're not going to be the tool that a security audit calls out.
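The credential model falls out of that decision. Here's a rough sketch of the shape; the field names are assumptions for illustration, not PowerPoll's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SnmpCredential:
    """One credential set the discovery wizard can try against a host.

    Field names are illustrative assumptions, not PowerPoll's schema.
    """
    version: str = "v3"        # "v3" (authPriv default) or "v2c" (legacy)
    username: str = ""         # v3 only
    auth_password: str = ""    # v3: SHA authentication
    priv_password: str = ""    # v3: AES-128 privacy
    community: str = ""        # v2c only

    def needs_legacy_warning(self) -> bool:
        # v2c sends the community string in cleartext on the wire,
        # so any device reached this way gets the yellow badge in the UI
        return self.version == "v2c"
```

The wizard tries v3 credential sets first and only falls back to v2c sets, which is what puts the warning badge on legacy gear automatically.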
The Discovery Wizard: 90 Seconds to Everything
Here's the actual flow:
- Enter your SNMP credentials. v3 username + auth/priv passwords, or v2c community string for legacy gear. You can enter multiple credential sets — most facilities have at least two (one for PDUs, one for infrastructure).
- Specify your management subnet(s). Usually one or two /24s. The wizard does a quick ping sweep first to find live hosts, so we're not wasting time trying to SNMP query every IP in a /16.
- Hit scan. PowerPoll sends concurrent SNMP queries to every live host, grabs the sysObjectID, matches it against the profile library, and pulls the full metric set for every identified device.
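The scan step above is embarrassingly parallel, which is where the 90 seconds comes from. A minimal sketch of the fan-out, assuming a `query_sys_object_id(ip)` callable that wraps the actual SNMP GET (a real implementation would ping-sweep first and use an SNMP library):

```python
from concurrent.futures import ThreadPoolExecutor
from ipaddress import ip_network

def scan(subnet: str, query_sys_object_id, max_workers: int = 64) -> dict:
    """Concurrently query each host in `subnet` for its sysObjectID.

    `query_sys_object_id(ip)` is an assumed callable returning the
    sysObjectID string, or None for hosts that don't answer SNMP.
    Returns {ip: sysObjectID} for every responding device.
    """
    hosts = [str(ip) for ip in ip_network(subnet).hosts()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(query_sys_object_id, hosts)
    return {ip: oid for ip, oid in zip(hosts, results) if oid is not None}
```

Threads are fine here because the work is network-bound: with 64 in flight and a short SNMP timeout, a couple of /24s of live hosts resolves in well under two minutes.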
On the test facility that inspired this build log's title — a 180-rack colo with mixed APC, ServerTech, Liebert, and Eaton equipment — the scan found 41 devices in 87 seconds. Every PDU. Every UPS. Every CRAC. All identified by make and model, firmware version noted, available metrics listed, and ready for monitoring.
We had three "unknown" devices on that scan. Two were network switches with SNMP enabled (not our problem, but good to know they're there). One was an environmental monitoring unit from a vendor we'd never heard of. We profiled it, added it to the library, and it's now supported for every future scan. Every tenant makes the product smarter.
The Efficiency Bounty: Finding What Nobody Knew Was There
Here's what surprised us about running discovery scans on real facilities: the ghost devices.
Every scan we've run has turned up at least 2-3 devices that nobody on the ops team knew were active or had forgotten about. Things like:
- Ghost PDUs: Powered on, responding to SNMP, monitoring shows zero or near-zero load. These are PDUs connected to empty cabinets or decommissioned equipment that nobody unplugged. They're still drawing standby power and, if they're on metered circuits, you're paying the utility for them.
- UPSes on bypass: A UPS that got switched to bypass during a maintenance window and never got switched back. The equipment behind it thinks it's protected. It's not. The UPS is still drawing power to keep the batteries charged, but it's not actually providing any protection. One power event and everything behind that UPS goes down hard.
- CRACs cooling nobody: A CRAC unit running at full compressor capacity, return air temp showing 62°F because there's nobody in the row it's cooling. The row was emptied six months ago when a customer decommissioned. Nobody told facilities to adjust the cooling layout. That CRAC is burning electricity to cool empty space.
- Orphaned environmental sensors: Temperature and humidity sensors deployed for a customer buildout that's long gone. Still reporting data to nobody. Not a big power draw, but clutter in your monitoring landscape.
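The ghost-hunting rules above boil down to a few threshold checks over the discovered metrics. A sketch of the idea; the field names and thresholds are illustrative assumptions, not PowerPoll's tuned rules:

```python
def efficiency_bounty(devices: list) -> list:
    """Flag likely ghost devices in discovery output.

    `devices` is a list of dicts of discovered metrics; the keys and
    cutoffs here are illustrative guesses, not production values.
    """
    findings = []
    for d in devices:
        if d["type"] == "pdu" and d.get("load_amps", 0.0) < 0.5:
            findings.append((d["name"], "ghost PDU: near-zero load"))
        elif d["type"] == "ups" and d.get("on_bypass", False):
            findings.append((d["name"], "UPS on bypass: load is unprotected"))
        elif d["type"] == "crac" and d.get("return_air_f", 75.0) < 65.0:
            findings.append((d["name"], "CRAC overcooling: row may be empty"))
    return findings
```

Each finding is a candidate, not a verdict — a near-zero-load PDU might be a cold spare on purpose — so the list goes to a human, not an automated shutdown.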
We started calling this the "Efficiency Bounty" — the money you're wasting on infrastructure that's either doing nothing or doing something nobody asked for. At the 180-rack facility, the bounty was approximately $1,200/month in wasted utility costs. At a larger facility we scanned later, it was over $3,000/month.
That's not a bug in the discovery system. That's the whole damn point. You can't fix what you can't see, and nobody was looking at these devices because nobody knew they were there — or had forgotten they were there, which amounts to the same thing.
What's Next
The profile library is growing with every deployment. We're working on auto-profiling — when the discovery engine hits an unknown sysObjectID, it does a deep MIB walk to figure out what metrics the device supports, builds a tentative profile, and flags it for human review. The goal is to get to the point where you can scan a facility with hardware we've never seen before and still get 95%+ coverage on the first pass.
We're also building out the correlation between discovered devices and physical locations. Right now, discovery tells you what's on the network. The next step is mapping that to racks, rows, and rooms automatically based on naming conventions and network topology. Because if we can tell you "CRAC-3 is cooling Row D, which is 40% occupied, while CRAC-4 is cooling Row E, which is 95% occupied," that changes how you think about cooling allocation entirely.
Next build log: teaching an ML model what "normal" looks like in a data center — and why the math is the easy part.
Want to see what we've built?
PowerPoll is live and monitoring real facilities. Take a look.
→ powerpoll.ai/dashboard