
Twelve substations, one monitoring stack.

An industrial-automation case study from SSL Elektrik-Elektronik — what was instrumented, what was learned, and what was different from the aerospace work that followed.

Before the doctorate there were five years of building monitoring systems for oil-and-gas substations across south-east Türkiye. The lesson that survived the move into aerospace research is the one I want to write about: an instrumentation stack designed for "every fault you can think of" usually ends up missing the ones that actually happen.

01 The problem

SSL Elektrik-Elektronik served a single industrial customer base for most of its run: state and private operators in the upstream and midstream oil-and-gas sector. The job was condition monitoring on the high-voltage substations that feed everything else — the pumps, the compressors, the heat tracing on the gathering lines. A typical site held one outdoor switchyard, one indoor switchgear room, and somewhere between four and twelve transformers depending on the field's size and history.

The stated requirement, when we won the work, was straightforward: instrument the network. The unstated requirement, which is the one that matters, was: stop the surprises. The operators had not been seeing the kinds of failures their existing SCADA was designed to catch. They were seeing slow degradations — bushing leakage currents drifting up over months, partial discharge in cable terminations, transformer winding hot-spots an order of magnitude smaller than what the protection relays would trip on. The kinds of faults that, if you only have a "yes/no" trip threshold, look like a healthy network until they don't.

The temptation was to bolt on more sensors. The first vendor we replaced had attempted that — twenty-something channels per substation, an alarm wall that no one read, and a maintenance backlog that had grown rather than shrunk over three years. The instinct that produced the alarm wall is the same instinct that produces the over-instrumented aerospace stacks I would later study academically. The dimension that needs maximising is diagnostic value per channel, not channel count.

02 The method

What we built instead was a smaller, more carefully chosen network. The technical core was modest: we did not invent the channels we used, and we were not running multi-objective genetic algorithms; that would come a decade later. What we did do was three things that I would now describe in NDCI-shaped language but that, at the time, we just called common sense.

First, every channel had to earn its place against a named failure mode. We started from a list of seven degradation modes the operators agreed they cared about, scored across four severity tiers each. For each candidate sensor, we asked: which of those modes does this sensor distinguish? If the answer was "none uniquely" or "only ones already covered," the sensor did not go in. The reference suite the previous vendor had bid was about thirty channels per substation. The suite we shipped was eleven, and it covered six of the seven modes at the resolution the operators wanted. The seventh — early-stage core lamination shorts — we did not catch with vibration or temperature alone; that one went onto the next-revision wishlist.
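The admission rule above can be sketched as a small coverage check. This is a minimal illustration, not the process we ran at the time (that was a whiteboard exercise with the operators). The mode names, the sensor names and the sensor-to-mode mapping are hypothetical placeholders, and "covers" is a simplification of "distinguishes".

```python
from typing import Dict, List, Set


def select_channels(candidates: Dict[str, Set[str]], modes: Set[str]) -> List[str]:
    """Admit a sensor only if it covers a mode no already-admitted sensor covers."""
    chosen: List[str] = []
    covered: Set[str] = set()
    # Consider the sensors that cover the most modes first; ties break by name.
    for name in sorted(candidates, key=lambda n: (-len(candidates[n] & modes), n)):
        new_modes = (candidates[name] & modes) - covered
        if new_modes:  # "none" or "only ones already covered" means it stays out
            chosen.append(name)
            covered |= new_modes
    return chosen


def uncovered_modes(candidates: Dict[str, Set[str]], chosen: List[str], modes: Set[str]) -> Set[str]:
    """Modes that the chosen suite cannot see: the next-revision wishlist."""
    covered: Set[str] = set()
    for name in chosen:
        covered |= candidates[name]
    return modes - covered


if __name__ == "__main__":
    # Hypothetical modes and candidate sensors, for illustration only.
    modes = {"bushing_leakage", "partial_discharge", "winding_hotspot", "core_lamination_short"}
    candidates = {
        "bushing_current": {"bushing_leakage"},
        "uhf_pd_receiver": {"partial_discharge"},
        "fibre_winding_temp": {"winding_hotspot"},
        "oil_temp": {"winding_hotspot"},  # redundant here, so it is rejected
    }
    suite = select_channels(candidates, modes)
    print(suite)
    print(uncovered_modes(candidates, suite, modes))
```

The second function is the more useful one in practice: it makes the gap in coverage explicit instead of leaving it to be discovered in the field.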

Second, the alarm logic was not "any single channel breaches a threshold." It was a small set of signatures across pairs and triplets of channels that corresponded to a named mode. Bushing leakage current rising while ambient temperature was steady: one signature. Vibration spectrum shifting in the 100–300 Hz band on a transformer whose tap changer had not moved in 24 hours: another signature. Partial-discharge pulse rate above baseline, over a window long enough to ignore corona during rain: a third. The operator's view of the system was the list of named signatures, not the list of channels.[1]
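A minimal sketch of what a signature library of this shape might look like, assuming each evaluation window has already been summarised into a handful of named features. The feature names and every numeric threshold below are hypothetical; the shipped values were tuned by hand and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# One evaluation window, summarised per channel (hypothetical feature names).
Snapshot = Dict[str, float]


@dataclass(frozen=True)
class Signature:
    name: str                               # what the operator sees
    condition: Callable[[Snapshot], bool]   # joint condition over two or three channels


SIGNATURES: List[Signature] = [
    Signature(
        "bushing leakage drift",
        # Leakage current rising while ambient temperature is steady.
        lambda s: s["bushing_leakage_slope"] > 0.05 and abs(s["ambient_temp_change"]) < 2.0,
    ),
    Signature(
        "transformer vibration shift",
        # 100-300 Hz band shifting on a unit whose tap changer has been idle for 24 h.
        lambda s: s["vib_100_300hz_shift"] > 0.2 and s["hours_since_tap_change"] >= 24.0,
    ),
    Signature(
        "sustained partial discharge",
        # Pulse rate above baseline over a window long enough to outlast rain corona.
        lambda s: s["pd_rate_over_baseline"] > 1.5 and s["pd_window_hours"] >= 6.0,
    ),
]


def fired(snapshot: Snapshot, signatures: List[Signature] = SIGNATURES) -> List[str]:
    """Return the names of the signatures whose joint condition holds."""
    return [sig.name for sig in signatures if sig.condition(snapshot)]
```

The point of the structure is that `fired` returns names, not channel values: the operator's screen is the list this function produces.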

Third, every signature had a deliberate dead-band that suppressed it during known confounders — transformer no-load tests, scheduled tap-changer maintenance, any of the half-dozen normal operations that look like a fault to a naive threshold. The dead-band rules were small and explicit, and the operators could read and amend them. The previous vendor's system had had no such logic; that, more than the sensor count, was why the alarm wall had been ignored.
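The suppression rule can be sketched as a filter applied after the signatures have been evaluated. This is an illustration under assumptions: the `DeadBand` record, its fields and the signature names are hypothetical, and the real rules were kept in a form the operators could read and amend rather than in code like this.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, List


@dataclass(frozen=True)
class DeadBand:
    reason: str                 # the known confounder, in the operator's words
    start: datetime
    end: datetime
    suppresses: FrozenSet[str]  # names of the signatures this confounder mimics


def active_signatures(fired: List[str], when: datetime, dead_bands: List[DeadBand]) -> List[str]:
    """Drop any fired signature that a currently open dead-band suppresses."""
    suppressed = set()
    for band in dead_bands:
        if band.start <= when <= band.end:
            suppressed |= band.suppresses
    return [name for name in fired if name not in suppressed]


if __name__ == "__main__":
    # Hypothetical maintenance window and signature names.
    maintenance = DeadBand(
        reason="scheduled tap-changer maintenance",
        start=datetime(2016, 3, 1, 8, 0),
        end=datetime(2016, 3, 1, 16, 0),
        suppresses=frozenset({"transformer vibration shift"}),
    )
    raw = ["transformer vibration shift", "bushing leakage drift"]
    print(active_signatures(raw, datetime(2016, 3, 1, 10, 0), [maintenance]))
```

Each dead-band names the specific signatures it suppresses, so a maintenance window on the tap changer does not hide an unrelated bushing alarm.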

03 The deployment

We rolled the stack out across twelve substations on three contiguous fields over fourteen months. Each substation got the same eleven-channel suite, the same signature library, and the same SCADA integration — a single unified protocol the operators could learn once and apply everywhere. Standardisation across sites was, in retrospect, the single highest-leverage decision we made; the maintenance team that monitors twelve sites cannot keep twelve different conventions in their heads, and any failure of memory becomes a missed signature.

The hardware was deliberately commodity. We used the same three sensor families across every site (industrial-grade vibration accelerometers, fibre-optic temperature monitors, and a single bushing-current sensor per transformer), spec'd to a price point we could repeat, and refused to add specialised channels even when individual operators argued for them. Specialisation per site would have killed the unified stack.

[Figure 1 diagram: substation with the standardised eleven-channel suite. Elements shown: switchyard, switchgear, transformers T1 to T4. Legend: rust ring = transformer with sensor pair; 3 of 4 instrumented; T4 on standby duty, elided.]
Fig. 1 Stylised single-line for one of the twelve substations. The rust rings mark the transformers that carried the standard sensor pair; the standby unit (T4) was deliberately not instrumented because its duty cycle made the diagnostic signal unreadable. Illustrative diagram; site-specific topologies vary.

The result, after the third site went live, surprised us. The first signature firings were not the dramatic ones we had built the system to catch. They were the boring ones: oil temperatures drifting up by a few degrees over weeks on transformers that had been freshly serviced, vibration baseline shifts that turned out to be unbalanced loads from new wells coming online, partial-discharge events that traced back to corona around damaged insulators on the medium-voltage feeder rather than the transformer itself. The system was doing useful work, but it was finding things adjacent to what we had instrumented for, not the things themselves.

This is, I now think, the most underappreciated property of a well-designed condition-monitoring stack: the diagnostic value comes as much from the things the system tells you it cannot diagnose as from the things it can. A signature that consistently fires for a real cause but a wrong location forces the team to investigate; an investigation, conducted with the network's coverage made visible, almost always finds something worth knowing. The unified protocol made the investigation cheap, because the field engineer at site nine could read the signature exactly as it had been read at site three a year earlier.

The data work in the first six months was non-trivial. We did not have a labelled fault dataset; what we had was three months of continuous recording from three sites and an operator team willing to walk us through every alarm event with hindsight. The signature thresholds and dead-bands were tuned by hand against that record, in close conversation with the senior maintenance engineer at the first field. We resisted, deliberately, the temptation to fit the thresholds tightly to the available data: a tight fit on three months of one site is a recipe for false-positive surges when a new site comes online with a slightly different baseline. The thresholds we shipped were intentionally loose. They produced fewer alarms than a tight fit would have; they produced almost no false positives at site five and beyond.[4]
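The trade between sensitivity and specificity can be made concrete with a few lines of arithmetic. The scores, labels and both thresholds below are invented for illustration and are not project data; the sketch only shows how moving a threshold away from the recorded baseline exchanges missed events for false alarms.

```python
from typing import List, Tuple


def sensitivity_specificity(
    scores: List[float], is_fault: List[bool], threshold: float
) -> Tuple[float, float]:
    """Alarm when score >= threshold; return (sensitivity, specificity)."""
    tp = sum(s >= threshold and f for s, f in zip(scores, is_fault))
    fn = sum(s < threshold and f for s, f in zip(scores, is_fault))
    tn = sum(s < threshold and not f for s, f in zip(scores, is_fault))
    fp = sum(s >= threshold and not f for s, f in zip(scores, is_fault))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one fault and one non-fault event")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Invented review record: one score per alarm-review event, labelled with hindsight.
    scores = [0.2, 0.4, 0.5, 0.7, 0.9, 0.3, 0.6, 0.8]
    is_fault = [False, False, False, True, True, False, False, True]

    tight_fit_threshold = 0.45      # hugs the recorded data: catches everything, alarms often
    shipped_style_threshold = 0.75  # set with margin: fewer alarms, some events missed

    print(sensitivity_specificity(scores, is_fault, tight_fit_threshold))
    print(sensitivity_specificity(scores, is_fault, shipped_style_threshold))
```

On this toy record the tight threshold catches every fault but sends a crew out on two of every five healthy events; the looser one misses one fault in three and raises no false alarm.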

The commercial structure of the engagement is the part most condition-monitoring write-ups skip, and the part that most strongly determines whether the work survives. Each substation contract was structured as a fixed-price installation plus a five-year support agreement. The support agreement paid SSL to keep the signature library current — which meant we had a recurring obligation to actually look at the data, refine the dead-bands as conditions changed, and write up new signatures when new failure modes emerged on the fields. Without that recurring obligation, condition-monitoring stacks rot quickly: the alarm thresholds set at install year drift out of relevance with every change to the fields they supervise, and within three years the system either alarms constantly (and gets ignored) or alarms rarely (and is mistaken for working). The support agreement turned the system from a one-shot deliverable into a living artefact. I now consider the recurring-obligation structure to be the single most important commercial decision on a project of this kind, and I argue for it on every comparable engagement I am asked about today.

04 What was different from aerospace

The most obvious difference between substation work and the aerospace research that followed was the standardisation latitude. In oil-and-gas, the substations are similar enough that one stack fits twelve. In aerospace, every airframe is the asset, and the network is part of its certification basis; the latitude to standardise across platforms is much narrower. The optimisation framework I would later develop — MOSOF — is built for the aerospace case, where the network is bespoke per platform and the challenge is making the bespoke design defensible. The substation case did not need that machinery. It needed standardisation, signature logic, and the discipline to refuse to instrument what the algorithm could not act on.

The second difference was data volume. The substations produced perhaps thirty channels per site sampled at a few Hz; over twelve sites that's still small enough to inspect with eyes and a moderate database. Aerospace ECS data is two orders of magnitude richer. Different problems; different answers.
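A back-of-envelope check of the substation side of that comparison. The channel and site counts come from the text; the 2 Hz rate stands in for "a few Hz" and the 8 bytes per sample is an assumption, with timestamps and storage overhead ignored.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(channels_per_site: int, sites: int, sample_rate_hz: float) -> float:
    """Raw sample count produced per day across the whole network."""
    return channels_per_site * sites * sample_rate_hz * SECONDS_PER_DAY


def gigabytes_per_day(
    channels_per_site: int, sites: int, sample_rate_hz: float, bytes_per_sample: int = 8
) -> float:
    """Raw payload per day in GB (10**9 bytes), before compression or indexing."""
    return samples_per_day(channels_per_site, sites, sample_rate_hz) * bytes_per_sample / 1e9


if __name__ == "__main__":
    # 30 channels and 12 sites are from the text; 2 Hz and 8 bytes are assumptions.
    print(f"{samples_per_day(30, 12, 2.0):,.0f} samples/day")
    print(f"{gigabytes_per_day(30, 12, 2.0):.2f} GB/day")
```

Under those assumptions the whole network produces roughly half a gigabyte of raw samples a day, which is comfortably within reach of a single modest database.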

The third difference was the operator. A substation maintenance team in the field is not a laboratory full of researchers; the system has to be readable by people who are not optimisation specialists, and the operator's mental model is the one the system has to fit. Every signature we built had to be explainable in one paragraph to the engineer who would act on it. That is the design constraint that survives, and the one I have brought into the aerospace work most explicitly.

A note on what this case did not include

No machine learning was deployed. The signature logic was hand-tuned by domain experts, the dead-bands were rule-based, and there was no continuous-learning component. This was the right answer for a 30-channel, 12-site network; ML overhead would have made the system harder to operate without making the diagnostics meaningfully better. Both AcoustR and the aerospace work are different — there the channel count and the scale of the problem make ML defensible. Match the technique to the problem, not to the era.

05 What it cost to do

Per substation, the eleven-channel suite landed at roughly $14k installed, on hardware that has now been operating for about eight years on most of the original sites. The signature library plus the SCADA integration were a one-time investment of perhaps four engineer-months across the firm; rolling out additional sites after the first three was largely about installation and cabling.

The bigger cost was upstream — the discipline of refusing to add channels even when individual operators argued for them. That discipline was paid for in conversations, not invoices, and it is the cost most easily under-budgeted on programmes of this kind.

06 What it'd cost you

If you operate a network of broadly similar industrial assets — substations, pump stations, compressor halls — the playbook from this case generalises in three steps:

  1. Name the failure modes you actually care about, before specifying any sensors. Six to ten is plenty. If your list runs longer than that, the list itself is probably the problem.
  2. Pick the smallest sensor suite whose channel pairs and triplets distinguish those modes. The right size is determined by the modes; the wrong size is determined by the budget.
  3. Standardise it across every comparable site. Resist per-site exceptions. The maintenance team's cognitive load is the limiting reagent.

For an existing network, the rough order of work is two months to specify the modes and the suite, three months to roll out across the first sites, and a year of operating data before the signature library is well-tuned. Compute is negligible. The expensive part is the same as in aerospace — the framing, the discipline, and the willingness to say no to channels that would make the dashboard look more impressive without making any actual diagnosis better.

The case I have written here is past tense — SSL was sold in 2018, and the stack has been kept in service rather than evolved. If you would like the same thinking applied to a current network, the consulting page is the starting point. The work generalises better than it should; the failure modes change, the framing does not. The discipline is the part that travels — name the modes, pick the smallest sensor suite that distinguishes them, standardise across sites, structure the commercial relationship so the system stays alive — and the rest is implementation detail that any competent integrator can do once the framing is right. The framing, in my experience, is the part that is almost never right when I am called in to look at an existing programme.

Endnotes

  1. In the academic literature this is sometimes called feature-level fusion as opposed to data-level fusion. The practical justification is that operators read signatures and act on them; nobody acts on a raw temperature trace.
  2. The eleven channels: three vibration (transformer, pump, fan), four temperature (oil, ambient, two winding spots via fibre), one bushing leakage current per transformer, one partial-discharge UHF receiver in the switchgear room, two switchgear envelope sensors (humidity, SF6 density). Per substation; smaller sites used a subset.
  3. SSL Elektrik-Elektronik was founded in 2013 in Adıyaman, Türkiye, and operated until 2018. The substation contracts referenced here ran continuously across most of that period.
  4. In the academic language: the signature library was tuned for high specificity at the cost of some sensitivity. The operators thanked us; analysts looking at it on paper sometimes asked why the recall was not higher. The answer is that recall is the wrong metric when the cost of a false positive is a maintenance trip.