The Illusion of Control: Rethinking Risk in an Uncertain Maintenance World

Indicators behave within expected limits, processes unfold without friction, and the system feels readable, as if its logic is fully captured and brought under control. Nothing appears to be wrong—and that is precisely the problem. It is a comforting illusion, one engineering has cultivated with remarkable success.

Yet systems rarely betray themselves in obvious ways. Failures begin not with rupture, but with deviation: small, almost imperceptible shifts. A parameter drifts, a delay emerges, a dependency changes. Individually, these signals appear harmless. Together, they form patterns that are difficult to perceive, especially when observation tools continue to confirm what we expect to see. As long as their readings remain consistent, the system appears stable. But consistency is not truth: a system may remain coherent while gradually detaching from the reality it is meant to describe.

Maintenance has long been tasked with preserving that coherence, and for a time it succeeded. But today’s systems extend beyond what can be observed, entangled with volatile supply chains, energy systems, and evolving constraints. Control becomes less a property and more a temporary alignment between expectation and reality—one where risk does not emerge suddenly, but grows silently until it can no longer be ignored.

Perception remains stable while the system quietly diverges from reality.

Legacy of Criticality: Maintenance has reinforced the illusion of control. It evolved to bring structure to uncertainty, transforming what once seemed unpredictable into something that could be classified and managed. The concept of criticality became one of its most important foundations. At its core, criticality offered a simple idea: not all assets matter equally. Some failures are negligible, others disruptive or costly. By ranking these differences, maintenance moved from reacting to failures to anticipating them, focusing effort where it mattered most. Criticality became the bridge between technical analysis and practical action, allowing organizations to allocate resources in a way that felt both efficient and justified.

For a long time, this approach worked remarkably well. In stable environments, where dependencies were understood, failure probability and consequence could be assessed with confidence. The system behaved within known limits, and maintenance strategies reflected that stability. But this stability was an assumption.

As the environment shifts, the assumption weakens. Assets once considered secondary can become critical—not because they change, but because the system around them does. A component gains importance when supply chains tighten, or processes lose their tolerance to disruption. Criticality does not disappear, but it becomes less stable. It no longer reflects an intrinsic property of the asset, but a relationship between the asset and an evolving context. When that context is not considered, decisions rely on assumptions that may no longer hold.

The model defines importance—but assumes a stability that no longer exists.

The concept itself is not flawed. It remains essential. But its traditional use implies a system that changes slowly, where priorities can be defined and periodically reviewed. What we now face is something different—a system in motion, where relevance is shaped not only by the asset, but also by its connections and dependencies.

Risk Is Not What We Thought: For a long time, risk appeared to be a concept we had domesticated. It could be expressed, calculated, and compared—probability on one side, consequence on the other—yielding a measure that could be ranked and acted upon. There was a sense that uncertainty, once quantified, could be contained within rational decision-making.

To a certain extent, this was true—at least, within the world for which that logic was designed. But as systems become more complex and less stable, this formulation captures only part of what risk actually is. It describes the likelihood of an event and the scale of its impact, yet says little about the conditions that make that impact manageable or catastrophic.
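The classic formulation described above can be sketched in a few lines. This is a minimal illustration, assuming the common textbook form in which the score is the product of probability and consequence; the article does not state a formula, and the `FailureMode` type, the asset names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    asset: str
    probability: float  # assumed likelihood of failure per year, between 0 and 1
    consequence: float  # assumed impact, in arbitrary cost units


def classic_risk(mode: FailureMode) -> float:
    """Textbook score (an assumption here): likelihood multiplied by impact."""
    return mode.probability * mode.consequence


def rank_by_risk(modes: list[FailureMode]) -> list[str]:
    """Return asset names ordered from highest to lowest classic risk."""
    return [m.asset for m in sorted(modes, key=classic_risk, reverse=True)]


# Hypothetical example data, not taken from the article.
modes = [
    FailureMode("pump", probability=0.30, consequence=10.0),
    FailureMode("gearbox", probability=0.05, consequence=100.0),
    FailureMode("valve", probability=0.50, consequence=2.0),
]

print(rank_by_risk(modes))  # ['gearbox', 'pump', 'valve']
```

Nothing in this score refers to the state of the surrounding system: the ranking stays the same whether or not spares are on hand or downstream processes can tolerate an outage.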

In practice, the same failure can lead to very different outcomes. A component may fail in exactly the same way, yet its impact can range from negligible to severe, not because the failure has changed, but because the system that receives it has. What was once absorbed without difficulty may now propagate across tightly coupled processes. The event is the same; the system is not. Risk, therefore, is not simply a property of the asset or the failure mode. It emerges from the relationship between them and the environment in which they exist. It is shaped by interaction, dependency, and timing. It is fundamentally contextual. As in Solaris, the system cannot be understood in isolation from the conditions that shape it.

This does not invalidate the traditional definition, but it exposes its limits. Probability and consequence remain essential, but they are no longer sufficient. Another dimension drives risk: how exposed the system is, how capable it is of absorbing disruption, and how dependent it has become on elements beyond its control. In practice, this dimension is often sensed rather than formalized.

When a maintenance engineer prioritizes an asset not because it fails often, but because “if it fails now, we are in trouble,” what is being assessed is not just likelihood or impact, but vulnerability—the system’s ability to cope at that moment. The challenge is not to calculate risk more precisely, but to understand it more completely.

The System Fights Back: At this point, the system stops behaving as we expect it to. The complexity of industrial systems is both structural and relational. Technical components, human decisions, organizational processes, and external constraints form a network where causality is distributed rather than localized. Failures rarely stem from a single cause; they emerge from multiple conditions aligning in ways difficult to anticipate. This challenges the fundamental assumption in maintenance that understanding individual failure modes is enough to understand system behavior.

Risk is not defined by failure alone, but by the system that receives it.

Understanding individual failure modes is still necessary, but it is no longer sufficient. The system cannot be fully explained by its parts, because interactions between those parts generate behaviors that do not exist at the component level. The human is part of the network. The interpretations and actions of operators and decision-makers shape how failures unfold, sometimes stabilizing the system, sometimes contributing to its degradation. The boundary between human and technical elements is not fixed, but continuously negotiated.

Maintenance models still focus on what can be measured—failure rates, repair times, condition indicators—but critical dynamics emerge in layers that are harder to capture: timing, coordination, perception. This does not make control impossible, but it redefines it. Control cannot be achieved through reduction and classification. It requires understanding how the system behaves as a whole, how interactions evolve, and how risk is distributed across the network.

No longer confined to components or failure modes, risk is embedded in the system’s structure and dynamics. Managing it requires moving beyond isolated events and recognizing the system as an active, evolving entity—one that does not simply respond to interventions, but reshapes them.

Data Are Not the Answer: When complexity increases, the instinctive response is to gather more information. Systems that are harder to understand are observed more closely, measured more extensively, and monitored through an ever-growing network of sensors. Over time, this has led to an expansion in visibility: patterns can be detected earlier, degradation tracked in real time, and interventions scheduled with increasing precision. In this sense, the system appears more transparent than ever before.

But this transparency can be misleading. Data reveal what is happening within the asset, but not necessarily what it means for the system as a whole. A signal may indicate a developing fault, but its significance depends on factors beyond the data itself—operational constraints, dependencies, and the system’s ability to respond. Data require interpretation to become useful. Without context and judgment, more information does not bring understanding closer. Paradoxically, while failures are detected earlier, responses remain constrained by frameworks designed for a different context.

This is not a failure of technology, but of interpretation.

When the signals of Condition-Based Maintenance, for example, are not integrated into a broader understanding of risk—one that includes context and exposure—their value is limited. They inform, but they do not transform. What emerges is a distinction between knowing more and understanding better. The former depends on data and infrastructure, the latter on how that information is framed and interpreted. In continuously evolving systems, data alone cannot provide stability—they can only reflect instability with greater precision.

Maintenance as Resilience: For decades, the role of maintenance was clear: prevent failure, reduce downtime, and optimize cost within stable operating conditions. But when those conditions change, the problem itself changes. In uncertain environments, the question is no longer simply how to avoid failure, but how to continue operating when it occurs in unexpected circumstances. The focus moves beyond prevention toward the system’s ability to absorb disruption, adapt, and recover without losing coherence—what we now understand as resilience.

A resilient system is not one that never fails, but one that does not collapse when it does. It continues to function, even under degraded conditions, without crossing into instability. The distinction reshapes the role of maintenance from doing the right things under expected conditions to doing the right things when those conditions no longer hold. Resilience introduces a temporal dimension, where decisions are not only about preventing failure, but also about how the system responds over time.

When dependencies fail or conditions shift, the question becomes whether the system can adapt without triggering further disruption. This requires flexibility—operational, organizational, and technical. Maintenance, in this sense, extends beyond assets. It becomes a capability that supports the organization’s ability to respond under uncertainty. Beyond reliability, decisions about maintenance influence continuity and performance over time.

Failures do not propagate—they interact within a system that reshapes them.

Systems optimized purely for efficiency may perform well under stable conditions, yet become fragile when conditions change. Resilience acts as a counterbalance, ensuring adaptability in the face of uncertainty. The challenge is to design systems that are both competitive and capable of enduring disruption.

A New Logic: If maintenance is to support resilience, the way risk is understood must also evolve. Not by discarding existing principles, but by extending them beyond the limits for which they were designed. Probability and consequence remain essential, but they no longer capture the full dynamics of how disruption unfolds. What has been missing is not another variable, but a recognition that the significance of failure depends as much on the state of the system as on the failure itself. The same event does not carry the same weight under different conditions. At times, the system can absorb disturbance; at other times, even minor deviations can trigger disproportionate effects. Therefore, risk can no longer be treated as a fixed attribute. It is a condition that evolves with the system, reflecting not only what might happen, but also how prepared the system is to respond. A likely failure may represent little risk if the system is resilient, while a rare event may become critical if exposure is high.
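One way to make this state dependence concrete is to scale the familiar probability-and-consequence score by a factor that reflects how well the system can cope at that moment. The sketch below is only an illustration under stated assumptions: the `SystemState` fields, the `vulnerability` formula, the rule that a missing spare triples the repair time, and all numbers are hypothetical and are not a method proposed in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemState:
    spare_on_site: bool  # hypothetical: is a replacement part available now?
    buffer_hours: float  # hypothetical: how long downstream can run without the asset
    repair_hours: float  # hypothetical: repair time when a spare is available


def vulnerability(state: SystemState) -> float:
    """Illustrative factor in [0, 1]: share of the outage the buffer cannot cover."""
    if state.repair_hours <= 0:
        return 0.0
    # Assumption for illustration: a missing spare triples the time to restore.
    effective_repair = state.repair_hours * (1.0 if state.spare_on_site else 3.0)
    uncovered = max(effective_repair - state.buffer_hours, 0.0)
    return uncovered / effective_repair


def contextual_risk(probability: float, consequence: float, state: SystemState) -> float:
    """Classic score scaled by the system's current vulnerability."""
    return probability * consequence * vulnerability(state)


# A likely failure in a system that can largely absorb it.
resilient = SystemState(spare_on_site=True, buffer_hours=3.0, repair_hours=4.0)
likely = contextual_risk(0.50, 10.0, resilient)

# A rare failure in a system with high exposure.
exposed = SystemState(spare_on_site=False, buffer_hours=1.0, repair_hours=4.0)
rare = contextual_risk(0.02, 100.0, exposed)

print(f"likely failure, resilient system: {likely:.2f}")
print(f"rare failure, exposed system:     {rare:.2f}")
```

With these made-up numbers the plain product ranks the likely failure higher (5.0 against 2.0), while the state-aware score reverses the order. The point is the reversal, not the particular formula: any factor that tracks the system's current ability to absorb disruption would make the same argument.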

Admittedly, this perspective makes decision-making more demanding. It requires continuous interpretation, the ability to reassess assumptions, and the integration of information that does not fit neatly into predefined categories. Maintenance strategies can no longer be static; they must adapt as the system evolves. From a practical standpoint, it is not enough to ask how likely a failure is, or how severe its consequences might be in general terms. The relevant question becomes: what does this failure mean for the system, here and now?

Answering this requires more than data. It requires a framework that integrates context with analysis, bridging the gap between technical knowledge and operational reality. This is not a departure from established practices, but a reorientation. Reliability and condition monitoring remain essential, but their value depends on how they are connected and interpreted. Risk is no longer static, but dynamic—continuously shaped by the system it describes. In this connection between what the system is and what it is becoming, maintenance begins to act not as a technical function, but as a strategic capability.

The Real Failure: The world maintenance was designed for has quietly changed. The principles that once provided clarity remain valid, but the system they describe no longer behaves in the same way. Its boundaries have expanded, its dependencies have multiplied, and its behavior has become less predictable. In this landscape, the greatest risk is not failure itself, but the persistence of assumptions that no longer reflect reality. Systems are still managed as if conditions were stable, even as they continue to evolve. The meaning of failure shifts with the system, often without being fully recognized.

The more precise our tools become, the stronger the belief that uncertainty is under control. Yet the system reveals new forms of unpredictability—not because it is less understood, but because it is more interconnected and exposed to forces beyond its immediate structure. Control must be reconsidered, not as something that can be fully achieved, but as something that must be continuously negotiated. Maintenance is no longer about preserving stability, but about enabling the system to navigate change without breaking. This shift extends beyond the technical domain. It affects how decisions are made, how risk is perceived, and how efficiency is balanced with adaptability.

Failure itself does not disappear. What changes is its significance. The focus moves from the event to the system’s ability to absorb and respond to it. The real failure is not when an asset stops working, but when the system cannot respond in time—when the gap between expectation and reality becomes too wide to manage. This does not provide certainty, but a different perspective: one that accepts that systems evolve, risk is contextual, and control is always partial. Perhaps that is where the discipline must now position itself—not in the confidence of having mastered the system, but in the awareness of how easily that confidence can be misplaced. The difference is whether we question it in time—or only once it breaks. That silence is still there—just harder to recognize.

Text: Prof. Diego Galar
Photo: Shutterstock
