RelA: Reliability as a Control Problem

Why feedback lag and decision ordering matter more than mechanisms

Reliability failures are rarely caused by missing mechanisms.

Most modern systems already have retries, timeouts, hedging, rate limits, and autoscaling. When systems still fail under load, it’s usually because these mechanisms activate too late, in the wrong order, or without shared context.

RelA (Reliability Autohealer) started as a small experiment to study that behavior.

The goal was not to build a faster circuit breaker or a production-grade reliability system. It was to isolate a simpler question:

What happens when reliability decisions are treated as a control problem rather than as static configuration embedded in application code?

Reliability is not a library problem

In many architectures, reliability logic lives inside SDKs or middleware. Each service independently decides when to retry, hedge, or shed load.

This works until the system is under stress.

At saturation, reliability logic embedded in application code competes with request handling for the very CPU cycles the service needs to survive. Decisions about what to drop and what to preserve are made locally, without visibility into global system state or business intent.

This creates a form of priority inversion: the system spends effort deciding instead of stabilizing.

RelA explores an alternative framing—separating decision-making from request handling.

Control planes and feedback loops

RelA models reliability as a feedback loop:

  • Signals are observed from the system (latency, saturation indicators).
  • Decisions are made based on policy and priority.
  • Actions are enforced externally on the application.
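The loop above can be sketched in a few lines. This is a minimal illustration, not RelA's actual API; the signal names, thresholds, and action labels are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    """Observed system state (names and fields are illustrative)."""
    p99_latency_ms: float   # tail latency signal
    cpu_utilization: float  # saturation indicator, 0.0-1.0

def decide(signals: Signals) -> str:
    """Map observed signals to one enforced action, based on policy.

    The policy lives here, outside request-handling code, so it can
    change without redeploying the application.
    """
    if signals.cpu_utilization > 0.9:
        return "shed_load"        # availability first under saturation
    if signals.p99_latency_ms > 500:
        return "disable_hedging"  # stop adding speculative work
    return "steady_state"

# One control iteration: observe, decide, enforce externally.
action = decide(Signals(p99_latency_ms=120, cpu_utilization=0.95))
print(action)  # shed_load
```

The key property is that `decide` runs outside the request path: the application only sees the enforced action, never the policy.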

This decouples the cycle of governance from the cycle of deployment. Decisions can change without redeploying code. More importantly, decision timing becomes explicit.

The experiment surfaced a recurring theme: feedback lag dominates behavior.

When signals arrive late or are smoothed over long windows, the system reacts to conditions that no longer exist. Recovery actions overshoot. Shedding continues after traffic subsides. Systems appear healthy while user experience degrades.
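The overshoot is easy to reproduce. The toy controller below smooths a load signal with an exponential moving average over a long window (the numbers are invented for illustration); after traffic drops, the smoothed signal stays above the shedding threshold for several ticks, so the system keeps shedding requests that it could have served.

```python
def ema(values, alpha):
    """Exponential moving average; small alpha = long window = more lag."""
    s = values[0]
    out = [s]
    for v in values[1:]:
        s = alpha * v + (1 - alpha) * s
        out.append(s)
    return out

# Load spikes to 0.95 for 10 ticks, then falls back to 0.30.
load = [0.95] * 10 + [0.30] * 10
smoothed = ema(load, alpha=0.1)  # heavily smoothed signal

SHED_THRESHOLD = 0.6
# Ticks where the controller still sheds after traffic has subsided:
stale_sheds = [t for t in range(10, 20) if smoothed[t] > SHED_THRESHOLD]
print(len(stale_sheds))  # 7 -- shedding continues well past the spike
```

Seven ticks of unnecessary shedding, purely from feedback lag: the controller is reacting to a spike that no longer exists.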

This is not a tooling issue but a control problem.

Decision ordering matters under load

Reliability mechanisms like retries, hedging, and load shedding are all locally correct. Under saturation, they compete. Without an enforced priority order, they can amplify failure instead of mitigating it.

RelA treats reliability actions as mutually exclusive modes, not independent toggles. Availability-preserving actions must dominate latency-improving actions when the system is in survival mode.

This is not an optimization. It is a safety constraint.

Reliability encodes product intent

Another observation from the experiment is that fairness is rarely aligned with business value.

Random load shedding treats all requests equally. In practice, systems often care about which requests succeed. Reliability decisions implicitly encode product priorities, whether or not the system acknowledges it.

RelA makes this explicit by allowing business context to participate in control decisions.

What this experiment is — and is not

RelA is intentionally minimal.

It does not attempt to establish absolute performance characteristics or production-scale benchmarks. Its purpose is to surface decision behavior, feedback dynamics, and failure modes that are invariant across scale.

The same patterns appear whether a system handles tens of requests or millions. Scale amplifies the cost, not the physics.

Why this matters

Most reliability incidents are not caused by the absence of mechanisms. They are caused by late decisions, conflicting actions, and stale signals.

Treating reliability as a control problem forces these issues to surface earlier—while the system still has options.