Before observability, there was monitoring - and it worked well for a long time. Traditional IT systems were mostly monolithic. In monolithic systems, monitoring rested on a simple premise: if you know what matters, measure it. Operations and infrastructure teams instrumented servers for CPU usage, memory, disk I/O, and network throughput. Application teams tracked error rates, request latencies, and transaction volumes. Monitoring relied on logs, threshold alerts, and dashboards. If CPU crossed 80%, an alert fired. If a server went down, paging systems triggered notifications. Threshold alerts notified on-call engineers when a metric crossed a boundary. Dashboards assembled these signals into a view that made system state readable at a glance
This worked well in an era of monolithic applications, where systems were relatively simpler and cause and effect were closely coupled. If a database failed, the application failed in predictable ways. When a product ran as a single deployable unit on a small number of servers, failure states were bounded and largely understood. If the application slowed down, you checked CPU. If users could not log in, you checked the authentication service. If orders failed, you checked the database connection. The causal chain was short and failure modes were known.
Engineers defined known failure conditions in advance. As long as the system stayed within expected boundaries, monitoring worked. Rule-based monitoring encoded this “known” knowledge explicitly. Engineers wrote alerts based on experience: "if error rate exceeds five percent for three minutes, page the on-call team." These rules worked because failure modes were known, repeatable, and stable.
But this model had a hidden assumption: that we know what can go wrong.
As systems evolved, that assumption began to break. Distributed architectures introduced partial failures, cascading dependencies, asynchronous behaviour, and non-deterministic interactions. Failures no longer followed simple patterns. As systems decomposed into services across cloud regions with containers and orchestration, failure modes exploded. No one could write rules fast enough. A latency spike might involve seventeen services, three external dependencies, a deployment, and unusual traffic patterns - none of which a single dashboard could show together.
Suddenly, dashboards showed “everything green” while users experienced outages. Alerts fired too late or too frequently, causing fatigue. Root causes were no longer obvious.
Monitoring remained effective for known failures, but was less effective at explaining unexpected ones.
The core limitation was structural: monitoring answers predefined questions, while modern systems require exploration of unknown unknowns.
This gap set the stage for a different paradigm - one that could handle complexity not by predicting every failure, but by enabling deeper understanding when failures inevitably occur.
← All posts