The Dashboard Isn’t the Answer

It never was.

At some point in the last decade, the network and security operations industry collectively decided that the problem was visibility, and that the solution was dashboards. If you could just see everything, in one place, you could fix anything faster. So we built dashboards. We bought tools that promised a single pane of glass. We bought more tools when the first ones didn’t quite cover everything. We configured alerts, tuned thresholds, built views for every team and every use case. And after all of that, most organizations still have MTTD and MTTR numbers they aren’t proud of, and an ops team that spends a meaningful portion of its week managing the monitoring rather than managing the infrastructure.

The visibility wasn’t the problem. The assumption that more visibility, better configured, would translate to faster resolution was the problem. It doesn’t work that way, and the industry is slowly arriving at an honest reckoning with why.

I got an early look at where at least one major vendor thinks this is going when I attended AI Field Day 8 last month. Cisco presented their Intelligent Observability and AgenticOps story, along with an early demo of something called AI Canvas inside Cisco Cloud Control. A week ago Cisco formally unveiled Cloud Control at Cisco Live. Having seen it before the announcement, I think it’s worth talking about not just as a product, but as a signal about where operational UX is heading and why the direction matters.

The Tool Sprawl Trap

The single pane of glass pitch has always had a logical problem: every tool that promises it is adding another pane. Most mature operations environments have accumulated a collection of monitoring and observability platforms, each of which is genuinely useful within its domain, and none of which talk to each other in any meaningful way. When something breaks at the application layer, the workflow for figuring out whether the fault belongs to networking, compute, storage, security, development, or something else is mostly manual. You pull data from four different systems, you correlate it in your head or in a spreadsheet, you pass context to the next team, and that team starts from scratch because the context didn’t survive the handoff. Sure, eventually you’ll agree it was DNS all along, but it takes considerable cycles to get there.

The tools with the most ambitious feature sets make this worse in a specific way: when a platform gives you complete flexibility to build any dashboard you want, visualize any metric in any combination, and configure any alert threshold you choose, it is making an implicit assumption that the person operating it has both the domain expertise to know what to build and the statistical background to build it well. Most ops engineers have the domain expertise, but very few have the statistics background (and those that do are probably not paid enough). The result is that the dashboards either don’t get built at all, or they get built once by whoever was available at the time and then left alone, which is roughly as useful as not having them. Monitoring is not a set-it-and-forget-it system. Networks change, traffic patterns shift, what constituted a normal baseline six months ago may not be normal today. Dashboards that aren’t actively maintained are telling you a story about the past, not the present.
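To make the stale-baseline point concrete, here is a minimal sketch of a threshold that re-learns "normal" from a sliding window, next to a static limit that was set once and left alone. Everything in it is an assumption for illustration: the `RollingBaseline` class, the z-score approach, the window size, and the utilization numbers are all made up, and none of it reflects how any particular monitoring product works.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Flags samples that deviate from a baseline learned over a sliding window."""

    def __init__(self, window: int = 30, z_limit: float = 3.0) -> None:
        self.samples = deque(maxlen=window)  # oldest samples fall off automatically
        self.z_limit = z_limit

    def is_anomalous(self, value: float) -> bool:
        anomalous = False
        # Wait for a handful of samples before judging anything.
        if len(self.samples) >= 5:
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.z_limit
            else:
                anomalous = value != mu
        # Every sample, anomalous or not, feeds the baseline in this simple version.
        self.samples.append(value)
        return anomalous


STATIC_LIMIT = 70.0  # hypothetical threshold chosen when traffic was lighter

baseline = RollingBaseline(window=30, z_limit=3.0)
# Hypothetical link utilization (%) after traffic settled at a new normal near 80.
new_normal = [80.0, 81.0, 79.0, 80.5, 79.5, 80.0, 81.0, 79.0, 80.0, 80.5]

rolling_alerts = [baseline.is_anomalous(v) for v in new_normal]
static_alerts = [v > STATIC_LIMIT for v in new_normal]
spike_flagged = baseline.is_anomalous(97.0)
```

In this toy data the static limit fires on every sample, which is exactly the kind of alert people learn to ignore, while the rolling baseline stays quiet until the genuine spike. The trade-off is that a rolling baseline will also quietly absorb a slow degradation, which is part of why tuning this well takes real statistical care.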

What happens in practice is that the dashboard becomes the work. Configuring it, troubleshooting why an alert fired when it shouldn’t have, arguing about thresholds in a change management meeting: these things consume cycles that should be going toward actual operational outcomes. We lose the forest for the trees. Visibility was supposed to be the means to an end, and it became the end itself.

A Different Design Philosophy

What Cisco showed at AI Field Day 8, and what they formally launched at Cisco Live, takes a different starting assumption. Rather than giving operators an infinitely configurable canvas and asking them to build the right view, AI Canvas generates the view dynamically in response to an active investigation. When an incident opens, the AI model starts pulling telemetry from connected products (networking, compute, security, observability), and as it identifies relevant signals, it renders widgets automatically in the shared workspace. You are not looking at a dashboard someone built in advance. You are looking at the data that is actually relevant to this specific problem, assembled in real time, as the investigation develops. Arun Annavarapu, Cisco’s Director of Product Management for Nexus Dashboard, described this as “gen UI”: the interface is generated by the context, not preconfigured by an admin.

This is worth dwelling on, because it is honestly the first time I feel like I have seen “AI” added to a solution in a way that is more than a bolt-on “us too!” response. The default move when a software company decides to add AI is to add a chatbot. There is a text box somewhere on the screen now, and you can ask it questions. That is not a rethought user experience, and I think most people would agree that we roll our eyes at it.

What Cisco has done with AI Canvas is ask a more fundamental question: if an AI model is actively working through a problem alongside a human operator, what should the interface actually look like? The answer they arrived at is that the interface should reflect the investigation in progress, not a static configuration someone made in advance. The natural language capability is there, but it is in service of a UX that would still represent a meaningful improvement even if you described it purely in terms of how context gets assembled and surfaced. That is a harder design problem to solve than adding a chat window, and it is a better answer to what operations teams actually need.

It is also worth acknowledging that Cisco, in one capacity or another, has been working toward this problem for a while. SecureX (which has since evolved into Cisco XDR) felt like a genuine attempt at cross-domain correlation, and it did (and does) some cool things. When a threat surfaced in one product, SecureX could trigger responses in others and assist in automatically correlating those events for faster triage. ISE’s Rapid Threat Containment does similar things, signaling across integrations with products like Secure Endpoint and FMC and others to contain problematic endpoints without waiting for a human to connect the dots. These aren’t trivial capabilities, and the directional thinking behind them was correct. The distinction here is that those systems are more or less event-driven and reactive, executing prescribed automation actions in response to specific triggers, and they are most effective when you’re operating substantially within the Cisco ecosystem. What AI Canvas is doing is different: it is contextual, investigative, and generative rather than reactive and prescribed. The agent is not executing a playbook when a human-defined condition is met. It is reasoning across domains, assembling evidence, and building an understanding of a problem it has never seen before in exactly this form. That is a meaningfully different capability, built on a foundation that technology like SecureX and RTC and others helped establish as possible.

Furthermore, the cross-domain correlation happens in parallel rather than serially. Cisco’s multi-agent architecture runs domain-specific agents simultaneously, one querying compute metrics through Intersight and another pulling fabric anomalies from Nexus Dashboard, while a supervisor agent handles correlation and reasoning across what both return. What may have taken hours of manual tab-switching and context-passing can instead surface as a coherent picture in minutes. And when that picture needs more than one person to act on it, you invite the relevant personas into the shared canvas board (your network admin, your security analyst, whoever the problem calls for), and everyone is looking at the same dynamically assembled view, with RBAC governing what each role can see and query.
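The parallel fan-out pattern described above can be sketched in a few lines of Python with `asyncio`. This is only an illustration of the shape, not Cisco's implementation: the agent functions, their canned return values, and the incident string are hypothetical stand-ins, and a real domain agent would be calling a product API rather than sleeping.

```python
import asyncio


async def compute_agent(incident: str) -> dict:
    # Hypothetical stand-in for a compute-domain query; returns canned data.
    await asyncio.sleep(0.01)
    return {"domain": "compute", "signals": ["host-12 CPU saturation"]}


async def fabric_agent(incident: str) -> dict:
    # Hypothetical stand-in for a fabric-domain query; returns canned data.
    await asyncio.sleep(0.01)
    return {"domain": "fabric", "signals": ["leaf-3 uplink CRC errors"]}


async def supervisor(incident: str) -> dict:
    # Run both domain agents concurrently instead of one after the other.
    # return_exceptions=True lets one agent fail without losing the other's result.
    results = await asyncio.gather(
        compute_agent(incident),
        fabric_agent(incident),
        return_exceptions=True,
    )
    findings, failed = {}, []
    for result in results:
        if isinstance(result, Exception):
            failed.append(repr(result))
        else:
            findings[result["domain"]] = result["signals"]
    # A real supervisor would reason across the findings; this one only collects them.
    return {"incident": incident, "findings": findings, "failed": failed}


report = asyncio.run(supervisor("app latency spike"))
```

The design choice worth noticing is that the supervisor owns correlation while each agent owns exactly one domain, so adding a security or observability agent means adding one more coroutine to the `gather` call rather than rewriting the workflow.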

There is also a quiet detail in the Cisco Live announcement that deserves attention: context persistence. The workspace maintains the full investigation history across shifts and escalations. If you escalate to a network admin three hours into a triage, that person joins a canvas board that already contains everything found so far. Nothing is re-explained. Nothing is reconstructed from memory or tribal knowledge. That is a real improvement to how operational teams actually function, and it helps address a systemic issue that shows up in almost every major incident postmortem.

The Change Testing Problem

There is a related problem that doesn’t get discussed as much as dashboard sprawl, but is just as operationally painful: testing changes before you make them. In theory, you have a lab environment that mirrors production closely enough to validate a change before it touches anything real. In practice, most organizations don’t have that, or they have something that was once close to production and has since drifted into a state that gives false confidence rather than genuine validation. Maintaining a real lab is expensive and requires the same care and feeding discipline that monitoring does (and often more OPEX/CAPEX), and it tends to fall to the same “we’ll get to it” fate.

apparently the AI image gen wanted to add christmas trees. my token ROI is negative.

Software development solved this problem through a combination of discipline and tooling that the networking world has only recently begun adopting (and let’s face it: that adoption is still in its infancy). The DevOps movement codified something that should have been obvious but wasn’t always practiced: you don’t push to production without testing in an environment that reflects production as closely as possible. CI/CD pipelines automate that gate. Infrastructure as Code makes the environment itself version-controlled, reproducible, and testable. A change to a Terraform module can be validated in a staging environment before it touches anything live, and the pipeline enforces that discipline so it doesn’t depend on any individual engineer remembering to do it. The rigor is built into the workflow.

Network operations has mostly not had that. Changes go through change management processes that are often more about documentation and approval than genuine technical validation. Engineers with enough experience develop good instincts for what a change will do, and those instincts are usually right, which is a fragile way to run production infrastructure. The Digital Twin capability in Cisco Cloud Control brings the CI/CD philosophy to the network change workflow. The ability to model a proposed change against a live representation of the actual production environment, see its projected impact, and validate behavior before anything touches the network is the network equivalent of a staging pipeline. It is not a replacement for good change management practice, but it is a step towards closing a gap between how software teams and network teams approach risk that has been widening for years. Bringing those disciplines closer together is the right direction.
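As a toy illustration of what "validate against a model before touching the network" means, the sketch below applies a proposed change to a copy of a small topology and reports which required flows would break. It is nowhere near what a real digital twin does, and it says nothing about how Cisco's Digital Twin works internally; the topology, the link names, and the `validate_change` helper are all hypothetical.

```python
from collections import deque

Link = frozenset  # an undirected link is a frozenset of two node names


def reachable(links, src, dst):
    """Breadth-first search over undirected links."""
    neighbours = {}
    for link in links:
        a, b = tuple(link)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def validate_change(links, remove, must_reach):
    """Apply the change to a copy of the model and list the flows it would break."""
    proposed = set(links) - set(remove)
    return [(s, d) for s, d in must_reach if not reachable(proposed, s, d)]


# Hypothetical two-leaf, two-spine model of production.
links = {
    Link({"leaf1", "spine1"}), Link({"leaf1", "spine2"}),
    Link({"leaf2", "spine1"}), Link({"leaf2", "spine2"}),
}
must_reach = [("leaf1", "leaf2")]

# Removing one uplink leaves a redundant path; removing both isolates leaf2.
safe = validate_change(links, {Link({"leaf2", "spine1"})}, must_reach)
unsafe = validate_change(
    links, {Link({"leaf2", "spine1"}), Link({"leaf2", "spine2"})}, must_reach
)
```

A pipeline gate then becomes a one-line decision: an empty list lets the change proceed, and anything else blocks it and tells the engineer which flows were at risk.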

What This Is Signaling

If you step back from the specific Cisco implementation, there is a broader pattern I am reminded of. The architecture behind AI Canvas and AgenticOps is structurally the same decomposition pattern that drove the microservices movement in software engineering: domain-specific agents with bounded responsibilities, coordinated by an orchestrator, communicating through standardized interfaces like MCP. We traded monolithic applications for distributed services because bounded contexts and defined interfaces scaled better and failed more gracefully. We appear to be making the same shift in AI-assisted operations (and more systemically in AI-driven system architecture as a whole), trading monolithic dashboards and manual correlation workflows for distributed agent pipelines that can run in parallel, fail independently, and be extended without rebuilding the whole system. That decomposition also opens up something worth thinking about from a cost and efficiency standpoint: not every agent needs to be running on a frontier model with maximum parameters and maximum compute appetite. A domain agent with a narrow, well-defined job (checking fabric anomalies, querying compute metrics) can likely be right-sized to a smaller, purpose-built model that costs a fraction of what a general-purpose frontier model costs at scale. This is the beginning of an answer to the tokenomics problem that anyone running AI workloads at scale is already starting to feel, and it is a more structurally honest answer than simply hoping inference costs come down on their own.

That decomposition parallel comes with a duality: the microservices transition produced real gains, but it also produced service sprawl, ownership gaps, and observability problems that took years to untangle. The cloud migration wave produced the same pattern at larger scale: move fast, move everything, “replicate” the on-prem design in the cloud, figure out governance later. Many organizations that rushed to cloud (especially during the pandemic) are still paying for the decisions made in that sprint (environments nobody fully understands, costs nobody can fully account for, attack surfaces larger than anyone planned for). AI adoption is, by and large, being teed up the same way. The business pressure to deploy is already real, it is accelerating, and the governance frameworks are struggling to keep up. It’s important that we think about how not to repeat this pattern.

Cisco’s AgenticOps model (the deliberate crawl-walk-run progression from observability to assisted operations to autonomous agents) feels like an honest way to answer at least some of that pressure. The framing is essentially: don’t skip the observability step, don’t skip the assisted step, build trust in the technology at each stage before handing it more autonomy. It is less dramatic than “just deploy agents,” and it is more likely to produce operational environments that people actually understand and can govern. The harder questions about runtime enforcement, agent identity, and auditability at scale are ones the whole industry is still working through. Cisco acknowledged the right primitives (RBAC, stateless domain agents, policy-bounded scope), but those are design-time controls. What “what exactly did the agent do, and was it authorized to do it?” looks like as a runtime answer, at production scale, with hundreds of agents running across a complex environment at speeds far exceeding that of human-driven workflows, is a question worth staying close to as this technology matures.
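To make the runtime question a little more tangible, here is a minimal sketch of one possible shape for an answer: every agent action passes through a single choke point that checks it against a policy and records the decision, allowed or not. This is my own illustration rather than anything Cisco described, and the `Policy` and `AuditedExecutor` names, the action strings, and the agent IDs are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Policy:
    agent_id: str
    allowed_actions: frozenset


@dataclass
class AuditedExecutor:
    policies: dict                            # agent_id -> Policy
    log: list = field(default_factory=list)   # append-only record of decisions

    def execute(self, agent_id: str, action: str, target: str) -> bool:
        policy = self.policies.get(agent_id)
        # Unknown agents and out-of-scope actions are both denied.
        allowed = policy is not None and action in policy.allowed_actions
        # Record the decision whether or not it was allowed.
        self.log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": agent_id,
            "action": action,
            "target": target,
            "allowed": allowed,
        })
        # A real executor would perform the action here when allowed.
        return allowed


executor = AuditedExecutor(
    {"fabric-agent": Policy("fabric-agent", frozenset({"read_telemetry"}))}
)
ok = executor.execute("fabric-agent", "read_telemetry", "leaf-3")
denied = executor.execute("fabric-agent", "shutdown_interface", "leaf-3")
unknown = executor.execute("rogue-agent", "read_telemetry", "leaf-3")
```

The hard part at production scale is not this check itself but everything around it: proving the agent is who it claims to be, keeping the log tamper-evident, and doing all of it fast enough that it does not become the bottleneck.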

None of that is a reason to wait. It is a reason to move with intent, which is exactly what the crawl-walk-run framing is asking you to do. The direction Cisco is pointing (operational context assembled dynamically rather than configured statically, parallel agent investigation rather than serial human correlation, change validation through digital twin rather than improvised lab environments) is the right direction.

The industry has been solving the wrong problem for a decade – it is good to see us starting to solve the right one.

CCIE #TBD

A CCIE Study Journal

Published by malb9001
