We’ve heard this story many times. An SRE facing a flood of alerts and repetitive operational work decides to build a quick solution. They connect their alerting system’s API to an LLM, feed it some recent error logs, and stand up a chatbot. After triggering a test alert, the bot produces a clear summary.
This looks like a quick win for the team. During the next engineering meeting, someone might ask why the company pays vendors for an AI tool when the team built one so quickly.
They have a valid point because they did build a working prototype. However, LLMs can make it look easier than it really is to turn a small demo into a reliable, scalable, and self-improving platform. A prototype runs in a deterministic setting, whereas a production tool must handle an ever-changing, complex environment.
First, Figure Out Which Problem You Actually Have
The AI SRE industry often treats all problems the same, missing the structural divide between low-scale and high-scale environments. Before evaluating any tool, including ours, you need to correctly diagnose which environment your team operates in by looking at your team's behaviors.
The High-Scale Checklist:
- The Reflex: When an alert fires at 3 AM, does your first instinct send you to a Grafana or Datadog dashboard to find the system issue, or directly to your terminal to find what changed in the code? If it is the dashboard, you are operating in a systems world.
- The Vocabulary: When your team debriefs after a critical incident, does the conversation center on P99 latency or stack traces?
- The Priority: Is your immediate goal mitigation (getting traffic back to green via load shedding, rollbacks, or feature flags) or resolution (pushing a code fix)? If you triage before you debug, you are in systems territory.
- The Dependency: Do your most difficult incidents originate in upstream services your team did not write, cannot easily see into, and cannot deploy?
- The Event: Has your system ever failed because of a traffic spike, rate limiting, or an expired external dependency in code that had not changed in weeks?
Two Clearly Defined Worlds
Low-Scale (The Code Problem): If none of this resonated, your incidents likely live in the IDE. You operate in a deterministic environment with clear cause-and-effect chains. If this is you, you should build your own tool by customizing popular coding agents like Claude Code. Building internally is a fully viable strategy here.
High-Scale (The System Problem): If you recognized your team in two or more of those scenarios, you have a high-scale system problem. You deal with telemetry-first debugging workflows and non-deterministic failures driven by cascading dependencies. Turning a Friday prototype into a production AI SRE in this environment is a specific kind of trap.
The trap exists because most internal builds are, by necessity, "Code-First." As engineers, we build with the tools we know. We connect an LLM to our repos, set up RAG over our Confluence pages, and call it an SRE. Without realizing it, we’ve built a world-class coding assistant, but we’ve left it blind to the underlying infrastructure.
Why Code-First Tools Struggle at Scaled Problems
High-scale triage starts with mountains of telemetry data that must be sifted to locate the issue. For example, you might see a spike in error rates reported by your load balancer impacting end users. At first glance, nothing points to a specific service. With hundreds of microservices behind that entry point, you begin drilling down through dashboards—filtering by region, service, and dependency—to isolate the culprit. Eventually, you trace the issue to a downstream service that depends on a third-party API experiencing intermittent failures. Your code has not changed, and there is no bug to fix, but the system is unstable because of an external dependency. A code-first AI is blind to this category of failure.
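The drill-down described above is essentially grouping error telemetry by one dimension at a time until a hotspot emerges. A minimal sketch, using made-up flattened events (the field names and services are illustrative, not any particular vendor's schema):

```python
from collections import Counter

# Hypothetical flattened telemetry events from a load balancer.
events = [
    {"service": "service-a", "region": "us-east-1", "status": 500},
    {"service": "service-a", "region": "us-east-1", "status": 500},
    {"service": "service-b", "region": "us-east-1", "status": 200},
    {"service": "service-a", "region": "eu-west-1", "status": 200},
]

def error_hotspots(events, dimension):
    """Count 5xx errors grouped by one dimension to narrow the search."""
    return Counter(e[dimension] for e in events if e["status"] >= 500)

# Drill down: group by service first, then by region.
by_service = error_hotspots(events, "service")
by_region = error_hotspots(events, "region")
```

In a real investigation each grouping would be a dashboard filter or query, not an in-memory loop, but the shape of the search is the same: narrow one dimension, then the next.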
Furthermore, the immediate goal during a critical incident is mitigation. You load shed or roll back to preserve the rest of your traffic while the investigation continues. A code-first tool lacks this priority ordering.
Finally, there is the issue of localized knowledge. In many high-scale organizations, a small percentage of engineers drive the majority of incident progress because they know which dashboard to check first or how specific services interact. That knowledge lives in Confluence pages, Slack threads, and their own heads—not in the code repository. An AI that only reads code misses this critical context. At high scale, you need an observability-native tool.
This brings us to the question engineers ask next: "Why can't we just build that internally?"
Three Common Challenges of Internal Builds
Challenge 1: Scaling institutional knowledge across teams is complex.
- The Assumption: Creating a basic Retrieval-Augmented Generation (RAG) layer over internal documentation or hardcoding runbooks is a straightforward project.
- The Reality: Building a localized AI tool for one specific team is manageable. However, these solutions often struggle to scale across an organization with diverse and independent services.
- The Evidence: Well-resourced engineering teams have encountered this exact issue. In some cases, different teams within the same company build separate, isolated versions of an AI assistant. Because the knowledge is localized, a tool built for the payments team cannot communicate with a tool built for the checkout team. This fragmentation causes a loss of cross-team context during incidents. Furthermore, static RAG systems quickly become outdated. Addressing this effectively requires a centralized and continuously updated knowledge graph.
Challenge 2: Building effective self-learning and feedback loops is resource-intensive.
- The Assumption: Launching the initial version of the tool is the most difficult phase, and subsequent work is mostly routine maintenance.
- The Reality: The initial build is often the most straightforward part of the process. Developing a system that continuously learns and adapts to changing infrastructure over time is much more difficult.
- What is Required: A production-ready AI tool needs to process both explicit signals (like an engineer flagging an incorrect suggestion) and implicit signals (like a team ignoring the AI's output and pursuing a different path). Without an intelligent way to capture these feedback loops, the tool's effectiveness becomes difficult to measure objectively, and its performance can degrade as the system evolves.
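One way to sketch the explicit-versus-implicit distinction: record each suggestion alongside what responders actually did, and count a suggestion as accepted only when neither signal contradicts it. The record fields and the acceptance metric here are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SuggestionRecord:
    incident_id: str
    suggestion: str
    explicit_flag: Optional[str] = None    # e.g. "wrong", set by an engineer
    responder_action: Optional[str] = None  # what the team actually did

def was_accepted(rec: SuggestionRecord) -> bool:
    """Accepted only if not explicitly flagged wrong AND the team
    did not silently go down a different path (implicit signal)."""
    if rec.explicit_flag == "wrong":
        return False
    return rec.responder_action == rec.suggestion

records = [
    SuggestionRecord("INC-1", "rollback service-a",
                     responder_action="rollback service-a"),
    SuggestionRecord("INC-2", "restart pods", explicit_flag="wrong",
                     responder_action="shift traffic"),
]

acceptance_rate = sum(was_accepted(r) for r in records) / len(records)
```

The hard part in practice is populating `responder_action` reliably, which usually means instrumenting the incident workflow itself.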
- The Evaluation Problem: To verify that updates actually improve the model's accuracy, teams need to build a backtesting system. This involves replaying historical incidents against new model versions. Building a queryable incident history and a reliable evaluation framework often means shifting focus from building an SRE tool to building and maintaining a custom machine learning platform.
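The backtesting loop described above can be sketched in a few lines: replay labeled historical incidents through both the current and a candidate diagnosis function, and promote the candidate only if it does not regress. The incident data and both "models" are stand-ins; a real system would call an LLM pipeline and score partial matches:

```python
# Labeled replay set built from past incidents (illustrative).
history = [
    {"symptoms": "5xx spike on load balancer", "root_cause": "dependency-timeout"},
    {"symptoms": "p99 latency up on checkout", "root_cause": "cache-eviction"},
]

def accuracy(model, incidents):
    """Fraction of replayed incidents the model diagnoses correctly."""
    hits = sum(model(i["symptoms"]) == i["root_cause"] for i in incidents)
    return hits / len(incidents)

def current_model(symptoms):
    return "dependency-timeout"  # stand-in: always guesses one cause

def candidate_model(symptoms):
    return "cache-eviction" if "latency" in symptoms else "dependency-timeout"

# Gate: ship the candidate only if it does not regress on the replay set.
promote = accuracy(candidate_model, history) >= accuracy(current_model, history)
```

Everything around this loop—building the queryable incident history, keeping labels trustworthy, replaying at scale—is the custom ML platform the text warns about.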
Challenge 3: Observability integration requires more than MCP.
- The Assumption: Connecting your observability stack to an LLM using an API wrapper or MCP is the primary hurdle. Once the model has data access, it will automatically provide useful insights.
- The Reality: Raw data integration is usually insufficient. Providing an LLM with a large volume of unfiltered observability data often creates more noise, which can slow down the triage process.
- What is Required: Effective tools need opinionated and specific data retrieval. The system must pull the exact right data context before it can do any useful analysis.
- The Proper Workflow: Consider how an engineer uses observability data during an outage. They do not just run a general text search for errors. They form hypotheses based on the symptoms, and they know where to find the relevant dashboards, queries, and logs to validate them. They may use template variables to filter the view to a specific environment, region, and service version—for example, `environment=prod` and `service=service-a-v2`.
- If an AI agent relies on simple API configurations or basic semantic search, it lacks this precision. It might see an alert for a service and pull logs for an older version, like `service-a-v1`, or grab logs from a staging environment simply because the text looks similar.
- The AI will then analyze this irrelevant data and attempt to diagnose the problem based on the wrong context. This is a classic garbage-in, garbage-out scenario that drives the agent down the wrong rabbit hole. To be useful, the system requires deliberate product design that enforces the same strict data filters a human relies on, ensuring the model receives only the exact data it needs to investigate the issue.
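One simple way to enforce those filters is to make unscoped retrieval impossible by construction: the query builder refuses to run unless the same filters a human would apply are present. The required fields and the query string format below are illustrative assumptions, not any specific vendor's query language:

```python
REQUIRED_FILTERS = ("environment", "service", "version")

def build_log_query(text: str, **filters) -> str:
    """Build a scoped log search; raise rather than run an unscoped one."""
    missing = [f for f in REQUIRED_FILTERS if f not in filters]
    if missing:
        raise ValueError(f"refusing unscoped query; missing filters: {missing}")
    scope = " ".join(f'{k}="{v}"' for k, v in sorted(filters.items()))
    return f"{scope} search={text!r}"

# Correctly scoped: prod logs for the exact service version under alert.
q = build_log_query("timeout", environment="prod",
                    service="service-a", version="v2")
```

Making the guard a hard error (rather than a default) is the point: the agent cannot quietly fall back to staging data or an older service version just because the text matched.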
The Long-Term Maintenance Cost
Engineering time is a highly constrained resource. When infrastructure teams spend their cycles managing model drift or maintaining an internal AI tool, it creates an opportunity cost that impacts core product development.
There is also the risk of internal tools becoming unmaintainable over time. If the original developers change teams or leave the company, the institutional knowledge required to update the tool often leaves with them. This can result in a system that teams rely on yet hesitate to modify. When a team decides to own a custom AI SRE tool, they also take on the long-term responsibility of maintaining the infrastructure it relies upon.
Own Your Problem Type
The AI SRE industry often markets a single, generic tool for every engineering team. It is more effective to evaluate your specific needs rather than adopt a one-size-fits-all approach.
If your incidents primarily occur in code, and you operate a small team with predictable problems in a standard environment, code-centric tools are a good choice. They are built specifically for that use case.
However, if your outages involve latency cascades, rate-limiting failures, and alerts in services you do not own, your infrastructure requires a different approach. Understanding the code alone will not recover a failing platform.
Building an internal tool is possible; some teams have even done it well. But it’s rarely a "side project." It’s a commitment to a multi-quarter roadmap.
The real question isn't whether your team can build a custom machine learning platform over the next 18 months; it’s whether they should. Every sprint spent tuning RAG performance or managing model drift is a sprint stolen from your core product. In 18 months, would you rather have a homegrown triage tool or a 1.5-year head start on the features that actually drive your company’s revenue?
At Bacca, we designed our AI SRE for high-scale system environments. We use a dynamic knowledge graph to capture and organize information across different teams. We apply observability-native reasoning with structured data retrieval to interpret complex system signals, similar to how an experienced SRE operates, rather than approaching it as a code review.
Let's Exchange Notes
If you are currently building an internal tool and encountering these challenges, particularly regarding feedback loops and continuous improvement, we would value the opportunity to compare notes. We have analyzed these failure modes extensively and enjoy discussing the technical details.
Alternatively, if your team aligns with the high-scale criteria and you prefer to bypass the internal build process, you can request a demo to see how Bacca handles complex triage in practice.
Eric Lu
Founder & CEO, BACCA.AI