Why Knowledge Graphs Beat RAG for Incident Response

Dr. Xinran He

2025

Technology


The PagerDuty alert blares. You join the incident bridge to find your team struggling. Dashboards scream about 5xx errors impacting customers, but they are looking at the wrong data.

You suspect a pattern: a phantom pain from an outage six months ago. Back then, the root cause wasn't an application bug but a database silently running out of connections. The details of the problem and its fix were buried deep in an old Slack thread.

You redirect the team with that single piece of tribal knowledge. The incident gets resolved, but the fix was locked entirely in your head. You’re the hero, but you’re also the single point of failure.

This reliance on human memory and siloed expertise, often called institutional knowledge, is a significant obstacle in modern incident response. It doesn’t scale, it hinders onboarding, and it creates organizational risk when experts leave.

The Rise of the AI SRE

The industry is moving toward the AI SRE: an intelligent agent designed to join incidents, automate investigations, and reduce Mean Time to Resolution (MTTR). Done right, an AI SRE can begin troubleshooting an incident before a human opens their laptop.

However, the effectiveness of this AI depends entirely on the knowledge it’s trained on. A base LLM possesses vast "world knowledge"; for example, it can explain how Kubernetes works in general. But to troubleshoot your environment, the AI needs specific institutional knowledge, a contextual understanding of how your services interact and how they have failed in the past.

That makes the data model behind the AI SRE the deciding factor. While many tools offer semantic search over internal documentation, BACCA.AI’s core approach is a proprietary Knowledge Graph. This graph is not a search index; it is a structured, evolving model of the system it protects.

The Limits of RAG in Incident Response

For automated incident response, standard Retrieval-Augmented Generation (RAG) isn't enough. RAG is excellent at searching documents and surfacing relevant text, but what it returns is unstructured prose that shifts with the wording of the query. That makes RAG useful for providing context but insufficient as the foundation for an AI SRE agent that must take precise actions. In contrast, a Knowledge Graph (KG) provides a structured, reliable representation of your system, mapping services, components, and their relationships in a way an agent can both understand and act upon.

Here’s the key difference:

RAG provides context. It answers "what" and "when" by retrieving snippets of text. But because the outputs depend heavily on the input query, they are highly variable, non-deterministic, and hard to control. In high-stakes incidents, this unpredictability is a major risk.
KGs enable actions. They encode "how" and "why" through a structured model of the system. A KG lookup is deterministic, transparent, and controllable: the same question returns the same answer, so the agent can act on it with confidence.

At BACCA.AI, we believe a KG unlocks the next level of reliability for automated response because it is:

Structured and Visible: Unlike RAG’s opaque retrieval, a KG explicitly models your system’s state and dependencies. Nothing is hidden; everything is inspectable and controllable.
Actionable: KGs don’t just describe; they connect problems directly to the right solutions—linking an alert to the correct dashboard, log query, or remediation step.
Reliable: Because it is deterministic, a KG offers consistent outputs, making it a trustworthy foundation for automation.
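
To make the "actionable" and "reliable" properties concrete, here is a minimal Python sketch of an alert being resolved to actions through explicit links rather than free-text retrieval. It is an illustration only, not BACCA.AI's implementation; `Action`, `AlertGraph`, the alert name, and the query strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    kind: str    # "dashboard", "log_query", or "remediation"
    target: str  # hypothetical dashboard name, query string, or runbook step


@dataclass
class AlertGraph:
    # alert name -> component the alert fires on
    alert_to_component: dict[str, str] = field(default_factory=dict)
    # component -> actions attached to it, kept in insertion order
    component_actions: dict[str, list[Action]] = field(default_factory=dict)

    def link(self, alert: str, component: str, action: Action) -> None:
        self.alert_to_component[alert] = component
        self.component_actions.setdefault(component, []).append(action)

    def actions_for(self, alert: str) -> list[Action]:
        """Same alert in, same ordered actions out; unknown alerts return []."""
        component = self.alert_to_component.get(alert)
        return list(self.component_actions.get(component, []))


# Hypothetical example data.
graph = AlertGraph()
graph.link("checkout-5xx", "orders-db",
           Action("log_query", "service:orders-db status:error"))
graph.link("checkout-5xx", "orders-db",
           Action("dashboard", "orders-db connection pool"))

for action in graph.actions_for("checkout-5xx"):
    print(action.kind, "->", action.target)
```

Because the lookup follows stored links, the result can be inspected and corrected by editing the links themselves, which is the controllability the list above describes.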

In short, for an agent to act reliably, it needs the deep, structured understanding that a Knowledge Graph provides. It's the difference between reading a manual and having an expert on hand.

Moving From Unstructured Data to Structured Understanding

BACCA.AI’s primary engineering focus is transforming unstructured operational data into a structured, proprietary knowledge graph. We treat an organization's history as raw material to be refined into a model that replicates the mental map of a senior engineer.

The nodes in this graph represent system components (services, databases, cloud resources like AWS RDS, third-party dependencies). The edges map the relationships between them.
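
As a rough sketch of that shape, the Python below stores components as typed nodes and "depends on" relationships as edges, then walks the edges to list everything a service relies on. The class, the node kinds, and the service names are assumptions made for illustration; the actual graph schema is not described in this article.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SystemGraph:
    # node name -> kind ("service", "database", "cloud_resource", "third_party")
    nodes: dict[str, str] = field(default_factory=dict)
    # component -> names of the components it depends on
    depends_on: dict[str, set[str]] = field(default_factory=dict)

    def add_node(self, name: str, kind: str) -> None:
        self.nodes[name] = kind
        self.depends_on.setdefault(name, set())

    def add_dependency(self, source: str, target: str) -> None:
        # Both endpoints must already exist as nodes.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("add both nodes before linking them")
        self.depends_on[source].add(target)

    def transitive_dependencies(self, name: str) -> list[str]:
        """Breadth-first walk; neighbors are sorted so the order is stable."""
        seen = {name}
        order: list[str] = []
        queue = deque([name])
        while queue:
            current = queue.popleft()
            for dep in sorted(self.depends_on.get(current, ())):
                if dep not in seen:
                    seen.add(dep)
                    order.append(dep)
                    queue.append(dep)
        return order


# Hypothetical topology.
system = SystemGraph()
system.add_node("checkout-api", "service")
system.add_node("orders-service", "service")
system.add_node("payments-gateway", "third_party")
system.add_node("orders-rds", "cloud_resource")
system.add_dependency("checkout-api", "orders-service")
system.add_dependency("checkout-api", "payments-gateway")
system.add_dependency("orders-service", "orders-rds")

print(system.transitive_dependencies("checkout-api"))
```

A traversal like this is what lets an agent move from "checkout is failing" to "check the database two hops away" without a human supplying the path.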

The intelligence lies in how we annotate this graph with operational context extracted from historical data:

Relevant Telemetry Queries: Identifying the specific metrics, logs, and traces relevant for diagnosing a component, solving the "what to look at" problem.
Historical Failure Patterns: Learning which services are fragile and how they typically fail.
Human Investigation Steps: Extracting the exact actions (dashboard views, log queries) human engineers used to identify past root causes.
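
One way to picture these three annotations is as a small record attached to each node. The sketch below is an assumption about how such a record could look, not the product's data model; `ComponentContext`, the failure-mode labels, and the query strings are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ComponentContext:
    # Telemetry queries found useful for this component, without duplicates.
    telemetry_queries: list[str] = field(default_factory=list)
    # Failure mode -> number of past incidents with that mode.
    failure_patterns: Counter = field(default_factory=Counter)
    # Failure mode -> steps engineers used the last time it occurred.
    investigation_steps: dict[str, list[str]] = field(default_factory=dict)

    def record_incident(self, failure_mode: str, steps: list[str],
                        queries: list[str]) -> None:
        self.failure_patterns[failure_mode] += 1
        # Keep the most recent resolution path for this failure mode.
        self.investigation_steps[failure_mode] = list(steps)
        for query in queries:
            if query not in self.telemetry_queries:
                self.telemetry_queries.append(query)

    def likely_failures(self, top: int = 3) -> list[str]:
        """Failure modes ordered from most to least frequent."""
        return [mode for mode, _ in self.failure_patterns.most_common(top)]


# Hypothetical incident history for one database node.
orders_db = ComponentContext()
orders_db.record_incident(
    "connection_exhaustion",
    ["open connection pool dashboard", "search logs for 'too many connections'"],
    ["metric:db.connections.active"],
)
orders_db.record_incident(
    "slow_query", ["check slow query log"], ["metric:db.query.latency.p99"]
)
orders_db.record_incident(
    "connection_exhaustion",
    ["open connection pool dashboard"],
    ["metric:db.connections.active"],
)

print(orders_db.likely_failures())
```

The same `record_incident` call works for both historical backfill and newly resolved incidents, so the annotations can keep changing as the system does.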

Building the Graph: A Continuous Process

Generating an accurate knowledge graph requires a deliberate, continuous learning process designed to capture an ever-changing environment.

The Initial Learning Phase (Day 1)

When first installed, the bacca agent performs an offline historical analysis. It ingests years of data to build the initial knowledge graph from key sources:

Observability Systems: Analyzing telemetry data (e.g. Datadog logs, Prometheus metrics) and sampling distributed traces to map service architecture and dependencies.
Historical Incidents: Postmortems and Jira tickets provide structured accounts of past failures and resolutions.
Unstructured Data: Much operational knowledge resides in Slack. BACCA.AI uses LLMs to extract structured knowledge (tips, shared dashboard links, debugging steps) from this history.
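
Of these sources, trace sampling is the easiest to illustrate. The sketch below derives service-to-service dependency edges from parent/child span relationships. It assumes a simplified span record with unique span IDs; real tracing backends expose richer fields, and the `Span` type and service names here are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None for a root span
    service: str


def dependency_edges(spans: list[Span]) -> Counter:
    """Count (caller, callee) service pairs across parent/child spans."""
    service_by_span = {span.span_id: span.service for span in spans}
    edges: Counter = Counter()
    for span in spans:
        parent_service = service_by_span.get(span.parent_id)
        # Skip root spans, orphaned spans, and calls within one service.
        if parent_service is None or parent_service == span.service:
            continue
        edges[(parent_service, span.service)] += 1
    return edges


# Hypothetical sampled spans.
sampled = [
    Span("a", None, "checkout-api"),
    Span("b", "a", "checkout-api"),    # internal call, not a dependency
    Span("c", "b", "orders-service"),
    Span("d", "c", "orders-rds"),
    Span("e", "a", "orders-service"),
]

for (caller, callee), count in dependency_edges(sampled).items():
    print(f"{caller} -> {callee} (seen {count}x)")
```

The counts give a rough measure of how often each edge is exercised, which can help separate core dependencies from rarely used ones.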

This ensures that when the AI SRE faces its first live incident, it does not start from a blank page.

Continuous Evolution

A static knowledge graph quickly becomes stale. The bacca agent addresses this through continuous learning. It observes every new incident in real time, analyzing how the human team diagnoses and resolves the issue: what commands they run, what metrics they examine, and what conclusions they reach.

This learning is implicit. You do not need to manually update runbooks. As you troubleshoot normally in Slack, the bacca agent ingests the conversation and the resolution path, refining the graph to reflect the current state of the system.

Acting Like an Expert

A senior SRE does not randomly check dashboards. They use their internalized understanding of the system to form hypotheses based on symptoms (e.g., "This latency spike looks like last month's traffic surge," or "This could be due to a dependency failure").

The knowledge graph enables the bacca agent to mimic your best experts’ hypothesis-driven way of working. When a new incident occurs, the bacca agent generates two types of hypotheses:

Past Failure Patterns: The agent first checks for historical precedent: "Is this a recurrence of a recent incident with a similar root cause?"
Context-Based Hypotheses: For novel issues, the agent uses its understanding of the system architecture to form relevant hypotheses based on the current context (e.g., a recent deployment or a failing dependency).
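
The two hypothesis types can be sketched as follows. This is a simplified illustration under stated assumptions: past incidents are matched by Jaccard similarity over symptom labels, and context-based hypotheses get a fixed placeholder score of 0.4. Both choices, along with every name in the code, are hypothetical and not taken from the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PastIncident:
    root_cause: str
    symptoms: frozenset


@dataclass(frozen=True)
class Hypothesis:
    statement: str
    source: str   # "past_failure" or "context"
    score: float


def jaccard(a: frozenset, b: frozenset) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def generate_hypotheses(symptoms, history, recent_deploys, failing_dependencies,
                        min_similarity=0.5, context_score=0.4):
    hypotheses = []
    # 1. Past failure patterns: does this look like an earlier incident?
    for incident in history:
        similarity = jaccard(symptoms, incident.symptoms)
        if similarity >= min_similarity:
            hypotheses.append(Hypothesis(
                f"Recurrence of: {incident.root_cause}", "past_failure", similarity))
    # 2. Context-based: recent changes and unhealthy dependencies.
    for service in recent_deploys:
        hypotheses.append(Hypothesis(
            f"Regression from recent deploy of {service}", "context", context_score))
    for dependency in failing_dependencies:
        hypotheses.append(Hypothesis(
            f"Failure in dependency {dependency}", "context", context_score))
    # Highest score first; sorted() is stable, so ties keep insertion order.
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)


# Hypothetical inputs.
history = [
    PastIncident("database connection exhaustion",
                 frozenset({"5xx", "db_conn_timeout"})),
    PastIncident("disk full on log volume", frozenset({"disk_full"})),
]
ranked = generate_hypotheses(
    symptoms=frozenset({"5xx", "db_conn_timeout", "latency"}),
    history=history,
    recent_deploys=["checkout-api"],
    failing_dependencies=[],
)
for hypothesis in ranked:
    print(f"{hypothesis.score:.2f} [{hypothesis.source}] {hypothesis.statement}")
```

With the default threshold, a close historical match outranks the context-based candidates, mirroring the "check for precedent first" order described above.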

Using Data for Validation

The bacca agent does not just guess; it acts on its hypotheses. Once hypotheses are generated, it validates them by executing read-only actions in real time. Using the learned investigation steps stored in the knowledge graph, bacca automatically queries logs, fetches metrics, and checks traces.
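
A minimal sketch of this validation loop appears below. Each hypothesis carries a list of read-only checks, and only hypotheses whose checks all return supporting evidence survive. The in-memory `LOG_LINES` and `READ_QPS` values stand in for real log and metric backends, and the 3x spike threshold is an arbitrary assumption chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Evidence:
    supported: bool
    detail: str


# A check is a read-only callable: it fetches data and returns Evidence.
Check = Callable[[], Evidence]


def validate(hypotheses: dict[str, list[Check]]) -> dict[str, list[Evidence]]:
    """Run every check; keep hypotheses whose checks all support them."""
    validated = {}
    for statement, checks in hypotheses.items():
        evidence = [check() for check in checks]
        if evidence and all(item.supported for item in evidence):
            validated[statement] = evidence
    return validated


# Stand-ins for real log and metric backends (hypothetical data).
LOG_LINES = ["resource exhausted exception on table orders", "GET /health 200"]
READ_QPS = [120, 130, 900]


def logs_show_exhaustion() -> Evidence:
    hits = [line for line in LOG_LINES if "resource exhausted" in line]
    return Evidence(bool(hits), f"{len(hits)} matching log line(s)")


def read_qps_spiked() -> Evidence:
    baseline = sum(READ_QPS[:-1]) / len(READ_QPS[:-1])
    latest = READ_QPS[-1]
    return Evidence(latest > 3 * baseline,
                    f"latest={latest}, baseline={baseline:.0f}")


def recent_deploy_exists() -> Evidence:
    return Evidence(False, "no deploys in the last hour")


result = validate({
    "DynamoDB throughput exhaustion": [logs_show_exhaustion, read_qps_spiked],
    "Regression from recent deploy": [recent_deploy_exists],
})
for statement, evidence in result.items():
    print(statement, "|", "; ".join(item.detail for item in evidence))
```

Keeping the evidence next to each surviving hypothesis is what allows the conclusion to be presented with its supporting data rather than as a bare claim.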

This allows the AI SRE to present a validated hypothesis backed by evidence. Instead of saying, "Here is a similar past incident," bacca says, "The root cause is likely DynamoDB throughput exhaustion because I am seeing 'resource exhausted exception' errors in your logs right now, correlated with a spike in read QPS. Here’s the link."

All conclusions are referenced and linked to the specific metrics and logs within the customer's own monitoring dashboards (e.g., Datadog, Grafana), ensuring transparency and trust.

Accelerating Operational Maturity

The knowledge graph accelerates operational maturity. A new engineer often takes three months or more to learn enough institutional knowledge to be effective during a major incident.

BACCA.AI compresses this timeline. The initial learning process builds the knowledge graph in one day, ingesting years of historical context. The bacca agent provides value immediately, starting with the collective expertise of the organization's best engineers. This frees senior engineers from repetitive incident response and shortens the path to a more reliable system.

The Foundation for an AI Partner

The promise of an AI SRE is not just faster searching. It is the creation of an intelligent partner capable of autonomously understanding and diagnosing complex technical issues.

While LLMs are necessary, they are insufficient without context. A standard RAG system is a search index. BACCA.AI's Knowledge Graph is much more. It’s the collective knowledge of how your best engineers operate your software.

Reach out if you’d love to learn more!

Dr. Xinran He, Co-founder & Chief Scientist

BACCA.AI
