Your Next Hire Should Be an AI SRE: Fixing Software Reliability in the Age of AI

Eric Lu

2025

I love how advancements in AI let us build and ship software faster than ever before. But that feeling of progress vanishes the moment a 2 a.m. page goes off. All that innovation feels miles away when you’re wrestling with an outage, relying on manual processes and a handful of exhausted experts.

Building software is the creative part of our job, but operating it is often where the real struggle lies. If you're like me, you've seen engineering teams spend more time putting out fires than shipping new features. Every incident grinds strategic work to a halt, creates operational toil, and leads straight to burnout.

For years, the industry’s solution was to throw more data at us. We got more logs, more metrics, and more dashboards. But this firehose of information doesn't always help. During a crisis, the bottleneck isn't the data; it's the cognitive capacity of the human brain trying to make sense of it all. We don't need another dashboard. We need a new kind of teammate.

The Human Bottleneck

In almost every engineering organization I’ve seen, there are a few "operational gurus." They're the seasoned engineers who have seen it all and know where all the bodies are buried. This reliance on tribal knowledge has always been a risk, but as AI-driven development accelerates the creation of new code and services, it has become an operational crisis waiting to happen.

This creates what I call the "90/10 problem." During an incident, the top 10% of your responders drive 90% of the progress. As one manager told me, the challenge is that it's very hard to bring the other 90% of the group up to speed quickly.

This dependency on a few heroes is a fundamental crack in the system. When critical operational knowledge only exists in old Slack threads, scattered documents, or the minds of your most senior people, it’s out of reach during a crisis. This vulnerability is magnified as more teams adopt a 'You Build It, You Run It' model.

When an incident strikes, you can see the hesitation. Engineers stare at a wall of dashboards, unsure where to even begin. They worry about making things worse, so they freeze, waiting for one of the gurus to point the way. To fix this, we need to shift the operational burden from individual humans to a solution that helps everyone.

How Experts Really Solve Outages

In the chaos of an incident, if you're not an expert on the entire system, the natural first step is often to open dashboards and dig through logs, frantically searching for a signal in the noise. This is the core idea behind traditional AIOps tools, which aim to surface or group data in a more intelligent way for human operators.

The flaw in this approach is that it tries to find the needle in the haystack by sifting through every piece of straw. A seasoned expert almost never works this way.

Think about your most effective engineer. When a system breaks, they don’t start by randomly sifting through terabytes of logs. They operate from the top down. They begin with their deep understanding of the system, their tribal knowledge, and quickly form a few likely hypotheses. "I bet the caching layer is saturated," or, "This feels like that DynamoDB issue from last month." Only then do they use data to prove or disprove their theories.

This hypothesis-driven approach is incredibly effective, but it has a critical vulnerability: it depends entirely on tribal knowledge locked away in one person's head. This is the heart of the 90/10 problem and the reason traditional AIOps tools often miss the mark. These tools automate the bottom-up data sifting that experts rarely start with, and they leave the most important parts of incident response completely untouched: forming a good hypothesis and knowing where to find the right data.

An AI That Thinks Like Your Best Engineer

What if you could build a system based on this expert model? This is the core idea behind what we’re building at BACCA.AI. We started by asking: could we create an AI that learns to think like your best engineer, not just analyze data like a machine?

We took a different path from traditional AIOps. The bacca agent begins by forming hypotheses based on knowledge, rather than just crunching data to find clues.

First, bacca dives into your team's institutional memory. It reads the places where work really gets done, like Slack channels, Confluence pages, and old Jira tickets. From there, it analyzes your service catalogs and distributed traces to build a map of your architecture. When an incident happens, bacca uses both its architectural map and its historical knowledge to form smart hypotheses, just like an expert would, before digging into observability data to validate them.

This "hypothesis-first" model turns the AI from a simple data analysis tool into a true companion, a virtual expert available to everyone on your team.

For example, instead of just flagging a CPU spike, bacca might propose: 'Hypothesis: The recent deployment of the auth-service has caused a connection leak in the user database, similar to the incident from last July. I am checking database connection pool metrics to validate this.' This kind of actionable hypothesis is designed to quickly lead the team to the root cause of the issue.
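
To make the hypothesis-first idea concrete, here is a minimal Python sketch. It is an illustration only, not bacca's implementation: the `PastIncident` and `Hypothesis` types, the `propose_hypotheses` and `validate` functions, the metric name, and the 0.9 threshold are all hypothetical. The point is the order of operations: hypotheses come from history and the dependency map, and telemetry is read only to test them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PastIncident:
    service: str       # where the root cause lived
    symptom: str       # what responders saw
    root_cause: str    # what it turned out to be
    check_metric: str  # metric that confirmed it (hypothetical name)


@dataclass
class Hypothesis:
    statement: str
    check_metric: str


def propose_hypotheses(
    symptom: str,
    recent_deploys: List[str],
    dependencies: Dict[str, List[str]],
    history: List[PastIncident],
) -> List[Hypothesis]:
    """Form hypotheses from knowledge first, before reading any telemetry."""
    hypotheses = []
    for service in recent_deploys:
        # A deploy can break the service itself or anything it depends on.
        related = {service, *dependencies.get(service, [])}
        for incident in history:
            if incident.symptom == symptom and incident.service in related:
                hypotheses.append(Hypothesis(
                    statement=(
                        f"Deploy of {service} may have caused "
                        f"{incident.root_cause} in {incident.service}, "
                        "as in a past incident"
                    ),
                    check_metric=incident.check_metric,
                ))
    return hypotheses


def validate(hypothesis: Hypothesis,
             read_metric: Callable[[str], float],
             threshold: float = 0.9) -> bool:
    """Only now look at data, and only at the metric the hypothesis names."""
    return read_metric(hypothesis.check_metric) >= threshold


# Toy inputs standing in for institutional memory, the architecture map,
# and an observability backend.
history = [PastIncident("user-db", "login errors", "a connection leak",
                        "user-db.pool_utilization")]
dependencies = {"auth-service": ["user-db"]}
metrics = {"user-db.pool_utilization": 0.97}

hypotheses = propose_hypotheses("login errors", ["auth-service"],
                                dependencies, history)
confirmed = [h for h in hypotheses
             if validate(h, lambda name: metrics.get(name, 0.0))]
for h in confirmed:
    print(h.statement)
```

In this toy version a single metric lookup replaces a scan of every dashboard, which is the trade the hypothesis-first model is making.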

Human-Centric AI in Practice

An AI SRE should be built to augment your team and earn their trust. At BACCA.AI, this philosophy translates into practical features designed to fit your workflow and provide concrete value from the start.

It All Comes Down to Trust

We knew that for this to work, bacca had to show its reasoning. Every conclusion it offers is backed by clear evidence, with direct links to the supporting logs or metrics in your tools. Your team can instantly verify its thinking and act with confidence.

A Teammate That's Always Learning

Like any great engineer, bacca is constantly learning. It stays up to date with every change to your system’s architecture and how it's operated. We also built feedback loops directly into the process, because trust is earned through learning. Your team can give bacca a quick thumbs-up 👍 or thumbs-down 👎 on its suggestions, which helps tune its behavior.

More importantly, bacca learns from outcomes. After your team resolves an incident and identifies the root cause, bacca looks back at its own analysis. It compares its proposed causes with the ground truth, and uses that lesson to become more accurate for future incidents.
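
One simple way to picture this kind of outcome-based learning is sketched below in Python. This is an assumption-laden illustration, not a description of how bacca works internally: `reciprocal_rank`, `update_priors`, the `step` and `floor` parameters, and the cause labels are all hypothetical. The sketch scores how high the confirmed root cause sat in the ranked list of proposed causes, then nudges a per-cause prior toward the ground truth.

```python
from typing import Dict, List


def reciprocal_rank(ranked_causes: List[str], confirmed_cause: str) -> float:
    """1.0 if the confirmed cause was ranked first, 0.5 if second, 0.0 if absent."""
    for position, cause in enumerate(ranked_causes, start=1):
        if cause == confirmed_cause:
            return 1.0 / position
    return 0.0


def update_priors(priors: Dict[str, float], confirmed_cause: str,
                  step: float = 0.15, floor: float = 0.01) -> Dict[str, float]:
    """Raise the confirmed cause's weight, lower the others, then renormalize."""
    updated = dict(priors)
    updated.setdefault(confirmed_cause, 0.0)
    for cause in updated:
        delta = step if cause == confirmed_cause else -step
        # The floor keeps rarely seen causes from disappearing entirely.
        updated[cause] = max(floor, updated[cause] + delta)
    total = sum(updated.values())
    return {cause: weight / total for cause, weight in updated.items()}


# Toy example: the analysis ranked "cache saturation" first,
# but the postmortem confirmed "connection leak".
priors = {"cache saturation": 0.5, "connection leak": 0.3, "bad config": 0.2}
ranked = sorted(priors, key=priors.get, reverse=True)
confirmed = "connection leak"

score = reciprocal_rank(ranked, confirmed)
new_priors = update_priors(priors, confirmed)
print(score, max(new_priors, key=new_priors.get))
```

Tracking the score over many incidents gives a rough accuracy trend, while the updated priors change which cause is proposed first the next time a similar incident appears.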

Playbooks That Actually Work

We've all seen static runbooks that are out of date the day they're published. The bacca agent generates playbooks automatically, based on how your team actually solved past incidents. These dynamic guides are synced to your team's GitHub repository, so you get the benefit of AI knowledge capture with the control of a versioned, human-reviewed process.

An Assistant That Fits Your Workflow

Your team has its own way of working, and we designed bacca to fit right in. It’s a Slack-native assistant that joins your existing incident channels. It meets engineers where they already are and adds value to your current process.

Breaking the Cycle of Firefighting

An AI SRE should do more than just help during an incident; it should help prevent the next one. Because bacca participates in every incident, it generates operational reports that spot your top failure patterns. This allows you to make targeted improvements to your stack and reduce your overall incident volume. We've seen this work for organizations like Snap, which cut their incidents by 34%.
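
As a rough sketch of what spotting top failure patterns can mean in practice, the Python below counts root-cause categories across resolved incidents. The record shape, the `top_failure_patterns` function, and the sample data are hypothetical and are not taken from bacca's reports.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_failure_patterns(incidents: List[Dict[str, str]],
                         n: int = 3) -> List[Tuple[str, int, float]]:
    """Return (root cause, count, share of all incidents) for the n most common causes."""
    counts = Counter(incident["root_cause"] for incident in incidents)
    total = sum(counts.values())
    return [(cause, count, count / total)
            for cause, count in counts.most_common(n)]


# Made-up incident records for illustration only.
incidents = [
    {"id": "INC-1", "root_cause": "connection pool exhaustion"},
    {"id": "INC-2", "root_cause": "bad deploy"},
    {"id": "INC-3", "root_cause": "connection pool exhaustion"},
    {"id": "INC-4", "root_cause": "expired certificate"},
    {"id": "INC-5", "root_cause": "connection pool exhaustion"},
]

report = top_failure_patterns(incidents, n=2)
for cause, count, share in report:
    print(f"{cause}: {count} incidents ({share:.0%})")
```

Even a count this simple shows where a targeted fix would remove the most incidents. The hard part is labeling root causes consistently, which is where participating in every incident helps.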

The Future of Reliability is Collaborative

The complexity of our systems will only continue to grow. The path forward is to empower our skilled engineers with tools that scale their expertise. A human-centric AI SRE, built to think like an expert, can help manage this complexity.

By automating knowledge synthesis and providing hypothesis-driven guidance, we can finally bridge the 90/10 gap. We can help every engineer on your team respond as if they were the most seasoned one. Our goal is to make incident response less painful, less about heroics, and more about solid engineering, so your team can sleep better and get back to building what’s next.

Eric Lu, Founder & CEO

BACCA.AI
