The Hidden Cost of Success
We need a new way to operate software
Say goodbye to getting paged at 3 a.m. to fix broken software. It’s time we take back our nights and weekends.

Advancements in cloud computing, artificial intelligence, and machine learning have unlocked tremendous innovation and a new age of software development. Yet the way we operate software is stuck in the past: manual and human-dependent.

Our team is here to change that. Stop worrying about broken software and spend more time innovating.
Building software is hard. Operating it is harder
Many think work ends after code ships. They’re wrong. Engineers spend more time operating software than creating it.

Writing perfect code isn’t the answer; it’s not even realistic.

Modern software systems are increasingly interconnected and complex. They take on a life of their own: an amalgamation of libraries, services, APIs, and other components that must integrate seamlessly to work.

There are plenty of examples where a small misconfiguration or a seemingly innocuous library update caused an entire system to crash, impacting thousands, if not millions, of people (company names withheld to protect the innocent).

Even worse, your system can break due to factors outside your control. Reliance on third-party software speeds up development while adding another layer of operational risk.

We can’t predict when an incident will occur or how often it will happen. Incidents don’t respect boundaries or out-of-office notifications. They are the ultimate party crasher. And when they strike, a human must wake up to fix them.

Sites like downdetector.com show how pervasive, widespread, and damaging incidents can be. Well-known and important companies regularly make the list.

What can’t be seen are all the other, less severe incidents that constantly occur. While they have less impact on end users, they are equally disruptive to those tasked with fixing them.

For many, the business stops when software breaks. The stakes couldn’t be higher.
Humans shouldn’t be the bottleneck
Every modern software organization has a few “operational gurus” who are essential to the business. They are the most senior engineers who have seen a lot, done a lot, and fixed a lot.

Everyone turns to them for help. They know their systems’ capabilities and limitations well. They know the key decisions that were made, many of them years ago. They know where the bodies are buried.

This knowledge is painstakingly acquired and not easily transferred from one person to the next. It’s scattered across time and media: prior decisions are recorded in emails, code comments, chat posts, internal documentation, and more.

Naturally, when an incident occurs, the most experienced people fix it. It’s easier and faster that way.

The problem is that this creates a vicious cycle: senior engineers become overwhelmed and overworked, causing them to leave, while junior engineers don’t get the opportunity to learn.

We must break this cycle and move the operational burden from humans to machines.
AI can be your ally
Recent advances in AI and LLMs allow us to create a new way to operate and fix software. AI can easily learn from a large amount of information, old and new, structured and unstructured (something humans are notoriously bad at).

When AI is targeted at a specific body of work, like operating and fixing software, a new paradigm emerges.

With proper training, AI can:
- monitor software systems at all times
- identify when there are issues
- triage, analyze, and interpret relevant data to find the root cause
- replicate the problem and propose solutions
- deploy changes and confirm the fix
- notify relevant stakeholders
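
To make these steps concrete, here is a minimal Python sketch of how such a loop could be wired together. It is an illustration only, not a description of any real product: the names (`Incident`, `detect`, `triage`, `handle`), the 5% error-rate threshold, the "blame the most recent change" heuristic, and the human-approval flag are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical threshold: flag a service when more than 5% of requests fail.
ERROR_RATE_THRESHOLD = 0.05


@dataclass
class Incident:
    service: str
    error_rate: float
    root_cause: Optional[str] = None
    proposed_fix: Optional[str] = None
    resolved: bool = False
    notified: list[str] = field(default_factory=list)


def detect(metrics: dict[str, float]) -> list[Incident]:
    """Monitor and identify: flag services whose error rate exceeds the threshold."""
    return [
        Incident(service, rate)
        for service, rate in sorted(metrics.items())
        if rate > ERROR_RATE_THRESHOLD
    ]


def triage(incident: Incident, recent_changes: dict[str, str]) -> None:
    """Triage: toy heuristic that blames the most recent change to the service."""
    incident.root_cause = recent_changes.get(incident.service, "unknown")


def propose(incident: Incident) -> None:
    """Propose: suggest a rollback only when a suspect change was found."""
    if incident.root_cause != "unknown":
        incident.proposed_fix = f"roll back: {incident.root_cause}"


def deploy_and_confirm(
    incident: Incident, approved: bool, recheck: Callable[[str], float]
) -> None:
    """Deploy and confirm: act only if approved, then re-measure the error rate."""
    if incident.proposed_fix and approved:
        incident.resolved = recheck(incident.service) <= ERROR_RATE_THRESHOLD


def notify(incident: Incident, stakeholders: list[str]) -> None:
    """Notify: record who was told about the incident."""
    incident.notified = list(stakeholders)


def handle(
    metrics: dict[str, float],
    recent_changes: dict[str, str],
    stakeholders: list[str],
    approved: bool,
    recheck: Callable[[str], float],
) -> list[Incident]:
    """Run every flagged service through the steps in order."""
    incidents = detect(metrics)
    for incident in incidents:
        triage(incident, recent_changes)
        propose(incident)
        deploy_and_confirm(incident, approved, recheck)
        notify(incident, stakeholders)
    return incidents
```

Given a metrics snapshot such as `{"checkout": 0.20, "search": 0.01}`, `handle` flags only `checkout`. Each toy function stands in for far richer logic in a real system; the point is the shape of the loop, with a fix deployed only when it has been approved.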
Let bacca.ai cover your night shift
We’re creating the world’s first AI site reliability engineer to own your on-call shift. Always available and ready to respond at a moment’s notice. The AI engineer is designed to identify, triage, and resolve software incidents. All on its own.

Today, BACCA behaves like any other team member, and you’ll onboard it like anyone else. It’s trained to look at prior incidents and all remediation actions that were taken. BACCA will review everything you grant access to (e.g., chat and email history, internal documentation, code bases, …) and develop an understanding of how your system is designed to operate.

When an incident occurs, BACCA will triage the issue like an experienced engineer would. It will determine severity based on the size of the business impact, form an investigation plan based on prior experience, investigate the root cause by reviewing logs, metrics, events, and traces, reason through the problem, and propose solutions based on its understanding of your system.

As BACCA builds trust with you over time, it will earn the right to perform mitigation actions and fix the problem on its own.
Nobody had our backs. BACCA has yours
My two cofounders and I are devoted to this problem because we’ve experienced it thousands of times. For decades, we operated infrastructure for the world’s largest social platforms and high-stakes financial institutions (Snapchat, TikTok, Google, and Bloomberg).

We were the “operational gurus” who knew how the system worked. We were called in to fix software at all hours of the day, losing sleep and precious time with our families. This prevented us from working on more valuable, rewarding, and interesting things.

While we’ll never get back the time we lost, we don’t want the same to happen to you. The world needs a new way to operate software.

Stop worrying about broken software, and sleep well knowing that BACCA has your back.
Eric Lu, Founder & CEO
BACCA.AI