Deduplicating software bugs with Machine Learning at Google
I’ve been working at Google for about three and a half years. Back in 2021, we were facing a little problem. A team we were working with had too many duplicate bugs.
How did they get in this situation?
Well, Google runs automated tests. Thousands of them. Millions of them. Some run automatically every time there are code changes in a developer workspace, some when engineers send a code review, some after they check in, some as code reaches a deployment stage, and some on a cadence, on demand or triggered by an event. Because we run so many tests, we don’t always have the human cycles to analyze the results and manually log bugs, so we’ve created a number of bug auto-filers. An auto-filer analyzes the results of a test run and logs a bug automatically. Some of our test execution frameworks offer auto-filers as a feature.
Bug auto-filers are great in that they reduce a lot of toil and ensure failures are not ignored. But eventually a human does need to be involved and this is where the process can bottleneck. Somebody needs to look at the bugs and decide whether they are a real problem or not. In other words, we created bots to save human toil, that generated human toil at the backend!
Sometimes, a single root cause can trigger many tests to fail. If the auto-filer doesn’t know that the failures share the same underlying cause, each test failure ends up in its own filed bug, and we get duplicated bugs.
Thinking more broadly, we realized bug auto-filers were not the only thing contributing to duplicated bugs. If you were a platform team providing a service consumed by dozens, hundreds, or thousands of services within the company, any engineer from any of those teams could independently log a bug about the same issue.
So we started thinking: are there other teams at Google facing toil like this due to duplicate bugs?
What dawned on us was that while the toil was relatively small for each individual team (mostly just an annoyance), it could aggregate to a very large number at Google scale. An engineer wasting 5 minutes investigating a bug that turned out to be a duplicate, multiplied by thousands of duplicated bugs, added up to years of toil.
But could we quantify it to decide if there was a sound business case to go fix this problem?
Google has democratized access to all kinds of interesting data from within the company, so I was able to query the data from the bug system and gather some statistics for the entire company. Without going into specifics, it was a lot. Software engineers are an extremely expensive, valuable and limited resource, and our day-to-day life has dozens, if not hundreds, of little annoyances, inefficiencies, or silly toil. When you have 100,000+ engineers, a little toil aggregates to a big opportunity. More importantly, little annoyances over a period of time make us mortals let our guard down, creating a huge tech debt backlog or, in the worst case, potential for vulnerabilities.
How could we tackle this opportunity? We could replicate (in software) some of the processes and reasoning that humans actually go through to compare two bugs and decide whether they are duplicates or not. For the most part, humans use readily available cues to increase their confidence that bugs are duplicates: whether the bugs share the same path (or, more broadly, how much of the path they share), whether they have similar titles, whether they have similar stack traces, whether they were filed at around the same time, and so forth. A human looks at a whole bunch of dimensions and, based on the perceived proximity between the two bugs in those dimensions, makes a final decision that aggregates all that confidence.
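To make those cues a bit more concrete, here is a minimal Python sketch of how a pair of bugs might be turned into numeric "proximity" features. Everything in it is an assumption for illustration: the `Bug` fields, the function names, and the use of `difflib` for text similarity are mine, not the features or tooling we actually used.

```python
from dataclasses import dataclass
from datetime import datetime
from difflib import SequenceMatcher


@dataclass
class Bug:
    # Hypothetical, simplified bug record; real bugs carry far more metadata.
    title: str
    component_path: str  # e.g. "Infra/Storage/Cache"
    stack_trace: str
    filed_at: datetime


def shared_path_fraction(a: str, b: str) -> float:
    """Fraction of leading path segments the two component paths share."""
    parts_a, parts_b = a.split("/"), b.split("/")
    shared = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        shared += 1
    return shared / max(len(parts_a), len(parts_b))


def text_similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1], using only the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def pair_features(a: Bug, b: Bug) -> dict[str, float]:
    """One feature per cue a human might check when comparing two bugs."""
    hours_apart = abs((a.filed_at - b.filed_at).total_seconds()) / 3600
    return {
        "shared_path": shared_path_fraction(a.component_path, b.component_path),
        "title_similarity": text_similarity(a.title, b.title),
        "stack_similarity": text_similarity(a.stack_trace, b.stack_trace),
        "hours_apart": hours_apart,
    }
```

A feature dictionary like this is what a model would consume in place of the human's gut feeling: each value captures one dimension of proximity, and the model learns how to weigh them.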
This was perfect for Machine Learning! Have bots deal with the toil created by bots!
There had been prior efforts to apply simpler rule-based systems, but they didn’t manage to build much momentum: it was hard to write rules for all scenarios, and there were too many edge cases. But ML could provide an elegant solution.
At that point, we had identified a problem, we had identified its impact on the company, and we had a general sense that we could apply machine learning to solve it, but we needed to prove that it would work. We pitched it to our Director and he was very kind in giving us a few months of leeway to go investigate it, understand it a bit more, and create a proof of concept.
Our initial thought was that clustering would lend itself well to the problem. That would have required us to invest in techniques such as K-means, or to take semi-supervised approaches, since we had some ground truth information about duplicates. But since we were all fairly new to machine learning, we started with a simpler approach, binary classification, where we compare two bugs and determine whether they are duplicates or not. Given the existing tooling and automation within Google, we were able to get from concept to solution in a matter of weeks.
We came up with a scoring function to rank all the possible duplicates by relevance and pick only the ones above a threshold. Bugs were rich in metadata, with bug title and description being obvious candidates, but to rank the identified bugs we also experimented with other pieces of data, such as the number of duplicates attached to a bug, the number of comments and updates, priority, severity, the number of incorrect duplicate predictions that the bug was part of, the number of affected users, etc. It was relatively simple for us to reason about the binary classification model and improve it to the point where we had > 90% precision.
We knew this approach would not scale across product boundaries: identifying issues with cross-product impact would have meant doing pairwise comparisons across every open issue at Google. This, however, represented a very small fraction of issues, and we were willing to make that compromise for speed. When you start using ML to solve new problems, failing fast is key, and this approach allowed us to prove the idea and iterate faster.
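The "score every pair, rank, keep what clears a threshold" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `rank_candidates` and `token_overlap` are hypothetical names, the threshold values are made up, and the toy scorer is a stand-in for a trained binary classifier, not the model we built.

```python
from typing import Callable, Sequence


def token_overlap(a: str, b: str) -> float:
    """Toy scorer (Jaccard overlap of lowercase tokens), NOT a trained model."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def rank_candidates(
    new_bug: str,
    open_bugs: Sequence[str],
    score: Callable[[str, str], float],
    threshold: float = 0.9,  # assumed default, chosen only for illustration
) -> list[tuple[str, float]]:
    """Score the new bug against every candidate and keep the confident ones.

    `score` stands in for the binary classifier: it returns an estimated
    probability (or any score in [0, 1]) that two bugs are duplicates.
    """
    scored = [(candidate, score(new_bug, candidate)) for candidate in open_bugs]
    kept = [(candidate, s) for candidate, s in scored if s >= threshold]
    # Highest score first, so the most likely duplicate is suggested first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    candidates = [
        "login test fails",
        "checkout page renders blank",
        "login test fails with timeout error",
    ]
    for title, s in rank_candidates(
        "login test fails with timeout", candidates, token_overlap, threshold=0.5
    ):
        print(f"{s:.2f}  {title}")
```

The sketch also shows why crossing product boundaries is expensive: each new bug is scored against every candidate, so comparing all open bugs with each other grows with n(n-1)/2 pairs.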
Once we proved that initial feasibility, it was time to start thinking about the product shape. Prototyping is one thing, but productizing something is an entirely different problem. We needed to think about the customer experience and how Google engineers could actually use it. How could we turn this into an actual product?
Here are two decisions we made early on:
- Automated bugs or manually filed bugs? We decided to be narrowly focused at first, to prove feasibility in one application rather than attempting to boil the ocean. We chose automatically filed bugs from one specific auto-filer. This simplified the world: since each category of auto-filers had a fixed template, we could train the model on data that followed specific patterns, and so our ML model could have a more stable feature set. We analyzed the historical bug data and picked the auto-filer that had contributed to the most duplicate bugs at Google, going after the single biggest business opportunity. We were aware of the risk that our solution would work really well for that one specific case and give us false confidence that it could apply more broadly, but we didn’t want to put the cart before the horse here.
- How many teams to target? In the short term, we started with bugs from just two teams to prove viability. After 3 months, we increased scope to about ten teams within that organization, to prove the models could handle more diverse data. We chose to work deeply with a small number of teams, rather than broadly with a large number of teams, so that we could understand the nuances of the space. We knew that we eventually wanted all of Google on our system, and that we couldn’t keep that level of support and red-carpet onboarding as we scaled. But in engineering productivity I believe you generally need to start with a “pets, not cattle” approach, then develop the processes to scale to a “cattle, not pets” approach when your product maturity justifies it.
There was one additional major decision that required a lot of careful thinking: Do we take action or suggest action? This came with a tradeoff between toil reduction and risk tolerance.
- Take action: Once our system identified a duplicate with enough confidence, it could close the bug automatically. This eliminated toil altogether but introduced significant risk. Since accuracy would never be 100%, we could let legitimate problems escape to production (false positives). Even if it was an actual duplicate, we could also get into recurring loops where a test fails, we decide it’s a duplicate so we ignore it, and then it continues failing and we continue deciding it’s a duplicate, forever (“snoozing” failures). This leads to wasting execution time and hardware resources on a pointless test without people having any visibility into it. Lastly, if our model became widespread across Google and we were making all the decisions, there was eventually not going to be any valid training data left.
- Suggest action: A more conservative approach was to add a comment to the bug without taking action. This eliminated the risks from the prior option (since a human was making the final decision and could override our system’s opinion) but did not eliminate toil (although it reduced it significantly). Humans still looked at a bug and took the final action, but since we had added clear information in the bug itself that highlighted suspected duplicate bugs, they could make a much quicker decision.
- Something in between: A middle ground in terms of both risk tolerance and toil reduction. For example, we could use the machine learning confidence score to decide whether to (a) just add a comment to the bug, (b) add a comment and mark the bug as a duplicate but let a person close it, or (c) close it automatically. We could use other dimensions to influence our thresholds, such as the priority of the bug: require much higher accuracy for P1 bugs than for P2 bugs, etc. Lastly, we could make this entirely customizable, so that teams motivated to reduce human toil could make a deliberate decision to do so at a higher risk of a legitimate issue escaping to production (a “risk dial”).
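The "something in between" option is essentially a small decision table. Here is one way it could be sketched in Python; the threshold numbers, the `RISK_DIAL` structure, and the function name are all hypothetical, picked only to show how a confidence score and a bug priority could map to actions (a), (b) and (c).

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"                      # not confident enough to say anything
    COMMENT = "comment"                # (a) just add a comment
    MARK_DUPLICATE = "mark_duplicate"  # (b) comment and mark, a person closes
    AUTO_CLOSE = "auto_close"          # (c) close automatically


@dataclass(frozen=True)
class Thresholds:
    comment: float
    mark_duplicate: float
    auto_close: float


# Made-up numbers: higher-priority bugs demand more confidence before acting.
# A team could edit this table to turn its own "risk dial" up or down.
RISK_DIAL = {
    "P1": Thresholds(comment=0.80, mark_duplicate=0.95, auto_close=0.99),
    "P2": Thresholds(comment=0.70, mark_duplicate=0.90, auto_close=0.97),
}


def choose_action(confidence: float, priority: str) -> Action:
    """Map a model confidence score and a bug priority to an action."""
    # Unknown priorities fall back to the strictest (P1) thresholds.
    t = RISK_DIAL.get(priority, RISK_DIAL["P1"])
    if confidence >= t.auto_close:
        return Action.AUTO_CLOSE
    if confidence >= t.mark_duplicate:
        return Action.MARK_DUPLICATE
    if confidence >= t.comment:
        return Action.COMMENT
    return Action.NONE
```

With these illustrative numbers, a prediction with confidence 0.98 would auto-close a P2 bug but only mark a P1 bug as a duplicate, leaving the close to a person.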
Being an ML problem, we internally tracked precision and recall. But I wanted a better way to explain business impact to a person who wasn’t necessarily an ML enthusiast. I came up with this little table:
Using TP, TN, FN and FP, rather than precision and recall, gave us a way to focus on how much business impact we were achieving. In terms of toil, if we assumed a human spent 5 minutes of their time investigating a duplicate bug, we could simply multiply TP*5m to figure out how much time we were saving for Google. And in terms of risk, FP gave a clear measure of issues that could escape to production due to our approach.
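As a rough sketch of that arithmetic in Python: the class and method names below are hypothetical, the four outcomes follow the usual confusion-matrix reading (which I am assuming here), and the counts in the usage example are invented, not results from any pilot.

```python
from dataclasses import dataclass


@dataclass
class Outcomes:
    tp: int  # flagged as duplicate, and it was one (toil saved)
    fp: int  # flagged as duplicate, but it was a distinct issue (risk)
    tn: int  # not flagged, and it was not a duplicate
    fn: int  # not flagged, but it was a duplicate (missed saving)

    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    def recall(self) -> float:
        duplicates = self.tp + self.fn
        return self.tp / duplicates if duplicates else 0.0

    def hours_saved(self, minutes_per_duplicate: float = 5.0) -> float:
        """TP * 5m, converted to hours; 5 minutes is the assumed cost per bug."""
        return self.tp * minutes_per_duplicate / 60


if __name__ == "__main__":
    # Invented counts, for illustration only.
    month = Outcomes(tp=90, fp=10, tn=880, fn=20)
    print(f"precision={month.precision():.2f} recall={month.recall():.2f}")
    print(f"hours saved={month.hours_saved():.1f}, potential escapes={month.fp}")
```

The point of reporting `hours_saved` and `fp` side by side is that a non-ML audience can read them directly as time recovered versus issues at risk of escaping.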
Then, we could plot all four metrics in a single stacked chart and see how we were improving (or degrading!) over time, for each team. Here’s an example of one of the pilots we ran.
Ultimately, picking between toil reduction and risk tolerance was a decision for each team to make. So we developed a system where teams could run our bug deduplication engine in a “shadow” mode, where it didn’t take action but kept track of TPs, TNs, FNs and FPs and showed the service owner a dashboard with charts like the one above. That way, they could choose whether they wanted to use our system or not and, if they were using it, whether they wanted it to take action or suggest action.
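A minimal sketch of the bookkeeping behind a shadow mode, in Python: the engine records what it would have done, a human makes the real call, and the two are compared afterwards. The record format and the function name are assumptions for illustration, not a description of our system.

```python
from collections import Counter
from typing import Iterable

# Maps (model_said_duplicate, human_said_duplicate) to an outcome label.
_LABELS = {
    (True, True): "TP",
    (True, False): "FP",
    (False, False): "TN",
    (False, True): "FN",
}


def tally_shadow_results(records: Iterable[tuple[bool, bool]]) -> Counter:
    """Count outcomes from shadow-mode records without taking any action.

    Each record pairs the model's prediction with the decision a human
    eventually made on the same bug.
    """
    return Counter(_LABELS[(predicted, actual)] for predicted, actual in records)
```

Tallies like this, bucketed per team and per time period, are the kind of input a stacked chart of the four metrics could be drawn from.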
If you want to read more, we published a couple of defensive publications: