When sharks chew on network cables at Google

The complexity behind eradicating test flakiness

Carlos Arguelles
10 min read · Jul 22, 2023

If Medium puts this content behind a paywall, you can also view it here (LinkedIn).

illustration by Bailey McGinn

For the last 25 years, I’ve been fascinated by engineering productivity at large software companies. I have been a Senior Staff Engineer at Google for 3 years, focusing on Developer Infrastructure for Integration Testing. Prior to that, I worked at Amazon for 11 years, where I was a Principal Engineer in the Developer Tools organization. We were responsible for the internal tooling for code repositories, code reviews, local development, software builds, continuous deployment, unit and integration tests, canaries, and load and performance testing. And prior to that, I worked at Microsoft for 11 years, where I was an Engineering Lead in various parts of Office and Windows in the nineties. This hands-on experience at Google, Amazon and Microsoft exposed me to how hundreds of thousands of developers write, review, test and deploy code at large scale. What’s most interesting to me is how, at these large companies, little inefficiencies can aggregate to millions, even hundreds of millions, of dollars in lost productivity, wasted hardware resources, and missed business opportunities. I obsess about how to make engineers’ lives better, remove toil, improve efficiency, and raise the bar in engineering and operational excellence.

One of the most annoying little problems that I’ve faced in the last two and a half decades in the industry is flaky tests: tests that pass most of the time, but inexplicably fail sometimes, seemingly at random. You check in (or attempt to check in) a piece of code, then twenty minutes later get a notification that a test suite failed, and have to drop everything you’re doing to investigate the failure. Eventually, you conclude that it really doesn’t seem to be because of anything you did, so you resort to the oldest trick in the book: re-run the test. This time, it passes, and you think: “ah, a flaky test, again…oh well.” And move on to more important things. This doesn’t happen consistently, and it doesn’t happen with enough frequency to compel you or any of your co-workers to spend a day of your life fixing it, but if you stop and do the math, it probably happens enough times throughout the course of a year to aggregate to a rather depressing amount of wasted hours that you could have spent on something more valuable. Perhaps more importantly, the fact that it happens means that confidence in those tests is slowly eroding. As those tests lose credibility, they lose value: people do not take them seriously anymore and learn to just ignore the failures, which could be masking a real problem with your production environment.

What are the sources of flakiness in tests? I would put them in 3 buckets:

  • Bucket#1: Your test code is flaky
  • Bucket#2: Your production code is flaky
  • Bucket#3: Everything else

For Buckets#1 and #2, I’m referring to things like race conditions, common multithreading problems, and non-determinism. But hey, you own your test code, and you own your production code, so you own your destiny. Go fix them!

What bothers me the most is Bucket#3, because it’s a big catch-all of things that are generally outside your control, which can leave you feeling a bit helpless. These are infrastructure issues you yourself cannot fix; they’re generally not your problem to fix. At Google, though, since I am in the organization that owns the infra that thousands of Googlers use to run their integration tests, they’re very much our problem to fix.

I have to admit that earlier in my career, I sort of naively assumed that most (if not all) test flakiness would be because of Bucket#1 or #2. But I’ve come to the realization that there’s more nuance than that, and Bucket#3 is enormous.

I’ve also realized that there’s a huge amount of complexity in the problem space. There are layers of complexity that peel off like an onion. In this article, I won’t be offering any silver bullet to fix this, but I want to explain why it’s hard and expose some of that complexity.

To understand the sources of flakiness in Bucket#3, let’s take a step back and look at unit tests, and then think about how integration tests differ.

Here’s a pop quiz: What do these two pictures have in common?

What they have in common is: many unit tests, zero integration tests!

So that we are in sync on terminology, here are some definitions from this article:

  • Unit Testing is the process of testing individual components or units of code in isolation.
  • Integration Testing is the process of testing how different units or modules of code interact and work together. And System Integration Testing is a type of integration testing that focuses on testing the entire system or application as a whole.

Unit Tests are pretty simple in terms of infrastructure. There’s some sort of Production Code Under Test, and some sort of Test Code. They both tend to be instantiated and executed on the same machine, in the same process, often even in the same thread (unless you’re specifically testing asynchronous code or highly multithreaded code). This means that there aren’t a lot of moving parts that can go wrong. Flakiness in unit tests tends to be because of Buckets#1 and #2, rarely because of Bucket#3.
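As a minimal sketch of what that looks like (in Python, with a hypothetical Calculator class standing in for the Production Code Under Test):

```python
import unittest

# Hypothetical production code under test; it lives in the same
# process (and here, the same file) as the test code.
class Calculator:
    def add(self, a, b):
        return a + b

class CalculatorTest(unittest.TestCase):
    def test_add(self):
        # No deployment, no network, no shared environment:
        # just a direct, in-process call.
        self.assertEqual(Calculator().add(2, 3), 5)

if __name__ == "__main__":
    unittest.main()
```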

Integration Tests are different because you generally have to deploy that Production Code Under Test to some sort of server. At Google, we love our Three Letter Acronyms, so we call this the “SUT” (System Under Test). That SUT doesn’t necessarily live in just one machine, one process, one thread. It’s a distributed system, so it may need to be deployed and executed across many machines in data centers. And your Test Code will likely live on yet another machine, somewhere else in the cloud.
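By contrast, a sketch of an integration test against a deployed SUT might look something like this; the endpoint, URL and use of the requests library are illustrative assumptions on my part, not Google’s actual infrastructure:

```python
import requests

# Hypothetical address of a deployed System Under Test (SUT).
SUT_URL = "https://sut.test.example.com"

def test_sut_is_healthy():
    # The test code and the SUT run on different machines, so even this
    # single assertion already depends on the network being reliable.
    response = requests.get(f"{SUT_URL}/healthz", timeout=5)
    assert response.status_code == 200
```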

The distributed nature of systems changes everything.

Dependencies are the bane of my existence

For starters, unless your SUT is fully self-sufficient, it’ll have dependencies. Lots of dependencies.

And the problem is worse than you initially thought, because those dependencies will have dependencies, which in turn will have dependencies, which will have dependencies. Welcome to the problem of Transitive Dependencies! Your dependency graph explodes quite quickly.

Integration tests often fail because of flaky dependencies.

Here’s another pop quiz for you. Imagine your integration tests are absolutely flawless, and your production code is absolutely flawless. In this fantasy world, they are both 100% reliable (well done!). This means there should be no flakiness from Bucket#1 or Bucket#2; the only way these tests can be flaky is Bucket#3. You have just three dependencies, which have 99%, 95% and 92% reliability. What’s the overall reliability of your tests?

It’s a simple mathematical formula: the product (Π, product notation) of the reliabilities of every dependency that can cause your test to fail if it’s down or otherwise broken. So at best, the reliability of your tests will be 99% × 95% × 92% ≈ 87%.
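A quick sketch of that back-of-the-envelope math, using the reliability numbers from the example above:

```python
# Best-case test reliability is the product of the reliabilities of
# every dependency that can fail the test.
dependency_reliabilities = [0.99, 0.95, 0.92]

overall = 1.0
for reliability in dependency_reliabilities:
    overall *= reliability

print(f"Best-case test reliability: {overall:.1%}")
# -> Best-case test reliability: 86.5%
# In other words, roughly one run in seven fails through no fault of your own.
```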

Once you realize how that math works, you understand that even a little unreliability in each dependency does compound to a lot of unreliability for you.

What kind of awful dependency only has 92% reliability, you might ask? Well, it’s often the case that your test environment is not allowed to call the production environment of your dependencies, because the call is not idempotent (that is, it changes state in your dependency). In that case, you’re forced to call a test environment of your dependency, which will likely not be nearly as stable as its production environment.

What Happens When a Shark Eats Your Network Cable

Back in the nineties, Deutsch and other fellows at Sun Microsystems introduced the fallacies of distributed computing. The first three are:

  • The network is reliable
  • Latency is zero
  • Bandwidth is infinite

These three in particular matter a great deal to Bucket#3 of test flakiness. In an integration test, all the communication between your tests and your SUT, or between your SUT and its dependencies, is likely happening over the network, sometimes across the country, sometimes across the world. What happens when a shark eats your network cable?

You don’t believe me? Here’s a shark actually eating (or attempting to eat) a network cable:

Any flakiness in the network translates directly into flakiness in your tests. Even beyond flakiness, this communication is slow, so your tests take a lot longer than they should, simply because they’re sitting idle waiting for a network call to finish. Light takes about 16ms to travel as the crow flies from San Francisco to New York, but more realistically you’re looking at 60–70ms communication time between the East Coast and the West Coast. That doesn’t seem like much, but if you have dozens or hundreds of network calls for each test, and thousands of tests, it does add up.
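To put rough numbers on it (an illustrative back-of-the-envelope calculation, not a measurement of any real system):

```python
# How much time a test suite spends just waiting on cross-country calls.
round_trip_ms = 70      # realistic East Coast <-> West Coast round trip
calls_per_test = 100    # "dozens or hundreds" of network calls per test
num_tests = 1_000       # "thousands" of tests

idle_seconds = (round_trip_ms / 1000) * calls_per_test * num_tests
print(f"Idle time waiting on the network: {idle_seconds / 3600:.1f} hours")
# -> Idle time waiting on the network: 1.9 hours
```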

System State is also the bane of my existence

Unless you are lucky enough to be testing a stateless system, your test needs the SUT to hold some sort of state, so that the responses from calls to the SUT are meaningful. For example, if you’re testing an ATM, you might need to ensure that the SUT has some dummy accounts and balances, so that when you call GetBalance(“Carlos”) it returns the number you were expecting. This is the task of seeding data into an SUT, generally into whatever database is part of your architecture.
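A sketch of what seeding might look like for that ATM example; the admin endpoint and the helper function are hypothetical, just to make the idea concrete:

```python
import requests

SUT_URL = "https://sut.test.example.com"  # hypothetical SUT test endpoint

def seed_account(name, balance):
    # Hypothetical admin API that lets tests insert dummy accounts
    # into whatever database sits behind the SUT.
    response = requests.post(f"{SUT_URL}/admin/accounts",
                             json={"name": name, "balance": balance},
                             timeout=5)
    response.raise_for_status()

def test_get_balance_returns_seeded_amount():
    seed_account("Carlos", 100)
    response = requests.get(f"{SUT_URL}/balance",
                            params={"account": "Carlos"}, timeout=5)
    assert response.json()["balance"] == 100
```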

Data seeding is itself a complex problem. Do you seed your data before your test suite starts, or does each test seed the data it needs? Either way, that’s yet another thing that can fail, even if your tests themselves are perfectly reliable. Some tests for some of our more complex system integrations require gigabytes of data seeded into a database. That adds runtime and complexity to test execution… and another failure point.

If your test seeds data before it runs, and changes the state of the SUT, does it clean up after it’s done? Does it restore the system to whatever it was before it ran, so that the next test finds a clean system? Test X could fail because Test X-n sloppily left some temporary state. This technically falls in Bucket#1 (go fix your test code!), but it’s often perceived as being in Bucket#3, simply because state is a feature of your test environment infrastructure.
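One way to keep a test from leaking state into the next one is a setup/teardown pattern that always cleans up, even when the test fails. Here’s a sketch using pytest fixtures, building on the hypothetical seeding API above:

```python
import pytest
import requests

SUT_URL = "https://sut.test.example.com"  # hypothetical SUT test endpoint

@pytest.fixture
def seeded_account():
    # Setup: seed a dummy account before the test runs.
    requests.post(f"{SUT_URL}/admin/accounts",
                  json={"name": "Carlos", "balance": 100}, timeout=5)
    yield "Carlos"
    # Teardown: runs even if the test failed, so the next test
    # finds a clean system rather than leftover state.
    requests.delete(f"{SUT_URL}/admin/accounts/Carlos", timeout=5)

def test_get_balance(seeded_account):
    response = requests.get(f"{SUT_URL}/balance",
                            params={"account": seeded_account}, timeout=5)
    assert response.json()["balance"] == 100
```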

I was introduced to this problem back in the nineties, when I was working at Microsoft. I was the Test Lead responsible for signing off all grammar checkers and spell checkers in Microsoft Office XP. I set up a cron job to run millions of sentences through the daily build of our proofing tools, to make sure they didn’t crash. And every morning when I got to work and looked at the results from the night before, there was a failure somewhere, with a report that said, for example, “Sentence ‘xyz’ crashed the Spanish grammar checker.” I would then manually run sentence ‘xyz’ through the Spanish grammar checker to reproduce the problem before I logged a bug, and it wouldn’t crash. So annoying! What I eventually realized was that perhaps thousands of sentences before sentence ‘xyz’, some other sentence had corrupted a bit of memory somewhere (this was old-school C code, notorious for things like buffer overflows and off-by-one errors on array accesses), but not enough to crash the entire system. Just enough so that much later on, when the grammar checker got to sentence ‘xyz’, it ran into the corrupted memory. The real problem was that the state had been corrupted prior to my test, but it wasn’t immediately obvious (this one technically falls in Bucket#2, because it was flaky production code). I ended up writing a complex debugging system to do binary search over the millions of sentences in my corpora, to identify the smallest combination of sentences that corrupted state. It was a fun computer science problem to tackle, but also a real pain.

Even in a world where every test cleans up perfectly after itself, state also introduces the challenge of test concurrency. Can you run multiple tests at the same time? If so, is there any risk of them clashing? If you have a static, long-lived test environment where multiple test suites can run simultaneously, you could end up with unexpected results that are impossible to reproduce.
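A common mitigation (a sketch of the general idea, not how Google’s infrastructure actually handles it) is to namespace each test’s data with a unique identifier, so concurrent tests never touch the same rows:

```python
import uuid

def unique_account_name(base="Carlos"):
    # Every test gets its own namespaced account, so two suites running
    # concurrently against the same long-lived environment can't clash.
    return f"{base}-{uuid.uuid4().hex[:8]}"

print(unique_account_name())  # e.g. Carlos-3fa91c02
print(unique_account_name())  # a different suffix every time
```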

Is there a solution to all this madness?

There’s no silver bullet for integration test flakiness. It’s plagued our industry for decades, even at companies like Google, Amazon or Microsoft. But there are a number of things you can do. At Google, we are using hermetic, ephemeral test environments to alleviate this problem. That approach has brought its own set of problems we had to address, as well as additional complexity to our infrastructure, but the benefits have been definite. That story is here.

