You got peanut butter in my chocolate!

The story of marrying a Profiler and a Load Generator at Amazon

Carlos Arguelles
Nerd For Tech


For the last 5 years, I’ve lived in a quaint town by the sea called Edmonds. Only twenty minutes north of Seattle, it very much retains its traditional small-town charm. On a sunny day, crabby old men, couples and laughing children alike stroll the sandy beach, enjoying the breathtaking views of the Olympic mountains across the water and the occasional pod of orcas passing by, dodging sailboats, ferries, cruise ships and cargo ships. Mischievous teenagers routinely add soap to the fountain on 5th and Main in the middle of the night, so that entire corner is a bubbly mess in the mornings. Picturesque trains bound for California or Canada cut through town blowing their whistles. In the spring the restaurants downtown ooze onto the sidewalks for a more European feel.

Last month, my wife and I, along with a couple of hundred brave souls, battled a bitterly cold and rainy morning to celebrate the grand opening of our Waterfront Center, a gathering place for all the community. The mayor gave a speech, Native Americans performed a number of traditional songs to bless the grounds, and our local celebrity, Rick Steves, gave a long, impassioned speech about having a space for our community to meet and get to know each other. The building is gorgeous, and the setting is spectacular.

As I listened to Rick, giddy like he was talking about an Italian piazza in one of his TV shows, I thought about how this building had been built.

When I was younger, I sort of assumed that the government was responsible for “building shared things” so things like these were probably envisioned by some nebulous “central authority,” and they “just happened” after that.

Yet this building was entirely envisioned and built by the community, for the community. There wasn’t a singular person in the government that said, “Edmonds should have a community center.” It was dozens of ordinary citizens discussing this over coffee, wine or beer for years, until it gained momentum. Then, they volunteered their time, socializing the idea, convincing others, and slowly refining the concept. Ordinary citizens were inspired and donated their money to fund the construction, and ordinary citizens donated their skills to build what turned out to be a feat of engineering.

How is this related to my life as a software engineer?

I think, too many times, junior engineers assume projects just get handed down from some “centralized authority” whose job is to decide what needs to get done for the team. This overlooks the fact that they too could and should have a seat at the table and a say in what they build. They should be constantly rewriting their job description (and I blogged about that here). So many pieces of software that exist at a company like Amazon, Microsoft or Google were built by ordinary engineers who just had a thought, and had the fortitude to convince others to help make it happen.

My main story today is about how two guys having dinner decided to marry a profiler and a load generator at Amazon, sketched the idea on a napkin, and made it happen two years later.

An idea written on a napkin…

Every year Amazon used to treat its Principal Engineers to 3 days of pampering and networking at a resort nestled in the mountains, about two hours east of our Seattle headquarters. We would pause whatever we were doing, drive over the mountain passes, and spend three days getting to know each other over coffee, whiskey, good food and activities. Today Amazon has thousands of Principals so they can’t do this anymore, but in 2017 we were about 500 or so, so we all knew each other.

One of the nights, I ended up sitting at a dinner table with my buddy Tim, who was a Senior Principal on the amazon.com side of the company. Tim had created a really cool internal technology to do profiling. It would eventually be externalized and become Amazon CodeGuru, but back then it was still internal. A profiler collects runtime performance data from a system under test and provides different visualizations of the data to help you identify how much time is consumed by each line of code (to find the most expensive ones). Tim’s work was certainly on my radar, and mine was on his. I had created the load and performance testing platform that Amazon used to ensure thousands of its services were ready to handle peak traffic (“TPSGenerator”). We were both obsessed with finding inefficiencies and saving the company a lot of money in terms of wasted hardware.

Tim proceeded to explain how one of his customers had used Profiler to identify a bottleneck in an extremely high throughput service. With the help of Profiler’s flame graph, this engineer had realized that an object with a very expensive initialization cost was being created over and over again in a loop. It was wasteful. The object was thread safe, so the engineer pulled the initialization out of the loop and instantiated it once as a static member of the class. The code change was trivial: literally 2 lines of code. But the hardware utilization dashboards for the service showed an astonishing drop after the change. Tim smiled with pride: “Over the course of the next 12 months, the savings for this 2-line change are NN million dollars.”
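To make that kind of fix concrete, here is a minimal Python sketch of the before and after. Everything in it is hypothetical: `ExpensiveFormatter`, `handle_slow` and `Handler` are illustrative names, not the actual service code, and the story doesn't say what language the service was written in. A counter stands in for the expensive initialization.

```python
class ExpensiveFormatter:
    """Hypothetical stand-in for a thread-safe object that is costly to build."""

    instances_created = 0  # counts how many times the costly setup runs

    def __init__(self):
        # Imagine loading config, compiling patterns, opening pools, etc.
        ExpensiveFormatter.instances_created += 1

    def format(self, value):
        return f"<{value}>"


def handle_slow(items):
    # Before: the expensive object is rebuilt on every iteration.
    results = []
    for item in items:
        formatter = ExpensiveFormatter()
        results.append(formatter.format(item))
    return results


class Handler:
    # After: built once, when the class is defined, and shared by every call.
    # Sharing it like this is only safe because the object is thread safe.
    _formatter = ExpensiveFormatter()

    @classmethod
    def handle(cls, items):
        return [cls._formatter.format(item) for item in items]
```

Both versions return the same results; the only difference is how many times the costly constructor runs, which is exactly the kind of thing a flame graph makes visible.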

A sample flamegraph (credit)

The conversation shifted to TPSGenerator, so it was my turn. I brought up my own example of a customer that had used TPSGenerator to realize they could switch the hardware their service used to a more appropriate one. This engineer had spun up many test environments, each with a different hardware configuration, and had run the exact same load tests against each of the environments to identify which one performed best. It was my turn to smile with pride. “Over the course of the next 12 months, the savings for this change are NN million dollars,” I mentioned.
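The comparison that engineer ran can be sketched roughly like this in Python. This is not how TPSGenerator works internally; the function names, the choice of p99 latency as the yardstick, and the idea of an environment as a callable that returns a latency in milliseconds are all assumptions made for illustration.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def compare_environments(environments, workload):
    """Replay the exact same workload against every environment.

    `environments` maps a name to a callable that takes one request and
    returns the observed latency in milliseconds (hypothetical interface).
    Returns the name with the lowest p99 latency, plus all the results.
    """
    results = {}
    for name, send in environments.items():
        latencies = [send(request) for request in workload]
        results[name] = p99(latencies)
    best = min(results, key=results.get)
    return best, results
```

The important part is that the workload is identical across environments, so any difference in the numbers comes from the hardware configuration rather than from the test itself.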

We both sat quietly for a minute, enjoying a couple of beers after dinner and our shared friendship.

“My customers should be using Profiler!” I said.

“My customers should be using TPSGenerator!” he responded.

That conversation reminded me of a 1980s ad from Hershey about Reese’s peanut butter cups. There are two people walking down the street toward each other, one eating a chocolate bar and one eating peanut butter, not paying any attention to where they’re going. They collide. “Hey, you got chocolate in my peanut butter,” one complains. The other person replies, annoyed: “Hey, you got peanut butter in my chocolate!” Then they try the combination. “Mmmmm…,” they smile. The ad concludes with “Two great tastes that taste great together!”

Which led to an epiphany. As a load generator, TPSGenerator was a powerful tool for driving a system under test towards a bottlenecked state, but once you got there, it couldn’t tell you why it was bottlenecked. Profiler was a powerful way to understand why a system was bottlenecked.

They needed to integrate seamlessly.

“Two great tools that work great together.”

Of course we had to call that feature Reeses!

I described the architecture of TPSGenerator, and Tim described the architecture of Profiler. We peppered each other with technical questions. We grabbed a couple more beers and started to sketch the technical design of how they could integrate together on a napkin, since it was the only piece of paper we had. Over the course of the next hour, that napkin became a critical artifact. We had boxes and arrows everywhere, and after a couple of beers, I was feeling pretty good about it. I folded the napkin neatly, put it in my pocket, and called it a night.

A long wait…

The next morning, we all headed back to Seattle. It was only 2 days later that I found that napkin in my pocket, as I was about to wash my jeans, and remembered our long brainstorming session. I brought it up with the team responsible for TPSGenerator (by then, I had taken a step back from day-to-day management of the product, in a deliberate effort to create space for the next generation of leaders to step up and own the product). They were excited by the idea, and it went in the backlog.

But for the next year or two, integrating TPSGenerator and Profiler continued to be just wishful thinking. Both teams were barely staying afloat with their current operational load and p0 features, so it was hard to justify prioritizing a thing that could potentially save millions of dollars, but that nobody was actually asking for, when we had a never-ending list of high priority requests for things that were actually blocking customers.

Occasionally I would run into Tim on campus and the conversation would always lead to us lamenting how we couldn’t in earnest justify our teams spending their cycles on this thing that we both knew could be hugely impactful.

The week that changed everything…

Fast forward 18 months… Most critical-path services on the amazon.com side of the business tend to do load and performance testing at least twice a year, to ensure they can withstand peak traffic and are appropriately scaled: once before Amazon Prime Day (in June) and once before Black Friday and Cyber Monday (in November). To build org-wide buzz and excitement, we had a specific week (“Performance Test Week”) when we asked all teams to do this. This also built some amount of peer pressure: all the cool teams around you were load testing that week so you should too! David Treadwell, SVP of eCommerce, tasked Priyanka with the complex cat-herding logistics, so she and I grabbed coffee to brainstorm ideas. I was in AWS, which is an entirely different business unit from amazon.com, but I was always opinionated about load and performance testing so I made an effort to stick my nose in initiatives like these anywhere in the company.

Priyanka and I chatted constantly. This was not a cheap endeavor. We had the support of an SVP to mandate that hundreds of teams spend a significant amount of engineering time and hardware resources, easily a multimillion-dollar investment. We needed to be thoughtful about providing teams the maximum value for their time.

I wasn’t happy with our plans. They felt a bit boring, a bit of the same. We had done the exact same thing, two years in a row. I wanted to do something different, special, unique.

Hmmm… I was going to have the undivided attention of thousands of engineers and the direct support of a Senior VP…

This was exactly the catalyst I needed to make Reeses happen.

Working backwards: Prime Day was in June, so Performance Test Week was in February, to give teams time to address any issues they might find. It was December now. December is a great month to work on pet projects you’ve had on your back burner all year, because a lot of people take time off so you can’t schedule a lot of business-critical deadlines anyway.

Having a firm deadline energized me and gave me a sense of urgency (I’ve written about deadlines being a positive catalyst in the past here). I had two months to convince and inspire two entirely different teams (TPSGenerator and Profiler) to work together to deliver a feature I had promised to an SVP.

I found that old napkin with the design, wrote a proper technical design document, and got the architecture signed off by both teams. Then, I grabbed some time with Tim’s Technical Lead, Fernando, because I needed to sweet-talk him into creating a couple of new APIs in their RPC service for TPSGenerator to call. He too was excited that this was going to be the main star of Performance Test Week, so he made his changes happen in no time. And then I wrote the code for TPSGenerator to tell Profiler to start and stop profiling a system under test, and to fetch the flamegraph for that timeframe so that I could embed it in the load test report itself. Customers just had to tell TPSGenerator which environment they wanted profiled and they would magically get an interactive flamegraph to explore.
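For the curious, the shape of that integration looks roughly like the Python sketch below. To be clear, this is not the real TPSGenerator or Profiler code: `FakeProfilerClient`, its three methods, `LoadTestReport` and `run_profiled_load_test` are hypothetical names I'm using to illustrate the start, stop and fetch flow, with an in-memory fake standing in for the RPC service.

```python
from dataclasses import dataclass
from typing import Optional


class FakeProfilerClient:
    """Hypothetical in-memory stand-in for a profiler's RPC API."""

    def __init__(self):
        self.calls = []  # records every call, in order

    def start_profiling(self, environment):
        self.calls.append(("start", environment))
        return "session-1"

    def stop_profiling(self, session_id):
        self.calls.append(("stop", session_id))

    def fetch_flamegraph(self, session_id):
        self.calls.append(("fetch", session_id))
        return f"<flamegraph for {session_id}>"


@dataclass
class LoadTestReport:
    environment: str
    requests_sent: int
    flamegraph: Optional[str] = None


def run_profiled_load_test(profiler, environment, send_request, total_requests):
    """Profile `environment` for exactly the duration of the load test."""
    session_id = profiler.start_profiling(environment)
    sent = 0
    try:
        for _ in range(total_requests):
            send_request()
            sent += 1
    finally:
        # Stop profiling even if the load test blows up halfway through.
        profiler.stop_profiling(session_id)
    # Embed the flamegraph for that timeframe in the report itself.
    flamegraph = profiler.fetch_flamegraph(session_id)
    return LoadTestReport(environment, sent, flamegraph)
```

The one design choice worth pointing out is the `try`/`finally`: a load test that drives a system to its breaking point is precisely the kind of run that may fail partway, and you don't want to leave a profiling session running afterwards.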

At the end of the day, we did get the feature working in time, and it was the focus of Performance Test Week. Our engineers learned a whole bunch about their systems by load testing them to their breaking point then deep diving into the corresponding profile to understand the actual bottlenecks. And, for those of you who enjoy a long and dry legal document, Tim, Priyanka, Fernando and I filed a patent with some of our ideas that is available externally.

We filed a patent with some of our ideas!

Some final thoughts

  • Don’t just wait for work to be assigned to you: be opinionated about what your product should do, be an Owner, have a seat at the table in making those things happen. Your level doesn’t matter. Your title doesn’t matter. The amount of time you’ve been at the company doesn’t matter. You can be a Leader regardless of those superficial indicators: being a leader is about behavior. Every single one of the volunteers who built the Edmonds Community Center was a Leader.
  • Leadership involves relentlessly talking to people around you, not just sitting in a dark corner writing code by yourself. Often, you need to align multiple teams behind an initiative, rather than work in isolation.
  • Sometimes you have a great idea but it’s just not the right time, for any number of reasons. It took two years for what Tim and I wrote on a napkin during dinner to become a reality.
  • Have your eyes open to recognize when the right time has come, so that you can seize the opportunity then.


Carlos Arguelles
Nerd For Tech

Hi! I'm a Senior Principal Engineer (L8) at Amazon. In the last 26 years, I've worked at Google and Microsoft as well.