How telemetry helped me understand complex human behavior

Do you know how your Customers use your product?

Carlos Arguelles
10 min read · Apr 14, 2024

How do you measure and articulate the journey that thousands of engineers take to onboard onto your product when your product is a complex suite of dozens of loosely coupled UIs, command line tools, RPC services, daemons and build macros that has evolved organically over two decades at one of the biggest software companies in the world?

In 2021, I was a Senior Staff Engineer at Google, working on the infrastructure that googlers use internally to author and run integration tests for their respective products (I’ve spoken about it in the past in my blog How we use hermetic, ephemeral test environments at Google to reduce flakiness).

When I joined the team, what was immediately clear to me was that it was difficult to onboard onto the suite of products. Google does a lot of things very well, but let me be candid, there isn’t a strong culture around building a great Customer Experience (particularly for internal applications).

I was worried: if I, as the new Technical Lead of the Products, struggled to use my thing, were others struggling too?

I wanted my team to focus on improving the customer experience, particularly around onboarding. This was a new product, and we were hoping to get all googlers to use it, so a poor onboarding experience for 100,000+ highly paid engineers translated to toil and wasted productivity. But in order to convince my team, I needed to gather data to make a compelling case.

In an ideal world, it would be nice for that to look something like “given that each engineer spends X days onboarding, and Y engineers onboard every month, this aggregates to Z years of wasted productivity for the entire company,” which could then support a yearly goal along the lines of “Improve the 90th percentile of the time it takes to onboard onto our suite from A days to B days.”
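
To make that concrete, here's the back-of-the-envelope arithmetic in code form. Every number below is a hypothetical placeholder, purely for illustration:

    # Back-of-the-envelope cost of slow onboarding. All numbers are hypothetical.
    days_per_engineer = 5        # X: days each engineer spends onboarding
    engineers_per_month = 200    # Y: engineers onboarding every month
    working_days_per_year = 250  # rough number of working days in a year

    wasted_days = days_per_engineer * engineers_per_month * 12
    wasted_years = wasted_days / working_days_per_year
    print(f"~{wasted_years:.0f} engineer-years of onboarding time per year")  # ~48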

But we didn’t know what the onboarding time was today. Or how to measure it regularly to observe trends. Or how to even define what “onboarding” actually meant. Our surface had dozens of potential entry points and journeys. Since evolution had not been particularly intentional, some of the pathways didn’t really make a lot of sense… often they were just random things that some engineer (long gone now) had hacked a decade ago. None of us had any clue what common steps people followed. A lot of our tooling didn’t have telemetry. And the tools that did have telemetry emitted the metrics in their own bespoke way, so aggregating a couple dozen disparate sources of data and making sense out of it was non-trivial.

This seemed like an ambiguous, quite possibly unsolvable task.

But I didn’t want to let perfection be the enemy of good. I figured there were steps I could take to start disambiguating the space and at least get an approximation. I’d rather have an imperfect approximation (that could give me some insights into actions to take to make it better) than nothing at all.

To help me focus, I wrote down some desired outcomes:

  • Prove that onboarding time is a problem worth fixing, and convince others of it
  • Identify specific bottlenecks in the journey that we can remove or reduce, and execute on them
  • Measure onboarding time consistently, so we can prove that the changes we believe will reduce it actually do

All the solutions I considered had flaws, but even with flaws they could help me achieve these outcomes. While the dogmatist in me dismissed them with contempt, the pragmatist in me shrugged and kept chugging along, unfazed. Amazon, another company where I spent many years, codified this tension brilliantly in its Leadership Principles, pitting “Insist on the Highest Standards” and “Dive Deep” on one side against “Bias for Action” and “Deliver Results” on the other. You need to be intentional about where on that spectrum you should be in a given situation.

Even though this was a technical problem, ultimately it was a human problem. I needed to understand, track and quantify how humans behaved. I am a software engineer, so I understand computers more than I understand humans, but I did read some of the literature to learn how others had solved this before me and what lessons I could leverage.

Were there similar problems I had seen in the past? Yes, actually. Back in 2013 or so, I had worked at Amazon in a team called Clickstream. Clickstream parses log files from amazon.com, extracts events (such as “customer X viewed page Y at time Z”), and aggregates and connects these events into journeys at massive scale. This allows Amazon’s retail site to make all kinds of near real-time business decisions to adapt its behavior to optimize for what customers are doing.

What I was trying to do was a smaller, simpler version of this.

  1. I should be able to programmatically get data for “Customer Events”, places where an engineer at Google had interacted with any of our surfaces of interest. Something in the form of “<Person> <DateTime> <Event>.”
  2. Could I stitch these individual pieces of data together to form a cohesive journey for a particular customer? Something in the form of “Bob started a code review on day A, tried to run the test locally on day B (it failed), asked a question to our support staff on day C, ran the test locally on day D (it passed), put the code changes up for review on day E, and submitted the code on Day F.” That would give us valuable insights into places where Bob had gotten stuck, so we could improve those.
  3. Could I aggregate these individual journeys at scale for tens of thousands of engineers to quantify percentiles on how much time we’re spending (wasting) in the entire company? Something in the form of “After analyzing the behavior of M engineers over the course of N months, we observed a 90th percentile onboarding time of O days and an abandonment rate of P%.”
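
As a rough sketch, steps 1 and 2 boil down to a tiny event record plus a group-and-sort. The shape of the record and the field names below are my own invention for illustration, not the actual schema we used:

    from collections import defaultdict
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass(frozen=True)
    class CustomerEvent:
        person: str           # anonymized engineer id (hypothetical field names)
        timestamp: datetime
        event: str            # e.g. "code_review_started", "question_asked"

    def stitch_journeys(events):
        """Group '<Person> <DateTime> <Event>' records into one
        chronologically ordered journey per person (step 2)."""
        journeys = defaultdict(list)
        for e in events:
            journeys[e.person].append(e)
        for journey in journeys.values():
            journey.sort(key=lambda e: e.timestamp)
        return dict(journeys)

Step 3 is then just a matter of computing a duration per journey and taking percentiles over the whole population, which is where the definition of “onboarding” comes in.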

So I started thinking about what Customer Events were interesting.

If a tree falls in a forest and no one is around to hear it, does it make a sound? No, it doesn’t, the same way that if an engineer writes a test but does not check it in and make sure it is officially running as part of CI/CD, that test doesn’t exist. I’ve always been opinionated that unless a test is appropriately and consistently gating code changes, it has no impact.

Given that, the most critical Customer Event for my product was when an engineer checked in a working, passing integration test. I figured that I could start with a crude approximation of “Onboarding” as the time between (1) when we first saw the customer interact with any of our surfaces and (2) when they checked in a code change that created or modified an integration test. In order to check in a piece of code, you need to put it out there for code review and convince several of your peers that your work is correct, so that gave me some confidence in the signal.
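
With journeys stitched per person, that crude approximation is just the gap between the first event of any kind and the first “submitted a test change” event, and step 3 falls out of taking percentiles over the population. A sketch, reusing the hypothetical CustomerEvent and stitch_journeys from the earlier snippet (the milestone event name below is also made up):

    import statistics

    SUBMIT_EVENT = "test_change_submitted"  # hypothetical name for the milestone event

    def onboarding_days(journey):
        """Days between the first interaction and the first submitted test change,
        or None if the person never reached that milestone."""
        first = journey[0].timestamp
        for e in journey:
            if e.event == SUBMIT_EVENT:
                return (e.timestamp - first).days
        return None

    def p90_onboarding_days(journeys):
        durations = [d for j in journeys.values()
                     if (d := onboarding_days(j)) is not None]
        return statistics.quantiles(durations, n=10)[-1] if len(durations) >= 2 else None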

I started consuming the data that Google’s code review system emits, looking for code changes that used one of our build macros (you have to use one of our build macros to run our infrastructure). Code reviews provided me with not just one, but several Customer Events:

  • When the code review was started
  • When it was sent for review to peers
  • When it was actually checked in

How much time elapses between them is interesting and gives us insights into the customer journey. Does it take days? Weeks? If it took weeks to create your first code review, we should be worried! Code reviews gave me other interesting metadata too: how many comments were made? Numerous comments could mean a very picky reviewer, a very sloppy reviewee, or a deeper problem with our product usability.

I initially started with just events from code reviews, but as I continued thinking more holistically about the space, I realized that there were dozens of surfaces that our customers touched that gave us interesting data points.

We owned a number of web UIs. I hooked into the telemetry of those apps to emit Customer Events when people visited them. I also hooked into the telemetry emitted by our command line tools, so that when they were executed they would also emit a Customer Event.
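
For the surfaces we owned, the hook could be as thin as a decorator on a command’s entry point. This is a sketch with an invented emit_customer_event() sink, not our actual telemetry API:

    import functools
    import getpass
    from datetime import datetime, timezone

    def emit_customer_event(person, event):
        # Stand-in for the real telemetry sink; here it just prints the event.
        print(f"{person} {datetime.now(timezone.utc).isoformat()} {event}")

    def tracked(event_name):
        """Emit a Customer Event every time the wrapped CLI command runs."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                emit_customer_event(getpass.getuser(), event_name)
                return func(*args, **kwargs)
            return wrapper
        return decorator

    @tracked("cli_run_test")  # hypothetical event name
    def run_test():
        ...  # the actual command logic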

But there were other surfaces beyond just our codebase:

  • Were engineers asking for help? This was a meaningful Customer Event. It was a signal that they were struggling with our product. Google has a StackOverflow-type thing called YAQS, so I could pull the data emitted by it and look for questions asked about our products. YAQS metadata gave me multiple events of interest: (1) when a person asked a question, (2) when others answered the question. If there was a long time between when they asked the question and when they got a good answer, that could be a bottleneck.
  • Were engineers reading our documentation and taking our hands-on course? I had personally pushed the org very hard to create high quality education, and I hoped to prove they were valuable. Maybe I could show that people who took the course onboarded xx% faster than people who didn’t.
  • A code review can have a bug associated with it, for tracking purposes. I hooked into Google’s bug system to extract metadata on the date and time that those bugs had been created, commented on by the person, and resolved. Those were interesting events in that they tracked the official beginning and end of the task.

Little by little, I continued adding Customer Event data sources. I ended up with about 15 different sources of events. Sometimes a tool emitted telemetry, but not in the format I needed, so I wrote data transformations. For the ones that didn’t have telemetry, sometimes I sweet-talked people on the team into adding it, and other times I did it myself. I anonymized the data for privacy reasons.
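
Most of those transformations were small adapters from each source’s bespoke shape into the common event record. A hypothetical example for a YAQS-like question feed (the input field names are invented), including the anonymization step and reusing CustomerEvent from the earlier sketch:

    import hashlib
    from datetime import datetime

    def anonymize(username):
        """One-way hash so journeys can be stitched without storing identities."""
        return hashlib.sha256(username.encode()).hexdigest()[:12]

    def from_question_record(record):
        """Map a raw Q&A record (hypothetical fields) into the common CustomerEvent shape."""
        return CustomerEvent(
            person=anonymize(record["author"]),
            timestamp=datetime.fromisoformat(record["asked_at"]),
            event="question_asked",
        )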

After a month or so, I had a few hundred thousand events emitted by roughly ten thousand engineers, so I figured I should start trying to stitch them together and see what kinds of journeys I found.

Some stories stitched together beautifully, just like I expected: a customer took the course, then started a code review, launched the test (it failed), asked a question to the community, figured out the problem, and then submitted the review. Success!

Other times, the events stitched together, but into a sad story. For example, an engineer started a code review, then started another one, then a third one, asked a question to our community, didn’t get a satisfactory answer, and then we never saw them again. None of the code reviews were submitted. That was a clear case of abandonment. We had lost a potential customer. Maybe they gave up on having integration tests (bad), or maybe they decided to write their own bespoke solution instead of using ours (also bad).

I was aware that there could be a huge amount of noise, because humans do weird things for sometimes inexplicable reasons. So other times, the events didn’t stitch together so cleanly, and I found myself diving into specific journeys, scratching my head, wondering why a person had followed such an odd sequence of steps, and triple-checking the telemetry I had written.

There were lots of heuristics I had to add to the stitching code to reduce that noise. I had to decide how to account for a customer pausing and resuming a journey. For example, if we saw a bunch of events, then nothing for a month, then a bunch more events, it could be the individual had been on vacation, or was pulled into a higher priority task, so I didn’t want to count the time inactive as onboarding time. Sometimes people took the course as part of a list of things to learn when starting in a new team, but without immediate plans to actually apply the knowledge. And I had to decide how to determine who was in-progress and who had abandoned the journey. If we don’t see any customer events for a month, are they still in-progress? Probably not.
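
Two of those heuristics, sketched with made-up thresholds and reusing names from the earlier snippets: long idle gaps don’t count toward onboarding time, and a journey with no submit and no activity for a month is treated as abandoned:

    from datetime import timedelta

    PAUSE_THRESHOLD = timedelta(days=14)    # hypothetical: longer gaps count as a pause
    ABANDON_THRESHOLD = timedelta(days=30)  # hypothetical: longer silence means abandoned

    def active_onboarding_days(journey):
        """Onboarding time excluding long pauses (vacation, reprioritization, ...)."""
        total = timedelta()
        for prev, curr in zip(journey, journey[1:]):
            gap = curr.timestamp - prev.timestamp
            if gap <= PAUSE_THRESHOLD:
                total += gap
        return total.days

    def journey_status(journey, now):
        if any(e.event == SUBMIT_EVENT for e in journey):
            return "onboarded"
        if now - journey[-1].timestamp > ABANDON_THRESHOLD:
            return "abandoned"
        return "in_progress"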

I was also aware that calendar time is not the same as developer time (since people are often multi-tasking). Developer time is a non-trivial thing to calculate, and much more invasive into people’s privacy, but calendar time is much more achievable. And calendar time is still a valid measure because if anything drags on and on, it’s likely to get dropped.

All in all, the data I gathered, and the journeys I stitched with the data, gave me an amazing insight into what customers were doing. I was able to convince others around me that we needed to improve our customer experience significantly, and we shifted resources to address it. We found specific areas to go fix where many customers were getting stuck.

Takeaways

Here are three major things I’d love you to take away from my little story today:

  1. If something looks too ambiguous to get started or too hard to do correctly, don’t let perfection be the enemy of good. Focus on the outcomes that you want to achieve and pragmatically accept imperfections that will let you achieve the outcomes.
  2. Incremental improvement is a powerful tool. I started with just the data from code reviews to get a crude approximation. Then I incrementally added more data sources to improve the quality and fidelity.
  3. Understand your customer journey. If you don’t, telemetry is your friend: emit customer events and stitch them together to learn what steps your customers took to navigate your product. You’ll learn invaluable lessons about what people are actually doing with your product!

Carlos Arguelles

Hi! I'm a Senior Principal Engineer (L8) at Amazon. In the last 26 years, I've worked at Google and Microsoft as well.