Bad metrics drive out good
Impact. The magic word Big Tech uses to describe core job expectations, with Google going so far as to rename its rating descriptors around it: “Needs Improvement”, “Meets Expectations” and so on were replaced by “Not Enough Impact”, “Moderate Impact”, “Significant Impact”, “Outstanding Impact” and “Transformational Impact”.
There’s a lot of sense to this. While it’s easier and more deterministic to reward people for activity or tenure, you get better outcomes if you recognise and reward actual business results. Working in an environment which does this well teaches you to understand the problems you’re solving rather than just doing what you’re told, to challenge on the ‘why’, and to prioritise.
That said, one of the potential pitfalls is when “impact” comes to mean “metrics”, and especially when it means “any metric that makes you look halfway good”. When everyone’s direct incentives are based on metrics, you need to be careful about which metrics you reward.
Gresham’s Law: Bad Money Drives Out Good
Gresham’s Law was named for English financier Sir Thomas Gresham* (1519–1579), although the concept appears throughout history and had notably been described by Copernicus a few decades earlier. It states: “bad money drives out good”.
Gresham was specifically concerned with currency debasement. Here, “good” money is made of a precious metal (typically gold or silver) whose value is equivalent to the face value of the coin, while “bad” money is made cheaply from base metals and alloys. Legal tender laws decree that both types of coin are worth the same when spent, and so people hoard the “good” coins while spending the “bad”. Eventually, this drives the “good” ones out of circulation entirely.
“Bad” and “good” are of course subjective terms; what we should really be thinking of is a survival advantage driving circulation. In the above example the advantage is that the debased currency is cheaper to produce; in a hyperinflating economy, the advantage might lie with US dollars rather than the local currency because they’re stable and retain their value. More dangerously, if excessive risk-taking or fraud provides higher perceived returns (in the short term, at least), it will crowd out sound investment strategy (see: everything from Charles Ponzi to cryptocurrencies).
There is, of course, a hand on the scale here. Currency debasement only matters because legal tender laws dictate that the debased currency is worth the same as the “good” one while in circulation; inflation can be managed with sound central banking (keeping reserves, controlling interest rates and such); and the cost of risk-taking or fraud can be increased with regulatory or criminal oversight.
The key insight is to look for those survival advantages with the expectation that they will drive people’s behaviour and hence the economic outcome — then, if that produces “bad” winners, you know where to apply a hand to the scale.
* no relation so far as I know, although the name is why it stuck with me!
Survival Advantages of Bad Metrics
So what does this have to do with metrics? We said earlier that tech companies like to talk about Impact, and the natural impulse is to quantify it. Metrics aren’t free, though: they can be hard to instrument, to analyse and to move. These are all areas where a bad metric (one which doesn’t actually tell you whether you’re succeeding as a business) can creep in with a survival advantage.
Good metrics can be hard to instrument. Compare pageviews on a website with a revenue-generating event like a sale: the pageview can be tracked with client-side tooling like Google Analytics or from the most basic of server logs, whereas the sale will need code written to record it (and will need to capture more information). Or consider counting your unique users: how do you know the person who visited your website from their desktop and their phone is the same person? There are ways to tackle these problems, but it’s a lot cheaper to measure something else.
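To make that concrete, here’s a rough sketch in Python of what recording a sale involves compared with a pageview that falls out of your access logs for free; the function and field names are hypothetical, not from any real analytics SDK:

```python
# A pageview is nearly free to count: every request already lands in an access
# log or a client-side tracker, with no product code needed.
# A sale, by contrast, has to be emitted explicitly at the right point in the
# checkout flow, joined with business data your tracker never sees.
# `emit_event` and the field names here are hypothetical, for illustration only.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class SaleEvent:
    order_id: str
    user_id: str        # needs reliable cross-device identity, a problem in itself
    gross_value: float  # comes from the order system, not the tracker
    currency: str
    timestamp: str

def emit_event(event: SaleEvent) -> None:
    """Stand-in for writing to whatever events pipeline you actually run."""
    print(json.dumps(asdict(event)))

# Called from checkout code once payment is confirmed: a code change, a deploy,
# and a data contract that someone now has to maintain.
emit_event(SaleEvent("ord_123", "user_456", 42.50, "GBP",
                     datetime.now(timezone.utc).isoformat()))
```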
Good metrics can be hard to analyse. Your actual business metrics (revenue, sales, etc.) are influenced by many factors beyond the one you’re changing with a given project, some of which you control and some of which you don’t (for instance, market or economic conditions). You can A/B test, but setting up A/B testing with proper splits is often a significant investment across multiple functions. You also typically need to look at suites of metrics rather than just one: how important is sales volume compared to customer happiness, or retention, or margin? This isn’t straightforward.
Good metrics can make it harder to measure change. Another pain point with A/B testing is having a large enough sample size to draw conclusions, since data points for your key business outcomes are likely at the bottom of a funnel: for example, on an e-commerce website you’ll have a lot more pageviews than sales.
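To put rough numbers on it, here’s a back-of-the-envelope sketch using the standard two-proportion approximation (the baseline rates and lift are made up) of how many visitors per arm you’d need to detect the same relative lift at the top and the bottom of the funnel:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per arm to detect a relative lift in a
    conversion-style rate, via the two-proportion z-test approximation."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

# Hypothetical funnel: 30% of visitors click through to a product page,
# 2% complete a purchase. Detecting the same 5% relative lift needs far
# more traffic for the bottom-of-funnel metric.
print(sample_size_per_arm(baseline=0.30, relative_lift=0.05))  # ≈ 15,000 per arm
print(sample_size_per_arm(baseline=0.02, relative_lift=0.05))  # ≈ 315,000 per arm
```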
Good metrics can be hard to move at all. Keeping with our e-commerce website example, I could get a lot of pageviews by buying traffic with ads, by adding more steps to the checkout flow, or through any number of other tricks; but if I have to increase sales in an ROI-positive way, that’s simply a harder problem to solve.
Addressing all of these by using proxy metrics or less rigorous analysis can be reasonable, if done consciously. The trap is that if you’re trying to reward impact, you’re doing so by asking for metrics, and you treat all metrics as the same when handing out the rewards, then those survival advantages kick in. The easy way to get that reward is to pick a metric you know you can measure and influence, regardless of whether it’s meaningful. You might even pick the metric after the fact (the Texas Sharpshooter Fallacy). You’re directly incentivising bad metrics, and they will drive out the good ones.
Tactics to Disincentivise Bad Metrics
Have a set of blessed metrics for your team. Everyone should be clear on what success looks like from a business perspective. Make sure to include a balance of business indicators like revenue with check metrics covering things like customer satisfaction and product quality; this is to ensure that primary metric growth is sustainable and doesn’t go against your company values. This also means that, as an engineering leader, you have to understand the business well enough to ensure these metrics are appropriate to your strategy and revenue flows — you can’t fob these things off as “business” or “product” problems.
Agree on and set lead metrics where there is a proven or highly plausible correlation with your blessed metrics. Don’t assume that correlation exists: apply scrutiny as leaders and, as a check and balance, ideally get a perspective from a data analyst who is not rewarded for changes in those metrics.
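As a lightweight version of that scrutiny, the analyst can at least check the historical relationship before anyone is rewarded for moving the lead metric; a sketch with made-up weekly data (and no claim to causal rigour) might look like:

```python
import numpy as np

# Hypothetical weekly history: a candidate lead metric (say, activated users)
# and the blessed lagging metric it is supposed to predict (revenue, £k).
activated_users = np.array([120, 135, 150, 160, 155, 170, 185, 200, 210, 230])
revenue_4wks_later = np.array([42, 45, 50, 52, 51, 55, 60, 63, 66, 72])

# Pearson correlation is a crude first check. It says nothing about causation,
# but a weak or negative value is a red flag before the lead metric is adopted.
r = np.corrcoef(activated_users, revenue_4wks_later)[0, 1]
print(f"correlation between lead and lagging metric: {r:.2f}")
```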
Agree on and set confidence targets. You do want rigour, but if you’re making many decisions you only need a majority of them to be directionally correct and you need to make them quickly; maybe demanding 95% confidence (a p-value threshold of 0.05) is overkill, and being 80% confident in any individual decision is fine. Similarly, if it gets you moving faster, it might be OK for the correlation between your leading and lagging metrics to be a bit looser. Don’t wait to act until you have 100% confidence, because then you’ll never get anything done; but agree the targets up front so they don’t become a rationalisation.
Make sure your metrics, hypotheses and confidence targets are all agreed and communicated up front. You want to be clear on what your objectives are and what success looks like, so that people can plan their work around how to move them and avoid post-hoc rationalisation. This goes for how you’ll measure them, too; even if running a full-blown A/B test is infeasible, everyone should declare their hypotheses up front and not be allowed to move the goalposts afterwards. That doesn’t mean you can’t look at metrics other than the ones you’re testing, but use them only for generating your next hypotheses, never for claiming success. Oh, and if you do run an A/B test, try using Bayesian methodology, if only because it’s resistant to peeking and p-hacking.
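As an illustration of what that can look like, here’s a minimal Beta-Binomial sketch; the conversion counts are invented, and the 0.80 threshold stands in for the kind of pre-agreed confidence target described above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Observed results (hypothetical): conversions / visitors per variant.
control_conv, control_n = 180, 10_000
variant_conv, variant_n = 215, 10_000

# A Beta(1, 1) prior plus a binomial likelihood gives a Beta posterior
# over each variant's true conversion rate; sample from both.
control_post = rng.beta(1 + control_conv, 1 + control_n - control_conv, size=100_000)
variant_post = rng.beta(1 + variant_conv, 1 + variant_n - variant_conv, size=100_000)

# Probability the variant genuinely beats control, and the expected lift.
p_variant_better = (variant_post > control_post).mean()
expected_lift = (variant_post / control_post - 1).mean()

print(f"P(variant > control) = {p_variant_better:.2f}")
print(f"expected relative lift = {expected_lift:+.1%}")

# Decision against the confidence target agreed up front (80% here).
if p_variant_better >= 0.80:
    print("meets the agreed confidence target")
```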
Reward both success in moving metrics and transparency when a project fails to move them. You’re never going to succeed with everything, and if you did, that would strongly indicate some sandbagging or gaming of the metrics. Don’t allow post-hoc rationalisation that actually, this other metric is what really matters and the project was a success; but don’t reach for blame, either. If you see people holding themselves accountable for the miss, learning what they can and changing course appropriately, that’s predictive of long-term success, so make sure you reward that too!