19 Tweets 5 reads Mar 26, 2022
1) Lies, Damned Lies, and Statistics:
My uneasy relationship with data
2) I went to MIT, became a quant trader, and then a fintech founder.
Outside of work, I'm an Effective Altruist: what matters is maximizing the amount of positive impact you can have.
So you'd think that I'd love data.
3) But the truth is that I think most people misuse, and overuse, statistics.
So much so that many people would be better off ignoring data entirely than doing what they're currently doing.
I think it took me a while to come to terms with this.
4) Some examples:
a) Bob is running a consumer fintech company. He studies the multiples of exchange fees and B2B subscription fees; he finds that they're 20x and 80x, respectively.
So he decides against building a mobile interface, and focuses on being a B2B liquidity source.
5)
b) Alice is at a VC firm. She does a study of the correlation between employee count and market cap for their portfolio companies. Controlling for lots of other factors, it's +75%.
In the next round, she mostly funds companies rapidly expanding their headcount.
6)
c) Zed is trying to decide whether to run a Super Bowl commercial or a Facebook ad.
They look at impressions per dollar, and decide the latter is cheaper; so they forgo the game.
7) Each of these scenarios is a bit different, and I don't necessarily know what the right answer is in all of them.
But in each case, the statistics were (a) reasonable, (b) correct, and (c) net harmful to their decision-making process.
8) The key insight: you're not choosing between looking at statistics or acting randomly.
You have a prior coming in, based on your intuition and critical thinking.
The question is whether data is more or less useful than your priors, and whether you combine them well.
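One way to make this concrete is a toy Gaussian model. This is an assumption for illustration, not something the thread specifies: your prior is off by some typical error, the data is off by its own typical error, and the two errors are independent. All names and numbers below are hypothetical.

```python
import math

def combined_error(prior_sd, data_sd):
    """Typical error after precision-weighting a Gaussian prior and one noisy data point."""
    precision = 1 / prior_sd**2 + 1 / data_sd**2
    return math.sqrt(1 / precision)

def data_weight(prior_sd, data_sd):
    """Share of the combined estimate that should come from the data."""
    return (1 / data_sd**2) / (1 / prior_sd**2 + 1 / data_sd**2)

# Hypothetical numbers: a decent prior and a much noisier statistic.
prior_sd = 1.0
data_sd = 3.0

prior_only_error = prior_sd   # ignore the data entirely
data_only_error = data_sd     # take the data at face value, ignore the prior
best_error = combined_error(prior_sd, data_sd)
weight_on_data = data_weight(prior_sd, data_sd)

print(f"prior only: {prior_only_error:.2f}")
print(f"data only:  {data_only_error:.2f}")
print(f"combined:   {best_error:.2f} (weight on data: {weight_on_data:.0%})")
```

With these made-up numbers, the data deserves about 10% of the weight and improves on the prior only slightly; taking the data at face value gives a typical error three times larger than ignoring it.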
9) In Bob's case, his data is technically correct!
But there are two core issues:
a) his revenue might not be the same in both cases; maybe the mobile app makes more than 4x as much revenue as the B2B product, which would more than offset its lower multiple (80x / 20x = 4).
b) also: valuation isn't all that matters! I'd prefer earnings.
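The arithmetic behind point (a) can be sketched as follows. Only the 20x and 80x multiples come from Bob's study; the revenue figures are hypothetical, picked to show a case where the lower-multiple business is still worth more.

```python
def implied_valuation(revenue, multiple):
    """Valuation implied by applying a revenue multiple."""
    return revenue * multiple

EXCHANGE_FEE_MULTIPLE = 20  # from Bob's study
B2B_MULTIPLE = 80           # from Bob's study

# The mobile app has to out-earn the B2B product by this factor to win on valuation.
breakeven_ratio = B2B_MULTIPLE / EXCHANGE_FEE_MULTIPLE

# Hypothetical revenues, for illustration only.
b2b_revenue = 1_000_000
mobile_revenue = 5_000_000

b2b_value = implied_valuation(b2b_revenue, B2B_MULTIPLE)
mobile_value = implied_valuation(mobile_revenue, EXCHANGE_FEE_MULTIPLE)

print(f"break-even revenue ratio: {breakeven_ratio:.0f}x")
print(f"B2B valuation:    {b2b_value:,}")
print(f"mobile valuation: {mobile_value:,}")
```

The multiples alone say nothing about which side of the break-even ratio Bob's actual businesses would land on.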
10) Bob probably would have been better off just saying "let's build the business that seems the best" and ignoring valuation.
In Alice's case, her data is probably being misinterpreted.
11) Yes, there is a positive correlation between having 10k employees and being successful:
You can only hire 10k employees if you've done well.
So there's a correlation, but Alice probably has the direction of causation backwards: success leads to headcount, not the other way around.
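A small simulation shows how a strong correlation can appear even when headcount has no causal effect on success at all. This is a toy model with made-up distributions, not Alice's data: each company's market cap is drawn first, and headcount is then generated from it.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumption: success comes first. Market cap is drawn independently of headcount.
market_caps = [rng.uniform(1, 100) for _ in range(5000)]

# Successful companies can afford to hire, so headcount follows market cap (plus noise).
headcounts = [10 * cap * rng.uniform(0.5, 1.5) for cap in market_caps]

correlation = pearson(headcounts, market_caps)
print(f"correlation between headcount and market cap: {correlation:.2f}")
```

By construction, hiring more people in this model would not change any company's market cap, yet the correlation is strongly positive. Funding companies because they are expanding headcount would be reading the arrow backwards.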
12) And how about Zed?
Well, what, in the end, is an impression?
One of the important properties of Super Bowl ads: they're talked about again and again and again, in lots of places that are hard to track.
The direct views significantly underestimate their impact.
13) And in this case, a simple gut check might have made Zed realize that _obviously_ Super Bowl ads have a large impact, and a lot of that is the chatter.
So there are lots of ways to use data poorly.
That doesn't make it useless--there are also lots of ways to use it well!
14) But if you do a mediocre job of using data, it just adds noise which distracts you from your baseline reasonable judgement.
There is a fairly high bar that statistical analysis has to overcome to be net useful!
15) And as our world becomes richer and richer in data--and as it becomes more commonplace to use it and cite it--it's getting misused more and more.
(see also slatestarcodex.com)
16) And this is a failure mode that a _lot_ of people fall into.
The vast majority of statistics that I see quoted are useless.
The times when stats are more likely to be useful are when they are answering a very specific, intentional question.
17) If you've thought hard about a decision you have to make and think you really understand the various factors, and know which factor you're uncertain about, then it can be *extremely* helpful to get some data!
18) But aimlessly generating data just distracts.
It's also very similar to a trap that some interview candidates fall into, particularly those with strong math backgrounds:
Given a hard, messy question, they'll try to solve it exactly.
And if they can't, they get flummoxed.
19) The flipside of overfit, irrelevant data: Fermi estimates.
Trying to estimate quantitative factors without knowing all the relevant data is hard, but you can often get reasonable bounds on them.
And those bounds can be extremely useful.
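A minimal sketch of a Fermi-style bound, assuming the quantity is a product of positive factors and that you can put a rough low and high guess on each one. The helper name and the example ranges are placeholders for illustration, not real figures.

```python
import math

def fermi_bounds(factor_ranges):
    """Multiply per-factor (low, high) guesses into an overall low, midpoint, and high.

    Assumes every factor is positive, so the product of the lows is the lower bound.
    The midpoint is the geometric mean of the two bounds.
    """
    low = math.prod(low for low, _ in factor_ranges)
    high = math.prod(high for _, high in factor_ranges)
    midpoint = math.sqrt(low * high)
    return low, midpoint, high

# Placeholder ranges for Zed's question: how many secondhand exposures might an ad get?
factor_ranges = [
    (50e6, 150e6),  # direct viewers (made-up range)
    (0.1, 1.0),     # secondhand exposures per direct viewer (made-up range)
]

low, midpoint, high = fermi_bounds(factor_ranges)
print(f"low: {low:.1e}, midpoint: {midpoint:.1e}, high: {high:.1e}")
```

Even a range this wide can settle a decision if both ends of it point the same way.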
