Benford’s Law describes a counterintuitive pattern in the leading digits of numbers in many real-world data sets:
A phenomenological law also called the first digit law, first digit phenomenon, or leading digit phenomenon. Benford’s law states that in listings, tables of statistics, etc., the digit 1 tends to occur with probability ~30%, much greater than the expected 11.1% (i.e., one digit out of 9). Benford’s law can be observed, for instance, by examining tables of logarithms and noting that the first pages are much more worn and smudged than later pages (Newcomb 1881). While Benford’s law unquestionably applies to many situations in the real world, a satisfactory explanation has been given only recently through the work of Hill (1996).
If you list all the countries in the world and their populations, 27% of the numbers will start with the digit 1. Only 3% of them will start with the digit 9. Something very similar holds if you look at the heights of the 60 tallest structures in the world — whether you measure in meters or in feet.
This phenomenon helps auditors detect fraud in things like taxes and elections, but it also connects in striking ways to modern physics and mathematics (e.g., power laws in statistical distributions, as well as ergodic theory).
Benford’s Law often strikes people as unintuitive because it seems that every digit should have an equal opportunity to start country populations or heights of skyscrapers, like this:
This egalitarian intuition about leading digits turns out to be misleading. The situation where every digit is equally likely to start numbers is actually the anomalous one.
The nonuniform pattern is named for physicist Frank Benford, who showed in 1938 that it holds in a wide variety of real lists of numbers (river lengths, molecular weights, street addresses, etc.). But it was first discovered in 1881 by Simon Newcomb.
He noticed it while using logarithm books — book-length tables giving the logarithms of various numbers, used at that time by scientists to do arithmetic with large numbers. Newcomb became intrigued by the fact that the pages listing numbers starting with 1 were far more worn than the other pages. This would not happen if every digit occurred equally often as a first digit in the numbers scientists worked with.
Most people who make up lists of numbers produce something close to the ‘intuitive’ uniform distribution rather than the nonuniform one that reality seems to prefer, and that is why Benford’s Law is useful in fraud detection. The leading digits in large spreadsheets of legitimate financial numbers (light green in the figure below) tend to be very close to Benford’s Law (blue), while ones filled in by guessing randomly look far off (orange), and fraudulent numbers (red) tend to look even more bizarre. When tax sleuths notice these tell-tale patterns of numbers with unnatural sources, they call people in for a human audit.
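One way to make that comparison concrete is a chi-square goodness-of-fit statistic against the Benford frequencies. The sketch below is only an illustration, not a description of what auditors actually run. It assumes the standard closed form P(d) = log10(1 + 1/d) for the Benford frequencies, the function names are invented for this example, and the two input lists are synthetic stand-ins rather than real financial data: powers of 2, which are known to track the law closely, and the integers 100 to 999, whose leading digits are exactly uniform.

```python
import math
from collections import Counter

# Assumed closed form for Benford's Law: P(d) = log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(n: int) -> int:
    """Return the first decimal digit of a positive integer."""
    return int(str(n)[0])


def benford_chi_square(values) -> float:
    """Chi-square statistic of observed leading digits against Benford."""
    values = list(values)
    counts = Counter(leading_digit(v) for v in values)
    total = len(values)
    return sum(
        (counts[d] - total * p) ** 2 / (total * p)
        for d, p in BENFORD.items()
    )


# Synthetic stand-ins (hypothetical inputs, not real ledgers).
powers_of_two = [2 ** k for k in range(1000)]  # close to Benford
uniform_leading = list(range(100, 1000))       # each leading digit equally common

chi_powers = benford_chi_square(powers_of_two)
chi_uniform = benford_chi_square(uniform_leading)
print(f"powers of 2: {chi_powers:.2f}")
print(f"100..999:    {chi_uniform:.2f}")
```

With nine digit classes the statistic has 8 degrees of freedom, so values above roughly 15.5 would be flagged at the 5% level. A large value is a reason to look more closely, not proof of fraud.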
What are the fraudsters missing?
To get a sense of why the uniform distribution isn’t so natural, we can reason as follows.
First, observe that if you multiply a number by 2, then very often the first digit of the result will be 1. This always happens if the original number started with 5, 6, 7, 8 or 9. So if you begin with the intuitively appealing uniform distribution of leading digits (every leading digit being equally likely) and then multiply all the numbers by 2, the distribution of leading digits will no longer be uniform: five results in every nine will now start with 1. Weird, eh?
(To describe this phenomenon, I say that multiplication by 2 privileges 1 as a leading digit.)
This already tells you that the uniform distribution of leading digits is not really very stable. It doesn’t like to persist. It is easy to upset by the innocuous operation of multiplying everything by 2, which is difficult to avoid in the wild!
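The effect of doubling is easy to check by brute force. The sketch below uses the integers 100 to 999 as a stand-in for a list with perfectly uniform leading digits (that choice, like the helper name, is an assumption made for the example) and counts leading digits before and after multiplying by 2.

```python
from collections import Counter


def leading_digit(n: int) -> int:
    """Return the first decimal digit of a positive integer."""
    return int(str(n)[0])


numbers = range(100, 1000)  # each leading digit 1..9 appears exactly 100 times

before = Counter(leading_digit(n) for n in numbers)
after = Counter(leading_digit(2 * n) for n in numbers)

print("digit  before  after")
for d in range(1, 10):
    print(f"{d:>5}  {before[d]:>6}  {after[d]:>5}")
```

Before doubling, every digit leads 100 of the 900 numbers. After doubling, 1 leads 500 of them (everything that started with 5 through 9), and each of the digits 2 through 9 leads only 50.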
Second, it turns out that many naturally occurring tables of numbers can be thought of as arising from taking some original list and multiplying each entry by a random number of twos.
In view of this, it is natural that we see lower digits overrepresented, and higher digits underrepresented, in many naturally occurring data sets.
To explore the explanation in more depth, let’s focus on the example of country populations. These tend to grow over time. Think of growing as starting from a random size and being multiplied by 2 a (random) number of times, different for each country (depending on growth rate). Since multiplication by 2 privileges the digit 1 as a leading digit, it’s not surprising that a lot of the final numbers start with ones, far more than start with nines.
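This growth story can be simulated directly. The sketch below is a toy model, not real population data: the starting sizes are drawn uniformly from 1 to 10, the number of doublings is drawn uniformly from 0 to 99, and both ranges are assumptions chosen only to make the effect visible. The last column shows the standard Benford frequencies log10(1 + 1/d) for comparison.

```python
import math
import random
from collections import Counter


def leading_digit(x: float) -> int:
    """Return the first significant decimal digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)


rng = random.Random(0)  # fixed seed so the run is repeatable
N = 20_000              # number of simulated "countries" (assumption)
MAX_DOUBLINGS = 99      # upper bound on doublings (assumption)

sizes = []
for _ in range(N):
    start = rng.uniform(1, 10)                 # random starting size
    doublings = rng.randint(0, MAX_DOUBLINGS)  # random number of twos
    sizes.append(start * 2 ** doublings)

counts = Counter(leading_digit(s) for s in sizes)
freq = {d: counts[d] / N for d in range(1, 10)}

print("digit  simulated  benford")
for d in range(1, 10):
    print(f"{d:>5}  {freq[d]:>9.3f}  {math.log10(1 + 1 / d):>7.3f}")
```

Even though the starting sizes favor no particular pattern, the random doublings are enough to push the leading digits toward the lopsided distribution, with 1 far ahead of 9.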
With building heights, there are two potential explanations. One is the same growth story that we told for countries. Our building ability improves by some random amount every, say, 20 years, and that leads to structures that are some percentage taller than the tallest previous structures. An alternative explanation comes from the fact that we are looking at the largest examples of some phenomenon, the top few “order statistics”. A well-known regularity is that for such statistics, the number of structures exceeding height X is proportional to some power of X. And it is also known that statistics distributed according to a power law follow Benford’s Law.
Maybe the way to think about it is this. To get a list of numbers not to satisfy Benford’s Law you need to build it that way (say, by writing down a list of 6-digit numbers and rolling a 10-sided die to pick all the digits). And then you need to make sure no creature comes along after you are done and multiplies all of them by something a bit unpredictable. But actually, it’s very hard to exclude such a creature, because sometimes it is nature (as with population growth) and sometimes it is another source of unpredictable proportional change. And those idiosyncratic multiplications (or divisions) typically privilege lower initial digits.
This explains the qualitative phenomenon that 1 appears as a leading digit more often than 9 does. But what explains the quantitative Benford’s Law distribution? That is, why do we expect to see that about 30% of numbers start with 1, while 10% of numbers have a leading 4, and only 5% of numbers start with 9? Where do those percentages come from?
We saw above that the uniform distribution of leading digits — an 11% probability for each potential leading digit — is not stable when you multiply all the numbers by 2. If every leading digit starts out being equally represented, that stops being true after you multiply by 2.
It turns out there is a distribution of leading digits that does not get upset after multiplying by 2 in this way — it remains stable. That magical distribution is precisely the Benford’s Law distribution in the first figure. And that’s not just true for multiplication by 2 — the distribution is stable when you multiply by any number between 1 and 10. The Benford’s Law distribution is the only one that has this property, and once you know that, it is easy to work out what it has to be.
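The stability claim can be checked numerically. The sketch below assumes the standard closed form P(d) = log10(1 + 1/d) for the Benford frequencies and uses the fact that a number of the form 10**U, with U uniform between 0 and 1, has Benford-distributed leading digits. It then multiplies the whole sample by a few constants (picked arbitrarily for illustration) and measures how far the leading-digit frequencies drift from the formula.

```python
import math
import random
from collections import Counter

# Assumed closed form for Benford's Law: P(d) = log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x: float) -> int:
    """Return the first significant decimal digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)


def max_gap(values) -> float:
    """Largest absolute gap between observed digit frequencies and Benford."""
    values = list(values)
    counts = Counter(leading_digit(v) for v in values)
    return max(abs(counts[d] / len(values) - p) for d, p in BENFORD.items())


rng = random.Random(1)  # fixed seed so the run is repeatable
sample = [10 ** rng.random() for _ in range(50_000)]  # Benford-distributed

# Multiply everything by a constant and see whether the shape survives.
gaps = {factor: max_gap(factor * x for x in sample) for factor in (1, 2, 3.7, 9)}

for d, p in BENFORD.items():
    print(f"P({d}) = {100 * p:.1f}%")
for factor, gap in gaps.items():
    print(f"times {factor}: max gap {gap:.4f}")
```

The formula gives about 30.1% for a leading 1, 9.7% for a leading 4 and 4.6% for a leading 9, matching the rounded percentages quoted above, and the gaps stay at the level of sampling noise whichever constant is used.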