How much entropy is in a name?
I have been thinking about authentication, and particularly knowledge-based authentication (KBA), lately. There are many variations of and uses for KBA; one of the most common forms is the challenge or “security questions” that we are often asked to use as backup authentication. Sometimes these questions are chosen by the user, and sometimes by the service doing the authentication.
A classic challenge question dating from well before the use of online services is, “What is your mother’s maiden name?” Since one measure of authentication strength we often use is the entropy (roughly, how hard is it to randomly guess the correct value), I thought it might be interesting, at least as an academic exercise, to figure out the entropy associated with last names in the United States.
The Shannon entropy of a set of values is defined as:

H = −Σᵢ Pᵢ log₂ Pᵢ

where H is the entropy and Pᵢ is the probability of occurrence of the i-th value in the sample set. If the logarithm is taken base 2, the value of H is in bits of entropy.
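For concreteness, the formula can be sketched in a few lines of Python (the function name is mine, not from any particular library):

```python
import math

def shannon_entropy(counts):
    """Shannon entropy, in bits, of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# A fair coin carries exactly 1 bit of entropy:
shannon_entropy([1, 1])  # -> 1.0
```

A uniform distribution over four values gives 2 bits, eight values gives 3 bits, and so on; skewed distributions give less.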
I found a resource from the 2000 US Census that lists the frequency of occurrence for all surnames in the United States with at least 100 members. The list consists of 151,671 names beginning with Smith and ending with a tie between 236 names having 100 members each. The list covers just under 90% of the population, the other 10% presumably belonging to rarer surnames.
The names and probabilities (expressed in occurrences per 100,000 population) together yielded an entropy of just over 12.3 bits, about what I expected. But what about the missing 10%?
Dividing the total number of occurrences (242,121,073) by the cumulative probability (89.754%) yields an estimate of the total sampled population: 269,782,083. So 27,641,010 people bear surnames with fewer than 100 occurrences each. How much entropy do those names represent? Without more data, an upper bound comes from assuming each such name occurs exactly once: 27.6 million names, each with probability 1/269.8 million, which works out to 2.87 bits. A lower bound comes from assuming each such name occurs 99 times: 27.6 million/99 = 279,202 names, each with probability 99/269.8 million, which works out to 2.19 bits. So the total entropy of family names in the United States is about 15 bits.
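The two bounds can be checked with a short Python sketch (the constants are the figures above; the names I've given them are my own):

```python
import math

TOTAL_POP = 269_782_083  # estimated total sampled population
TAIL_POP = 27_641_010    # people whose surnames fall below the 100-occurrence cutoff

def tail_entropy_bits(occurrences_per_name):
    """Entropy contributed by the unlisted tail, assuming every unlisted
    surname occurs exactly `occurrences_per_name` times."""
    p = occurrences_per_name / TOTAL_POP       # probability of any one such name
    n_names = TAIL_POP / occurrences_per_name  # how many such names there are
    return -n_names * p * math.log2(p)

upper = tail_entropy_bits(1)   # ~2.87 bits: every tail name occurs once
lower = tail_entropy_bits(99)  # ~2.19 bits: every tail name occurs 99 times
```

The bounds are close together because the tail's total probability mass (about 10%) is fixed; only the number of names it is spread across varies.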
This still makes some assumptions that are not correct. One issue, if the name is supposedly a given person’s mother’s maiden name, is that surnames correlate with ethnicity, and an attacker who knows something about the target can narrow the guess space accordingly; this correlation is not reflected in the methodology above, so the effective entropy is lower. For comparison, NIST SP 800-63-2 requires at least 14 bits of entropy for pre-registered knowledge tokens at Level of Assurance (LOA) 1, and 20 bits at LOA 2. Using a family name as such a token is therefore marginal at LOA 1 and unacceptable at LOA 2.
But of course the real threat isn’t someone guessing your mother’s maiden name, but rather that an attacker can get the answer from somewhere. Genealogy databases are particularly good places to obtain someone’s mother’s maiden name. There’s also a very good chance that a relative with that last name is a “friend” on Facebook or a follower on Twitter, and in almost all cases it’s very easy to obtain a list of someone’s Facebook friends and Twitter followers. Any of these techniques removes the need for an attacker to make blind guesses, which makes the calculation above, although a fun exercise, moot.