Fun with Big(ger) Data
I have been using databases, probably MySQL and PostgreSQL, for a variety of tasks, such as recording data from our home’s solar panels, analyzing data regarding DKIM deployment, and of course as a back-end database for wikis, blogs, and the family calendars. I have been impressed by what I have been able to do easily and quickly, but (with the possible exception of the DKIM data I was analyzing while at Cisco) I haven’t been dealing with very large databases. So when people tell me that NoSQL databases are much faster than what I have been using, I have had to take their word for it.
The breached authentication database from Adobe has been widely available on the Internet, and I was curious how a database of that size performs and if there were any interesting analytics I could extract. So I recently downloaded the database and imported it into MySQL and MongoDB, a popular NoSQL database. I’ll describe my experiences with MySQL in the rest of this blog post, and MongoDB in my next installment.
The database contains each account’s username (usually an email address), encrypted password, and in many cases a password hint. The passwords were stored in an “unsalted” form, meaning that two users with the same password will have the same encrypted password, permitting analysis of how many users have the same passwords (though we don’t know what that password is). I’m not interested in anyone’s password, although guessing common passwords from the hints used by different users has become a popular puzzle game in some groups. I’m not interested in email addresses, either. However, it’s interesting (to me) to see what the predominant email providers are, and to see the distribution of the same passwords are used, and to experiment with using the database to extract some analytics like this from such a large set of data.
The database, often referred to as users.tar.gz, uncompresses to a file just under 10 GB with about 153 million records. I wrote a simple Perl script to convert the file to a more easily importable form (fixing delimiters, removing blank lines, and splitting email addresses into username and domain). Importing this file (on my 3.33 GHz quad-core i5 with 8 GB memory) took just under 3 hours. I got quite a few warnings, primarily due to extremely long usernames and domain names that exceeded the 64-character limits I had set for those fields.
Here are a few sample timings:
|Import||LOAD DATA LOCAL INFILE “cred” INTO TABLE cred;||2 hr 58 min||152,988,966 records|
|Add index||ALTER TABLE cred ADD INDEX(domain);||2 hr 15 min||—|
|Count Cisco addresses||SELECT COUNT(*) FROM cred WHERE domain = “cisco.com”;||0.12 sec||8552 records|
|Count domains||SELECT COUNT(DISTINCT domain) FROM CRED;||9,326,005 domains||47 sec|
|Domain popularity||SELECT domain, count(*) AS count FROM cred GROUP BY domain ORDER BY count DESC LIMIT 500;||3 min 0 sec||Top domain: hotmail.com (32,571,004 records)|
|Create domain table||CREATE TABLE domains SELECT domain, COUNT(*) AS count FROM cred GROUP BY domain;||9 min 14 sec|
|Index domain table||ALTER TABLE domains ADD INDEX(domain);||5 min 40 sec|
|Null out blank hints||UPDATE cred SET hint = NULL WHERE hint = “”;||3 hr 45 min||109,305,580 blank hints|
|Password popularity||SELECT pw, COUNT(pw) as count, COUNT(hint) as hints FROM cred GROUP BY pw HAVING count>1;||See below|
One thing that is immediately obvious from looking at the domains is that the email addresses aren’t verified. Many of the domain names contain illegal characters. But it’s striking to see how many of the domains are misspelled: Hotmail.com had about 32.5 million users, but there were also 62088 hotmai.com, 23000 hotmal.com, 22171 hotmial.com, 19200 hotmil.com, 15200 hotamil.com, and so forth. Quite a study in mistypings!
I’m puzzled about the last query (password popularity). I expected that, since the database is indexed by pw, it would fairly quickly give me a count for each different pw, like the domain popularity query. I tried this with and without limits, ordering of the results, and with and without the HAVING clause eliminating unique values of pw. At first the query had terminated with a warning that it had exceeded the size of the lock table; increasing the size of a buffer pool took care of that problem. But now it has been running, very I/O bound, for 12 hours and I’m not sure why it’s taking that long when compared with the domain query. If anyone has any ideas, please respond in the comments.
Even with 153 million records, this is nowhere near the scale of many of the so-called Big Data databases, and gives me an appreciation for what they’re doing. I’m not a MySQL expert, and expect that there are more efficient ways to do the above. But I’m still impressed by how much you can do with MySQL and reasonably modest hardware.