December 4, 2013 / Jim Fenton

Fun with Big(ger) Data

I have been using databases, probably MySQL and PostgreSQL, for a variety of tasks, such as recording data from our home’s solar panels, analyzing data regarding DKIM deployment, and of course as a back-end database for wikis, blogs, and the family calendars. I have been impressed by what I have been able to do easily and quickly, but (with the possible exception of the DKIM data I was analyzing while at Cisco) I haven’t been dealing with very large databases. So when people tell me that NoSQL databases are much faster than what I have been using, I have had to take their word for it.

The breached authentication database from Adobe has been widely available on the Internet, and I was curious how a database of that size performs and if there were any interesting analytics I could extract. So I recently downloaded the database and imported it into MySQL and MongoDB, a popular NoSQL database. I’ll describe my experiences with MySQL in the rest of this blog post, and MongoDB in my next installment.

The database contains each account’s username (usually an email address), encrypted password, and in many cases a password hint. The passwords were stored in an “unsalted” form, meaning that two users with the same password will have the same encrypted password, permitting analysis of how many users have the same passwords (though we don’t know what that password is). I’m not interested in anyone’s password, although guessing common passwords from the hints used by different users has become a popular puzzle game in some groups. I’m not interested in email addresses, either. However, it’s interesting (to me) to see what the predominant email providers are, and to see the distribution of the same passwords are used, and to experiment with using the database to extract some analytics like this from such a large set of data.

The database, often referred to as users.tar.gz, uncompresses to a file just under 10 GB with about 153 million records. I wrote a simple Perl script to convert the file to a more easily importable form (fixing delimiters, removing blank lines, and splitting email addresses into username and domain). Importing this file (on my 3.33 GHz quad-core i5 with 8 GB memory) took just under 3 hours. I got quite a few warnings, primarily due to extremely long usernames and domain names that exceeded the 64-character limits I had set for those fields.

Here are a few sample timings:

Task	Command	Time	Result
Import	LOAD DATA LOCAL INFILE “cred” INTO TABLE cred;	2 hr 58 min	152,988,966 records
Add index	ALTER TABLE cred ADD INDEX(domain);	2 hr 15 min	—
Count Cisco addresses	SELECT COUNT(*) FROM cred WHERE domain = “cisco.com”;	0.12 sec	8552 records
Count domains	SELECT COUNT(DISTINCT domain) FROM CRED;	9,326,005 domains	47 sec
Domain popularity	SELECT domain, count(*) AS count FROM cred GROUP BY domain ORDER BY count DESC LIMIT 500;	3 min 0 sec	Top domain: hotmail.com (32,571,004 records)
Create domain table	CREATE TABLE domains SELECT domain, COUNT(*) AS count FROM cred GROUP BY domain;	9 min 14 sec
Index domain table	ALTER TABLE domains ADD INDEX(domain);	5 min 40 sec
Null out blank hints	UPDATE cred SET hint = NULL WHERE hint = “”;	3 hr 45 min	109,305,580 blank hints
Password popularity	SELECT pw, COUNT(pw) as count, COUNT(hint) as hints FROM cred GROUP BY pw HAVING count>1;	See below

One thing that is immediately obvious from looking at the domains is that the email addresses aren’t verified. Many of the domain names contain illegal characters. But it’s striking to see how many of the domains are misspelled: Hotmail.com had about 32.5 million users, but there were also 62088 hotmai.com, 23000 hotmal.com, 22171 hotmial.com, 19200 hotmil.com, 15200 hotamil.com, and so forth. Quite a study in mistypings!

I’m puzzled about the last query (password popularity). I expected that, since the database is indexed by pw, it would fairly quickly give me a count for each different pw, like the domain popularity query. I tried this with and without limits, ordering of the results, and with and without the HAVING clause eliminating unique values of pw. At first the query had terminated with a warning that it had exceeded the size of the lock table; increasing the size of a buffer pool took care of that problem. But now it has been running, very I/O bound, for 12 hours and I’m not sure why it’s taking that long when compared with the domain query. If anyone has any ideas, please respond in the comments.

Even with 153 million records, this is nowhere near the scale of many of the so-called Big Data databases, and gives me an appreciation for what they’re doing. I’m not a MySQL expert, and expect that there are more efficient ways to do the above. But I’m still impressed by how much you can do with MySQL and reasonably modest hardware.

← The Twitter Echo Chamber

More Fun with Big(ger) Data: MongoDB →

3 Comments

Trackbacks

More Fun with Big(ger) Data: MongoDB | Altmode

Fun with Big(ger) Data

Like this:

Related

3 Comments

Trackbacks

Leave a ReplyCancel reply

Search

Recent Posts

Travel Journals

Syndication

Fun with Big(ger) Data

Share this:

Like this:

Related

3 Comments

Trackbacks

Leave a ReplyCancel reply

Search

Recent Posts

Travel Journals

Tags

Syndication

Discover more from Altmode