More Fun with Big(ger) Data: MongoDB

Last week, I blogged about my experiences using MySQL with a relatively large set of data: 10 GB and 153 million records. A lot of things worked well, but a few were surprisingly slow. This week, I’ll try to do roughly the same things with MongoDB, a popular “NoSQL” database, to get some experience with it and to test the assertion that NoSQL databases are better suited for Big Data than traditional databases. Since this is my very first experience with MongoDB, everything here comes from the perspective of a complete newbie; there are probably better ways to do many of the things I describe.
One of the nice things about MongoDB is that it is schemaless. One just needs to give names to the columns (now known as fields) in the input data, and it will figure out what type of data is there and store it accordingly. If some fields are missing, that’s OK. So the data import is considerably easier to set up than with MySQL, and it doesn’t generate warnings for rows (now called documents) that have ridiculously long user or domain names.
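To make that concrete, here’s a minimal sketch using the pymongo driver (a throwaway collection and made-up values, not the Adobe data) showing that documents with and without a given field can live side by side in the same collection:

```python
# Minimal sketch with pymongo: a throwaway collection and made-up values.
from pymongo import MongoClient

client = MongoClient()            # assumes a local mongod is running
demo = client["test"]["demo"]     # hypothetical scratch collection

# Documents in the same collection don't need to share a schema:
demo.insert_many([
    {"username": "alice", "domain": "example.com", "hint": "favorite pet"},
    {"username": "bob", "domain": "example.org"},   # no hint field at all
])

# Missing fields are simply absent, and queries can test for that:
print(demo.count_documents({"hint": {"$exists": False}}))   # -> 1
```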
That is not to say that mongoimport was trouble-free. There is a 2 GB limit on database size when running MongoDB in 32-bit mode, but I hadn’t paid much attention to it. When I imported the data, it started out fast but gradually slowed, as if it were adding to an indexed database. After running overnight, by morning it had slowed to about 500 records/second. A friend looked at it with me and suggested I check the log file, which contained tens of gigabytes of error messages about the 32-bit limitation (the answer had been there all along, had I known to look). It would have been preferable for mongoimport to throw an error rather than silently log errors until the disk filled up. So if you’re dealing with any significant amount of data, be sure you’re running the 64-bit version (I had to upgrade my Linux system to do this), and remember to check the log files frequently when using MongoDB.
Once I upgraded to 64-bit Linux (a significant task, but something I needed to do anyway), the import went smoothly, and about three times as fast as MySQL.
Here are the MongoDB timings for the same, or similar, tasks I tried last week with MySQL:
| Task | Command | Time | Result |
|------|---------|------|--------|
| Import | mongoimport -d adobe -c cred --file cred --type tsv --fields id,x,username,domain,pw,hint | 1 hr 1 min | 152,989,513 documents |
| Add index | db.cred.ensureIndex({domain: 1}) | 34 min 29 sec | |
| Count Cisco addresses | db.cred.find({domain: "cisco.com"}).count() | 0.042 sec | 8,552 documents |
| Count domains | db.cred.aggregate([{ $group: { _id: "$domain" } }, { $group: { _id: 1, count: { $sum: 1 } } }]) | 3 min 45 sec | 9,326,393 domains |
| Domain popularity | Various | See below | |
| Count entries without hints | db.cred.find({"hint": {"$exists": false}}).count() | 3 min 39 sec | 109,190,313 documents |
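The domain-count aggregation in the table is easier to read when spread out. Here’s roughly the same pipeline expressed from Python through the pymongo driver; this is a sketch (using the adobe database and cred collection names from the import), not the exact command history from my shell session:

```python
# Roughly the "Count domains" aggregation from the table, via pymongo.
# Assumes the adobe.cred collection created by the import above.
from pymongo import MongoClient

cred = MongoClient()["adobe"]["cred"]

pipeline = [
    {"$group": {"_id": "$domain"}},                 # one document per distinct domain
    {"$group": {"_id": 1, "count": {"$sum": 1}}},   # count those documents
]

for result in cred.aggregate(pipeline):
    print(result["count"])                          # number of distinct domains
```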
One of the striking differences from MySQL is the command structure. While MySQL operations have somewhat of a narrative structure, MongoDB has a much more API-like flavor. The command shell for MongoDB is, in fact, a JavaScript shell. I’m not particularly strong in JavaScript, so it was a bit foreign, but workable, for me.
Several of the commands were, as expected, faster than with MySQL. But commands that needed to “touch” a lot of data and/or indexes thrashed badly: MongoDB memory-maps its entire data set, and the mongod process grew to about 90 GB of virtual memory, so queries that accessed widely dispersed data caused constant page faults.
It was when I tried to determine the most frequently used domains that things really bogged down. I first tried an aggregate operation similar to the domain-count command, but it failed because of a limit on the size of the result an aggregation can return. I next tried MongoDB’s powerful MapReduce capability, but it, too, thrashed the server. I finally wrote a short Python program that I expected to run quickly, since the database was indexed by domain and could hand back the documents (records) in domain order; even that stalled as the database process thrashed every time it went to fetch more data. All of these methods worked well on a subset of 1 million documents, but not at the scale I was attempting, at least with my hardware.
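For the curious, the idea behind that Python program was roughly the following (a simplified sketch, not my actual code): stream the documents in domain order, letting the index do the sorting, tally the length of each run of identical domains, and keep the biggest tallies.

```python
# Simplified sketch of the "walk the domain index in order" approach.
# Assumes the adobe.cred collection and its index on domain; works fine
# on a small subset, but thrashed at the full 153-million-document scale.
import heapq
from pymongo import MongoClient

cred = MongoClient()["adobe"]["cred"]

counts = []                 # (count, domain) pairs
current, run = None, 0

# Project only the domain field and sort on it, so the server can stream
# results straight off the domain index.
for doc in cred.find({}, {"domain": 1, "_id": 0}).sort("domain", 1):
    domain = doc.get("domain")
    if domain == current:
        run += 1
    else:
        if current is not None:
            counts.append((run, current))
        current, run = domain, 1
if current is not None:
    counts.append((run, current))

# The ten most popular domains
for run, domain in heapq.nlargest(10, counts):
    print(run, domain)
```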
So there were things I liked and didn’t like about MongoDB:
I Liked:
- API-style interface that translated easily into running code
- Speed to find records with a limited number of results
- Loose schema: Free format of documents (records)
I Disliked:
- Cumbersome syntax for ad-hoc “what if” queries from the shell
- Speed of processing entire database (due to thrashing)
- Loose coupling between shell and database daemon: terminating the shell wouldn’t necessarily terminate the database operation
MongoDB is well suited for certain database tasks. Just not the ones I expected.