
I am now doing some data analysis tests, and in the first, really simple one, I have got very strange results.

The idea is the following: from an Internet access log (a collection with one document per access; 90 million documents for these tests) I want to get the number of accesses per domain (what would be a GROUP BY in MySQL), and get the 10 most accessed domains.

The script I have made in JavaScript is really simple:

/* Map: emit a count of 1 for each document's domain */
var m = function () {
    emit(this.domain, 1);
};

/* Reduce: sum the partial counts emitted for a domain.
   Note: iterate over the array values, not the indices. */
var r = function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += Number(values[i]);
    }
    return total;
};

/* Store the per-domain visit counts in the NonFTP_Access_log_domain_visits collection */
res = db.NonFTP_Access_log.mapReduce(m, r, { out: { replace : "NonFTP_Access_log_domain_visits" } } );
db.NonFTP_Access_log_domain_visits.ensureIndex({ "value": 1});
db.NonFTP_Access_log_domain_visits.find({}).sort({ "value":-1 }).limit(10).forEach(printjson);
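
To check that the reduce function behaves correctly, it can be called by hand in the mongo shell (a quick sanity check, not part of the timed test; a reduce function must be associative and idempotent, because MongoDB may re-reduce partial results):

/* Both calls should print 3; the second simulates MongoDB re-reducing
   a partial result, which must yield the same total. */
print(r("example.com", [1, 1, 1]));
print(r("example.com", [r("example.com", [1, 1]), 1]));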

The equivalent in MySQL is:

drop table if exists NonFTP_Access_log_domain_visits;
create table NonFTP_Access_log_domain_visits (
    `domain` varchar(255) NOT NULL,
    `value` int unsigned not null,
    PRIMARY KEY  (`domain`),
    KEY `value_index` (`value`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8
    select domain, count(*) as value from NonFTP_Access_log group by domain;
select * from NonFTP_Access_log_domain_visits order by value desc limit 10;

Well, MongoDB takes 30 hours to get the results and MySQL 20 minutes! After reading a little I have arrived at the conclusion that for data analysis we will have to use Hadoop, as MongoDB is really slow. The answers to questions like this one say that:

  • MongoDB uses only one thread
  • JavaScript is just too slow

What am I doing wrong? Are these results normal? Should I use Hadoop?

We are making this test on the following environment:

  • Operating System: SUSE Linux Enterprise Server 10 (virtual server on Xen)
  • RAM: 10 GB
  • Cores: 32 (AMD Opteron Processor 6128)
Ciges
  • Does it make sense that Hadoop is faster on a single machine? I'd like to know if you configured MySQL for 32 cores, since that's not the default. Also try the InnoDB engine, split your table and do a select union. I doubt that Hadoop will get anywhere near MySQL without a lot of hard work. – Dr. Dama Oct 04 '13 at 18:37
  • It would be nice if you included the version of MongoDB used in your question. That way, if MongoDB is optimized, fixed or re-architected over time, we'll be able to target the question to a specific version. – Jeach Aug 03 '14 at 14:34

3 Answers


I've actually answered this very similar question before. The limitations of MapReduce in MongoDB have been outlined previously: as you mentioned, it is single-threaded, the data has to be converted to JavaScript (SpiderMonkey) and back, etc.

That is why there are other options:

  1. The MongoDB Hadoop Connector (officially supported)
  2. The Aggregation Framework (Requires 2.1+)

As of this writing the 2.2.0 stable release is not yet out, but it is up to RC2, so the release should be imminent. I would recommend giving it a shot as a more meaningful comparison for this type of testing.
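
If you stick with MapReduce in the meantime, one option worth trying is jsMode, which keeps the intermediate results as JavaScript objects instead of converting them to BSON and back between map and reduce. This is only a sketch reusing the m and r functions from the question, and jsMode is limited to roughly 500,000 distinct keys, so it may not apply if you have more unique domains than that:

/* Same map/reduce as in the question, but keeping intermediate results
   in JavaScript to avoid part of the BSON <-> JS conversion overhead. */
db.NonFTP_Access_log.mapReduce(m, r, {
    out: { replace: "NonFTP_Access_log_domain_visits" },
    jsMode: true  // only valid while the number of distinct keys stays under ~500,000
});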

Adam Comerford
  • Hello, and thanks for your quick answer. We are using MongoDB version 2.2.0rc0 (because this version offers the Aggregation Framework). The problem with the Aggregation Framework is that [the group function does not support sharding](http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group). So we will have to use MapReduce (as we don't want to develop queries subject to this limitation). I will read more about the [MongoDB Hadoop Connector](http://docs.mongodb.org/manual/applications/aggregation/). Thanks again. – Ciges Aug 27 '12 at 13:14
  • The aggregation framework not working in a sharded environment sounded odd to me, and then I realized that the reference you gave is not about the aggregation framework; it is about the old group function, which is distinct from the aggregation framework. See: http://docs.mongodb.org/manual/applications/aggregation/#sharded-operation – Adam Comerford Aug 27 '12 at 15:08
  • I also fixed the linked note so that it more accurately describes the limitations of the group() function and clears up any confusion about the aggregation framework functioning in a sharded setup :) – Adam Comerford Aug 27 '12 at 15:16
  • I'm much more of a Cassandra user than a MongoDB one, so pardon the naive question here... Would it help to simplify the SQL query to minimize the time spent getting data into Hadoop, and then use Hadoop's scaling abilities to perform the rest of the query (count(*), group by, etc.)? – Chris Gerken Aug 27 '12 at 20:13

Apparently, grouping with the Aggregation Framework works well! :-)

The following JavaScript code gets the 10 most visited domains, with their visit counts, in 17m17s!

db.NonFTP_Access_log.aggregate(
    /* Group the accesses by domain, counting 1 per document */
    { $group: {
        _id: "$domain",
        visits: { $sum: 1 }
        }},
    /* Sort by visit count, descending, and keep the top 10 */
    { $sort: { visits: -1 } },
    { $limit: 10 }
    ).result.forEach(printjson);

Anyway, I still don't understand why the MapReduce alternative is so slow. I have opened the following question in the MongoDB JIRA.
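
For reference, on later MongoDB versions (2.6+), aggregate() returns a cursor instead of a single result document, so the same pipeline would look roughly like the sketch below (not timed on this data set; allowDiskUse is a 2.6+ option that lets the $sort stage spill to disk):

// MongoDB 2.6+: aggregate() returns a cursor, so iterate it directly.
db.NonFTP_Access_log.aggregate(
    [
        { $group: { _id: "$domain", visits: { $sum: 1 } } },
        { $sort: { visits: -1 } },
        { $limit: 10 }
    ],
    { allowDiskUse: true }  // let $sort use temporary files for large group sets
).forEach(printjson);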

Ciges
  • This is consistent with a lot of the tests I've run: the aggregation framework is (on average) anywhere from 10x to 30x the speed of map/reduce for the same aggregation. Your case of 17 minutes instead of 1800 minutes is the most extreme difference I've seen :) – Asya Kamsky Aug 29 '12 at 01:41

I think your results are quite normal and I will try to justify them:

  1. MySQL uses a binary format which is optimized for processing, while MongoDB works with JSON, so parsing time is added to the processing. I would estimate it at a factor of 10x at least.
  2. JS is indeed much slower than C; I think a factor of at least 10 can be assumed. Together that gives about 100x, similar to what you see: 20 minutes x 100 is 2,000 minutes, or about 33 hours.
  3. Hadoop is also not especially efficient for data processing, but it is capable of using all the cores you have, and that makes the difference. Java also has a JIT that has been developed and optimized for more than 10 years.
  4. I would suggest looking not at MySQL but at TPC-H benchmark Q1, which is pure aggregation; I think systems like VectorWise will show the maximum possible throughput per core.
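
Point 1 is easy to get a rough feel for in the mongo shell itself. The snippet below is a toy measurement of per-document parse cost only (BSON decoding in the server is a different code path, so treat it purely as an illustration):

// Time parsing one million small JSON documents in the shell's JS engine.
var t0 = new Date();
for (var i = 0; i < 1000000; i++) {
    JSON.parse('{"domain": "example.com", "bytes": 1234}');
}
print((new Date() - t0) + " ms to parse 1,000,000 small documents");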

David Gruzman
  • Isn't MongoDB using BSON, which itself is a binary version (of JSON)? If so, then your first point does not hold. – Jeach Aug 03 '14 at 14:39
  • I believe that BSON is still less efficient to process than an RDBMS binary data format (especially with fixed-size types, where a simple placement new can be used to access the data). – David Gruzman Aug 11 '14 at 14:48