57

Consider a MySQL product database with 10 million products for an e-commerce website.

I'm trying to set up a classification module to categorize products. I'm using Apache Sqoop to import data from MySQL to Hadoop.

I wanted to use Mahout on top of it as a machine learning framework, using one of its classification algorithms, and then I ran into Spark, which ships with MLlib.

  • So what is the difference between the two frameworks?
  • Mainly, what are the advantages, drawbacks, and limitations of each?
eaykin
eliasah

2 Answers

47

The main difference comes from the underlying frameworks: for Mahout it is Hadoop MapReduce, and for MLlib it is Spark. More specifically, it comes from the difference in per-job overhead.
If your ML algorithm maps to a single MR job, the only real difference is startup overhead, which is dozens of seconds for Hadoop MR and, let's say, 1 second for Spark. For model training, that is not very important.
Things are different if your algorithm maps to many jobs. In that case the same overhead difference applies to every iteration, and it can be a game changer.
Let's assume that we need 100 iterations, each needing 5 seconds of cluster CPU.

  • On Spark: it will take 100*5 + 100*1 = 600 seconds.
  • On Hadoop MR (Mahout): it will take 100*5 + 100*30 = 3500 seconds.
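
The arithmetic above can be written as a tiny cost model. This is only a sketch: the function name `total_seconds` is made up for illustration, and the overhead figures (1 second for Spark, 30 seconds for Hadoop MR) are the rough estimates used in this answer, not measurements.

```python
def total_seconds(iterations, compute_per_iteration, overhead_per_job):
    """Total time when every iteration runs as its own job.

    Each job pays the useful compute time plus a fixed startup overhead.
    """
    return iterations * (compute_per_iteration + overhead_per_job)


# Figures from the example above (rough estimates, not benchmarks).
spark_total = total_seconds(iterations=100, compute_per_iteration=5, overhead_per_job=1)
hadoop_total = total_seconds(iterations=100, compute_per_iteration=5, overhead_per_job=30)

print(spark_total)   # 600
print(hadoop_total)  # 3500
```

With a single job (`iterations=1`) the gap is just the difference in startup overhead, which is why it matters much less for one-shot training than for iterative algorithms.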

At the same time, Hadoop MR is a much more mature framework than Spark, and if you have a lot of data and stability is paramount, I would consider Mahout a serious alternative.

Ismail H
David Gruzman
  • Future releases of Mahout will also use Spark instead of (or in addition to) MapReduce, as announced in April 2014. – herman Jan 22 '15 at 14:54
  • Good to know. But what will the difference from MLlib be then? – David Gruzman Jan 22 '15 at 22:01
  • Then, now that Mahout is based on Spark, what's the difference between Mahout and Spark? – skan Mar 06 '15 at 00:21
  • The jobs provided with Mahout 1.0 still use MapReduce, which takes far longer than the same task on Spark. – shihpeng May 25 '15 at 17:09
  • I guess MLlib is still in its early days compared to Mahout. Mahout has plenty of algorithms, and they would support both Spark and MapReduce. – Rakshith Jul 08 '15 at 07:11
  • I feel like this answer is lacking a main difference, which is that they don't implement the same list of algorithms. I've generally found that Mahout has a wider selection. If there are specific machine learning algorithms you are planning to use, make sure they are available in the framework you choose. – Nadine May 02 '16 at 09:15
42

Warning--major edit:

MLlib is a loose collection of high-level algorithms that runs on Spark. This is what Mahout used to be, except that the Mahout of old ran on Hadoop MapReduce. In 2014 Mahout announced it would no longer accept Hadoop MapReduce code and completely switched new development to Spark (with other engines possibly in the offing, like H2O).

The most significant thing to come out of this is a Scala-based, generalized, distributed, optimized linear algebra engine and environment, including an interactive Scala shell. Perhaps the most important word is "generalized". Since it runs on Spark, anything available in MLlib can be used with the linear algebra engine of Mahout-Spark.

If you need a general engine that will do a lot of what tools like R do but on really big data, look at Mahout. If you need a specific algorithm, look at each to see what they have. For instance, k-means runs in MLlib, but if you need to cluster A'A (a cooccurrence matrix used in recommenders) you'll need them both, because MLlib doesn't have a matrix transpose or A'A (actually Mahout does a thin-optimized A'A, so the transpose is optimized out).
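
To see why A'A can be computed without ever forming the transpose, note that A'A equals the sum, over the rows of A, of the outer product of each row with itself. The following is a minimal single-machine sketch of that identity in plain Python. It is not Mahout's implementation; the function name `gram_matrix` and the toy matrix are made up for illustration.

```python
def gram_matrix(rows, n_cols):
    """Compute A'A by accumulating the outer product of each row with itself.

    Only one row of A is needed at a time, so the transpose is never built.
    """
    result = [[0] * n_cols for _ in range(n_cols)]
    for row in rows:
        # Zero entries contribute nothing, so skip them (helps with sparse rows).
        nonzero = [(j, v) for j, v in enumerate(row) if v != 0]
        for i, vi in nonzero:
            for j, vj in nonzero:
                result[i][j] += vi * vj
    return result


# Hypothetical user-by-item interaction matrix: rows are users, columns are items.
A = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]

cooccurrence = gram_matrix(A, n_cols=3)
for line in cooccurrence:
    print(line)
```

Entry (i, j) of the result counts how many users interacted with both item i and item j, which is the cooccurrence matrix mentioned above. Because each row is processed independently, the per-row contributions could be computed in parallel and summed, which is what makes this formulation attractive on a distributed engine.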

Mahout also includes some innovative recommender building blocks that offer things found in no other OSS.

Mahout still has its older Hadoop algorithms, but as fast compute engines like Spark become the norm, most people will invest there.

pferrel
  • Then, now that Mahout is based on Spark, what's the difference between Mahout and Spark? Will Spark gradually replace Mahout? – skan Mar 06 '15 at 00:22
  • The old Hadoop MapReduce-based Mahout--yes. But I don't think the as-yet-unnamed Mahout-Spark DSL, which is a generalized algebraic solver and environment, is anything like MLlib. Since it runs on Spark and can use anything in MLlib, it doesn't seek to reimplement all that but concentrates on being general, something like R but on huge data sets. – pferrel Mar 07 '15 at 01:45
  • Mahout reinvented itself and, as alluded to by pferrel, has become relevant and interesting again. In some areas it has a more solid linear algebra underpinning than MLlib. – WestCoastProjects Oct 02 '15 at 06:16