6

Cloudera and Hortonworks use HDFS, one of the core components of Apache Hadoop. MapR uses its own concept and implementation: instead of HDFS, you use the native file system directly. The MapR website lists many advantages of this approach.

What are the disadvantages of this approach?

Kai Wähner

4 Answers

5

I would define MapR a bit differently. It does not use HDFS; instead it provides its own distributed file system with an NFS interface, which, like HDFS, is built on top of the local file system.
The main differences come from the fact that HDFS is not POSIX-compliant, and from other design choices.
1. HDFS is not mutable, while MapR is. This can be viewed as an advantage, especially if you need it.
2. HDFS is not mountable, while MapR is, so you can use any existing tools that work with a Linux file system.
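
To make the two points above concrete, here is a minimal sketch of an in-place update done with ordinary file I/O. It assumes the cluster file system is mounted somewhere on the local machine; the mount point below is a temporary directory standing in for that hypothetical mount, and `overwrite_in_place` is an illustrative helper, not part of any MapR or Hadoop API. HDFS files are write-once (with append), so this kind of random overwrite is exactly what it does not offer. How closely a real NFS mount matches local-disk behaviour (caching, consistency) is not something this sketch shows.

```python
import os
import tempfile


def overwrite_in_place(path: str, offset: int, data: bytes) -> None:
    """Overwrite len(data) bytes at `offset`, leaving the rest of the file untouched."""
    with open(path, "r+b") as f:  # "r+b": read/write without truncating
        f.seek(offset)
        f.write(data)


# Assumption: stand-in for a mounted cluster path; a temp dir keeps this runnable anywhere.
mount_point = tempfile.mkdtemp()
path = os.path.join(mount_point, "events.log")

with open(path, "wb") as f:
    f.write(b"status=PENDING;id=42\n")

# Replace the 7-byte value "PENDING" with a padded 7-byte "DONE   ".
overwrite_in_place(path, len(b"status="), b"DONE   ")

with open(path, "rb") as f:
    result = f.read()

print(result)
```

The same code path is what standard Linux tools rely on, which is why a mountable, mutable file system can be used with them unchanged.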

Unrelated to POSIX: MapR has a small block size and no single point of failure (unlike the HDFS NameNode). MapR also has multi-site replication.

Let's look at the dark side as well:
a) Having mutable data (instead of immutable HDFS) makes the system more complicated.
b) It is not known (at least to me) to work on huge clusters. (I have heard of hundreds of nodes.)
c) From an architecture point of view (given the small blocks), I am not sure how good the data locality can be.

David Gruzman
  • 3
    Regarding David's dark-side comments: (a) mutability makes things much simpler for the user; (b) it works on large clusters, see the recent world sort record; (c) small blocks aren't the issue for locality. MapR separates the concepts of disk unit (small blocks), cluster striping unit (like a Hadoop block, 100s of MB) and scaling constant (30GB instead of Hadoop's default 64MB). – Ted Dunning Mar 02 '13 at 21:01
  • Ted - please provide a link to the sort record – David Gruzman Mar 02 '13 at 21:38
  • Dave, Srivas already provided the link. See http://www.mapr.com/blog/hadoop-minutesort-record – Ted Dunning Apr 14 '13 at 22:37
0

David, the minute-sort record was set by MapR on the Google Compute Engine in the Google Cloud on 1/30/2013. See our blog at http://www.mapr.com/blog/hadoop-minutesort-record. The record was set on a 2103-node cluster and 1.5 TB of data was sorted in 59 seconds.

Also see an earlier blog about the Terasort record by MapR sorting 1 TB of data in 54 seconds. It was set on a 1003-node cluster on the Google Compute Engine in the Google Cloud. The blog is posted at http://www.mapr.com/blog/record-setting-hadoop-in-the-cloud.
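
For a rough sense of scale, the figures quoted above can be turned into average per-node throughput. This is only back-of-the-envelope arithmetic on the numbers in this answer; it assumes decimal terabytes (10^12 bytes), and the helper name is made up for illustration.

```python
TB = 10**12  # assumption: decimal terabytes


def per_node_throughput_mb_s(total_bytes: float, seconds: float, nodes: int) -> float:
    """Average sorted bytes per second per node, in MB/s (10^6 bytes)."""
    return total_bytes / seconds / nodes / 1e6


# MinuteSort run: 1.5 TB in 59 seconds on 2103 nodes.
minutesort = per_node_throughput_mb_s(1.5 * TB, 59, 2103)

# Terasort run: 1 TB in 54 seconds on 1003 nodes.
terasort = per_node_throughput_mb_s(1 * TB, 54, 1003)

print(f"MinuteSort: {minutesort:.1f} MB/s per node")
print(f"Terasort:   {terasort:.1f} MB/s per node")
```

These are averages over the whole run, so they say nothing about peak disk or network rates on any single node.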

Also see answers.mapr.com for many questions/answers on this topic.

Srivas
  • It is a very interesting document. I think it would be very useful to have a summary of the MapR improvements besides the HDFS replacement. – David Gruzman Mar 03 '13 at 20:05
  • In addition, it is not clear what the file server mentioned in the document is, or what the network was: 1 Gbit or 10 Gbit? – David Gruzman Mar 03 '13 at 20:05
  • The file server is the standard MapR distributed file server. The network is 10GbE. See http://www.mapr.com/doc/display/MapR/Start+Here – Ted Dunning Apr 14 '13 at 22:59
  • 1
    Any source other than a MapR blog? I don't see the sort record here: [http://sortbenchmark.org/](http://sortbenchmark.org/). – cabad Oct 21 '13 at 15:05
0

Until an impartial source does extensive benchmarking (under varying workloads) of Apache Hadoop vs. MapR's version, I think we cannot categorically say one is faster than the other. If records are going to determine your opinion, then you should know that the current terasort record is held by Yahoo, with Apache Hadoop. Details here and here.

cabad
  • Something else to note, "The TeraByte benchmark is now deprecated because it became essentially the same as MinuteSort." REF: http://sortbenchmark.org/ – j.raymond Jan 20 '15 at 21:42
0

The main disadvantage of MapR compared with Hortonworks/Cloudera is that MapRFS (the file system) and MapR-DB (the NoSQL database) are proprietary, not open source. If MapR were to no longer exist, these products would presumably cease to be developed and supported.

There is less risk of HDFS/HBase no longer being developed and supported, because Hortonworks, Cloudera, and other Hadoop distributions use and support HDFS/HBase alongside the open source community.

Larry Advey