
I would like to perform an Apache Spark map-reduce on 5 files and output the results to MongoDB. I would prefer not to use HDFS, since the NameNode is a single point of failure (http://wiki.apache.org/hadoop/NameNode).

A. Is it possible to read multiple files into an RDD, perform a map-reduce on a key across all the files, and use the Casbah toolkit to output the results to MongoDB?

B. Is it possible to use the client to read from MongoDB into an RDD, perform a map-reduce, and write the output back to MongoDB using the Casbah toolkit?

C. Is it possible to read multiple files into an RDD, map them against keys that already exist in MongoDB, reduce them to a single document, and insert it back into MongoDB?
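
Roughly what I have in mind for B, with a local mongod and made-up collection/field names (treat it as a sketch rather than tested code):

```scala
import com.mongodb.casbah.Imports._

// sc is an existing SparkContext.
// Read documents onto the driver with Casbah and turn them into an RDD.
val client = MongoClient("localhost", 27017)   // made-up host/port
val coll = client("mydb")("events")            // made-up db/collection
val pairs = coll.find().map(o => (o.as[String]("key"), o.as[Int]("value"))).toSeq
client.close()

val reduced = sc.parallelize(pairs).reduceByKey(_ + _)
// ...and then write `reduced` back to MongoDB with Casbah? That is question B.
```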

I know all of this is possible using the mongo-hadoop connector. I just don't like the idea of using HDFS, since it is a single point of failure and backup NameNodes are not implemented yet.

I've read some things online, but they are not clear.

MongoDBObject not being added to inside of an RDD foreach loop casbah scala apache spark

Not sure what's going on there. The JSON does not even appear to be valid...

resources:

https://github.com/mongodb/casbah

http://docs.mongodb.org/ecosystem/drivers/scala/

1 Answer


Yes. I haven't used MongoDB, but based on other things I've done with Spark, these should all be quite possible.
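
For example, A could look roughly like the following. I haven't actually run the Casbah part, and the paths, host and collection names are made up:

```scala
import com.mongodb.casbah.Imports._
import org.apache.spark.{SparkConf, SparkContext}

object FilesToMongo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("files-to-mongo"))

    // textFile takes a comma-separated list of paths, so all 5 files
    // land in one RDD without HDFS being involved.
    val lines = sc.textFile("/data/f1.txt,/data/f2.txt,/data/f3.txt,/data/f4.txt,/data/f5.txt")

    // A word-count style map-reduce on a key.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Write the result with Casbah. Create the client inside foreachPartition,
    // on the executors, so nothing non-serializable is captured in the closure.
    counts.foreachPartition { iter =>
      val client = MongoClient("mongo-host", 27017)   // made-up host
      val coll = client("mydb")("wordcounts")
      iter.foreach { case (word, count) =>
        coll.insert(MongoDBObject("word" -> word, "count" -> count))
      }
      client.close()
    }

    sc.stop()
  }
}
```

B and C should be variations on the same pattern: query MongoDB with Casbah (on the driver or per partition) to get the keys or documents you need, join or reduce in Spark, and insert the result back.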

However, do keep in mind that a Spark application is not typically fault-tolerant. The application (aka "driver") itself is a single point of failure. There's a related question on that topic (Resources/Documentation on how does the failover process work for the Spark Driver (and its YARN Container) in yarn-cluster mode), but I think it doesn't have a really good answer at the moment.

I have no experience running a critical HDFS cluster, so I don't know how much of a problem the single point of failure is. But another idea may be running on top of Amazon S3 or Google Cloud Storage. I would expect these to be way more reliable than anything you can cook up. They have large support teams and lots of money and expertise invested.
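
For example, instead of pointing textFile at HDFS you can point it at S3 (bucket name made up; the s3n:// scheme needs AWS credentials configured):

```scala
// Read input directly from S3 instead of HDFS.
val lines = sc.textFile("s3n://my-bucket/input/*.txt")
```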

– Daniel Darabos
  • "But another idea may be running on top of Amazon S3 or Google Cloud Storage. I would expect these to be way more reliable than anything you can cook up" @daniel - no doubt that is true, but it is not an option. Could you elaborate on how the driver is a single point of failure? Do you mean the actual scala/java/python map-reduce application, the casbah toolkit, or Spark itself? – user1290942 Feb 02 '15 at 21:41
    The scala/java/python application itself. If it dies, everything stored in Spark RDDs is lost. – Daniel Darabos Feb 02 '15 at 22:29
  • Of course if you only use RDDs temporarily and store your data in MongoDB, you're probably in a good spot! – Daniel Darabos Feb 02 '15 at 22:30