
Everywhere on Google, the key difference between Spark and Hadoop MapReduce is stated as the approach to processing: Spark can do it in-memory, while Hadoop MapReduce has to read from and write to disk. I think I get it, but I would like to confirm it with an example.

Consider this word count example:

    val text = sc.textFile("mytextfile.txt")
    val counts = text.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.collect

My understanding:

In the case of Spark, once the lines are split on " ", the output is kept in memory. The same holds for the subsequent map and reduceByKey steps. I believe the same is true when processing happens across partitions.
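To make that concrete, here is a plain-Scala analogue of the pipeline above (using ordinary collections rather than RDDs, so no SparkContext is needed; the data is made up). Every intermediate collection lives in memory, which is the behaviour being asked about:

```scala
// Plain-Scala sketch of the word-count pipeline. In Spark, each of these
// intermediate stage outputs would likewise be kept in memory rather than
// written to disk between steps.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val text  = Seq("a b a", "b c")          // stands in for sc.textFile(...)
    val words = text.flatMap(_.split(" "))   // in memory: Seq("a","b","a","b","c")
    val pairs = words.map(word => (word, 1))
    // groupBy + sum plays the role of reduceByKey(_ + _)
    val counts = pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }
    println(counts.toSeq.sorted)  // Vector((a,2), (b,2), (c,1))
  }
}
```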

In the case of MapReduce, will each intermediate result (like the words after the split/map/reduce steps) be kept on disk, i.e. HDFS, which makes it slower compared to Spark? Is there no way to keep them in memory? Is the same true for per-partition results?

thebluephantom
emilly

1 Answer


Yes, you are right.

The SPARK intermediate RDD (Resilient Distributed Dataset) results are kept in memory, and hence latency is a lot lower and job throughput higher. RDDs have partitions, chunks of data, like MR. SPARK also offers iterative processing, which is another key point to consider.
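A rough sketch of why iterative processing favours in-memory data (plain Scala with made-up numbers; in real Spark you would call `.cache()` on the RDD once and reuse it). Each loop below re-reads the same in-memory collection, whereas a classic MR chain would write results to HDFS after every iteration and read them back:

```scala
// Three "iterations" over the same in-memory dataset, with no disk
// round-trip between them - the pattern Spark's caching is built for.
object IterativeSketch {
  def main(args: Array[String]): Unit = {
    var data = Seq(1.0, 2.0, 3.0, 4.0)  // stays resident in memory
    for (_ <- 1 to 3)                   // three passes, no intermediate writes
      data = data.map(_ * 0.5)
    println(data)  // List(0.125, 0.25, 0.375, 0.5)
  }
}
```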

MR does have a Combiner, of course, to ease the pain a little.
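The Combiner pre-aggregates each mapper's output locally before the shuffle, so fewer records cross the network; Spark's `reduceByKey` does a similar map-side combine per partition. A minimal plain-Scala sketch of the idea, with hypothetical per-node data:

```scala
// Sketch of what a Combiner buys you: each "partition" pre-aggregates its
// (word, 1) pairs locally, so 4 records instead of 5 reach the final reduce.
object CombinerSketch {
  def localCombine(partition: Seq[(String, Int)]): Map[String, Int] =
    partition.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq(("a", 1), ("a", 1), ("b", 1)),  // map output on node 1
      Seq(("b", 1), ("a", 1))             // map output on node 2
    )
    val combined = partitions.map(localCombine)  // locally reduced per node
    // Final reduce merges the pre-aggregated maps.
    val totals = combined.flatten.groupBy(_._1).map { case (w, cs) => (w, cs.map(_._2).sum) }
    println(totals.toSeq.sorted)  // Vector((a,3), (b,2))
  }
}
```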

But SPARK is far easier to use as well, with Scala or pyspark.

I would not worry about MR anymore - in general.

Here is an excellent read on SPARK BTW: https://medium.com/@goyalsaurabh66/spark-basics-rdds-stages-tasks-and-dag-8da0f52f0454

thebluephantom
  • what about my question on map reduce in the context of the example? Will it keep each intermediate result (like words after split/map/reduce) on disk like HDFS? – emilly May 12 '19 at 14:11
  • Local file system as opposed to HDFS. @emilly – thebluephantom May 12 '19 at 14:54
  • Actually what I am asking is this - under Hadoop MapReduce, this code snippet `text.flatMap(line => line.split(" "))` will store the result on disk, whereas Spark will keep the result in memory. Is that correct? I am wondering why MR keeps the result on disk here when it knows that it has to process it in the very next line. – emilly May 14 '19 at 01:39
  • Yes. Doug Cutting made that design choice then as it is easier to implement than Spark's DAG, and disk was far cheaper than memory. Of course SSDs are now used a lot. – thebluephantom May 14 '19 at 06:08
  • can you please take a look at https://stackoverflow.com/questions/56147130/how-spark-internally-works-with-this-usecase ? – emilly May 15 '19 at 17:53