
My company has two jobs, and we will pick just one of them to start with Spark. The jobs are:

  1. The first job analyses a large volume of text to look for ERROR messages (essentially a grep).
  2. The second job does machine learning: it computes prediction models over some data in an iterative way.

My question is: which of the two jobs will benefit the most from Spark?

Spark relies on memory, so I think it is better suited to the machine learning job; the amount of data there isn't that large compared with the logs job. But I'm not sure. Can someone tell me whether I have neglected some piece of information?


2 Answers


The right Spark deployment strategy depends on the volume of data and on how you receive it. Spark can fit both of your scenarios.

Scenario 1 - You can deploy Spark for your first job as well, especially if you receive the data as a stream. Spark Streaming enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources such as Kafka, Flume, Kinesis, or TCP sockets, processed using Spark's transformations, and finally pushed out to the Hadoop HDFS filesystem.
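
As an illustration only (the socket source, port, batch interval, and output path are placeholders I made up, not anything from the question), a minimal Spark Streaming job that keeps just the ERROR lines and pushes them to HDFS could look roughly like this:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ErrorStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ErrorStream")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    // Read raw log lines from a TCP socket (Kafka/Flume/Kinesis sources work similarly)
    val lines  = ssc.socketTextStream("localhost", 9999)
    val errors = lines.filter(_.contains("ERROR"))

    // Each micro-batch is written out as a directory of part files on HDFS
    errors.saveAsTextFiles("hdfs:///logs/errors/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}
```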

If your data is already on HDFS, you can still use Spark to process it, and it will make the processing faster. However, if it is a pure batch job and you don't have sufficient resources in your Hadoop cluster, MapReduce is preferred for this kind of scenario.
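
For comparison, the batch version of the same "grep" over files that are already on HDFS is only a few lines in Spark. This is just a sketch with made-up paths:

```scala
import org.apache.spark.sql.SparkSession

object ErrorGrep {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ErrorGrep").getOrCreate()

    // Read the log files as a Dataset[String] and keep only the ERROR lines
    val logs   = spark.read.textFile("hdfs:///logs/app/*.log")
    val errors = logs.filter(_.contains("ERROR"))

    // Write the matching lines back to HDFS
    errors.write.text("hdfs:///logs/errors")
    spark.stop()
  }
}
```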

Scenario 2 - Your first application will process the data and store it on HDFS; you can then use Spark MLlib operations for the modelling work. Please check that the kinds of predictions you will be performing are covered by MLlib.
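
As a sketch only, assuming the first application leaves a table of numeric features plus a label column on HDFS (the paths, column names, and the choice of logistic regression are purely illustrative):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object TrainModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TrainModel").getOrCreate()

    val data = spark.read.parquet("hdfs:///data/features") // assumed columns: f1, f2, f3, label

    // Assemble the raw numeric columns into the single vector column MLlib expects
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))
      .setOutputCol("features")

    val train = assembler.transform(data).cache() // cached: re-scanned on every iteration
    val lr    = new LogisticRegression().setMaxIter(100) // iterative optimiser

    val model = lr.fit(train)
    model.write.overwrite().save("hdfs:///models/lr")
    spark.stop()
  }
}
```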

Finally, I can say that Spark is suited to both of your scenarios, and you can use it for both operations.

– Sandeep Singh
  • *MapReduce is preferred for this kind of Scenario*. I don't think so... YARN MR2 containers still need adequate memory to outperform Spark – OneCricketeer Jan 07 '18 at 16:20
  • @cricket_007 YARN or MR2 doesn't load the entire data into memory at once. Memory allocation depends on the number of mapper tasks allocated to the job. Container memory allocation is controlled by the properties `yarn.scheduler.minimum-allocation-mb` and `yarn.scheduler.maximum-allocation-mb`. Please check my answer https://stackoverflow.com/questions/43826703/difference-between-yarn-scheduler-maximum-allocation-mb-and-yarn-nodemanager for more detail on memory allocation. – Sandeep Singh Jan 07 '18 at 16:57
  • Sure, but allowing those to be within the same range as the executor memory will be better than Spark? – OneCricketeer Jan 07 '18 at 16:59
  • Executor memory can be allocated within a range, but if the RDD doesn't fit in memory Spark will start spilling data to disk during processing. Hence MapReduce is still preferred for very large data volumes (say, terabytes and petabytes) – Sandeep Singh Jan 07 '18 at 17:16
  • Are you only referring to the speed of keeping data in memory? MR can also spill or run into OOM. I've run Spark on TB datasets before, so I don't understand that point – OneCricketeer Jan 07 '18 at 17:20
  • No, there are many factors to consider before selecting the right framework. If Spark runs on a shared Hadoop YARN cluster with other resource-demanding services, and the data is too big to fit entirely into memory, Spark can suffer major performance degradation. MapReduce, however, kills its processes as soon as a job is done, so it can easily run alongside other services with only minor performance differences. – Sandeep Singh Jan 07 '18 at 17:32
  • There is a good comparison between Spark and MapReduce here: https://www.xplenty.com/blog/apache-spark-vs-hadoop-mapreduce/ – Sandeep Singh Jan 07 '18 at 17:33
  • Spark doesn't kill previous-stage processes? I've run into plenty of MR slowness / performance degradation on a multi-tenant YARN cluster with the capacity scheduler – OneCricketeer Jan 07 '18 at 17:38
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/162693/discussion-between-sandeep-singh-and-cricket-007). – Sandeep Singh Jan 07 '18 at 17:42
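
On the spilling point discussed in these comments: Spark doesn't have to hold the whole dataset in memory; you can pick a storage level that spills partitions to local disk instead. A tiny sketch (the path is made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SpillDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SpillDemo").getOrCreate()
    val logs  = spark.read.textFile("hdfs:///logs/app/*.log")

    // MEMORY_AND_DISK: partitions that don't fit in executor memory are
    // spilled to local disk rather than failing the job
    logs.persist(StorageLevel.MEMORY_AND_DISK)

    println(logs.filter(_.contains("ERROR")).count())
    spark.stop()
  }
}
```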

Here is a good answer I found in Data Science:

I think the second job will benefit more from Spark than the first one. The reason is that machine learning and predictive models often run multiple iterations over the same data.

As you have mentioned, Spark is able to keep data in memory between two iterations, while Hadoop MapReduce has to write the data to the file system and read it back.
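
A toy sketch of that access pattern (the path and the per-iteration work are made up; a real algorithm would do a gradient step or similar on each pass):

```scala
import org.apache.spark.sql.SparkSession

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("IterativeDemo").getOrCreate()

    // Read the training data from HDFS once and keep it in memory
    val data = spark.read.textFile("hdfs:///data/training").cache()
    data.count() // materialise the cache

    for (i <- 1 to 10) {
      // Every iteration scans the cached dataset, not the file system
      val n = data.filter(_.nonEmpty).count()
      println(s"iteration $i scanned $n cached records")
    }
    spark.stop()
  }
}
```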

Here is a good comparison of the two frameworks:

https://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce


As much as I agree with you, @Sandeep Singh, I must say that Hadoop isn't well suited to a large number of iterative operations.

– Melchia