
I am new to Spark and MapReduce, and I have a problem running Spark on an AWS Elastic MapReduce (EMR) cluster. The problem is that running on EMR takes a lot of time.

For example, I have a few million records in a .csv file that I read and converted into a JavaRDD. It took Spark 104.99 seconds to run simple mapToDouble() and sum() operations on this dataset.

When I did the same calculation without Spark, using Java 8 and converting the .csv file to a List, it took only 0.5 seconds (see code below).

This is the Spark code (104.99 seconds):

    private double getTotalUnits(JavaRDD<DataObject> dataCollection)
    {
        // count() is itself a Spark action: it launches a full job over the
        // dataset before the mapToDouble()/sum() job below even starts.
        if (dataCollection.count() > 0)
        {
            return dataCollection
                    .mapToDouble(data -> data.getQuantity())
                    .sum();
        }
        else
        {
            return 0.0;
        }
    }

And this is the same calculation in plain Java, without Spark (0.5 seconds):

    private double getTotalOps(List<DataObject> dataCollection)
    {
        if (dataCollection.size() > 0)
        {
            return dataCollection
                    .stream()
                    .mapToDouble(data -> data.getPrice() * data.getQuantity())
                    .sum();
        }
        else
        {
            return 0.0;
        }
    }

I'm new to EMR and Spark, so I don't know what I should do to fix this problem.
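One detail I did notice: count() is itself a Spark action, so getTotalUnits scans the data twice. Since sum() on an empty JavaDoubleRDD already returns 0.0, a version without the guard would look like this (a sketch):

    private double getTotalUnits(JavaRDD<DataObject> dataCollection)
    {
        // sum() returns 0.0 for an empty RDD, so no count() guard is needed;
        // this avoids a second full pass over the data.
        return dataCollection
                .mapToDouble(data -> data.getQuantity())
                .sum();
    }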

UPDATE: This is a single example of such a function. My whole task is to calculate different statistics (sum, mean, median) and perform different transformations on 6 GB of data, which is why I decided to use Spark. The whole app takes about 3 minutes to run on 6 GB of data using regular Java and about 18 minutes using Spark and MapReduce.
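For reference, a rough sketch of the statistics part (assuming the same DataObject as above; JavaDoubleRDD.stats() returns count, sum, mean, etc. in a single job, while the median needs its own pass):

    import org.apache.spark.api.java.JavaDoubleRDD;
    import org.apache.spark.util.StatCounter;

    // Sketch: one-pass summary statistics over the quantities.
    JavaDoubleRDD quantities = dataCollection
            .mapToDouble(data -> data.getQuantity())
            .cache();                        // reused across several statistics

    StatCounter stats = quantities.stats(); // single job: count, sum, mean, stdev
    double sum  = stats.sum();
    double mean = stats.mean();
    // The median is not part of StatCounter; it requires a separate step
    // (e.g. sorting the RDD, or approxQuantile on a DataFrame).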

WomenWhoCode

1 Answer


I believe you are comparing apples to oranges. You must understand when to use Big Data tools versus a normal Java program.

Big Data frameworks are not meant for small amounts of data. A Big Data framework has to perform many management tasks in a distributed environment (scheduling, task distribution, serialization, shuffling), which is significant overhead. For a small dataset, the actual processing time can be tiny compared to the time taken to manage the whole process on the Hadoop platform. Hence a standalone program is bound to perform better than Big Data tools like MapReduce or Spark on such inputs.

If you wish to see the difference, process at least 1 TB of data through the two programs above and compare the time each takes.

Apart from the above point, Big Data platforms bring fault tolerance to processing. Think about what would happen if the JVM crashed (say, with an OutOfMemoryError) during a normal Java program's execution: the whole process simply collapses. On a Big Data platform, the framework ensures that processing is not halted, and failure recovery/retry takes place. This makes it fault tolerant, and you do not lose the work already done on other parts of the data because of a single crash.

The table below gives a rough idea of when you should switch to Big Data.

[image: table of data-size categories and when Big Data tools become appropriate]

Gyanendra Dwivedi
  • Thanks for the reply; this is useful to know. In my question I measured just a single part of my app. My whole task is to calculate different statistics (sum, mean, median) and perform different transformations on 6 GB of data, which is why I decided to use Spark. The whole app takes about 3 minutes to run on 6 GB of data using regular Java and 18 minutes using Spark and MapReduce. – WomenWhoCode Mar 12 '18 at 17:27
  • @HelloWorld Got it! As you can see, the data is not even in the `medium` category. Do you have the data in one single file or in multiple small files? – Gyanendra Dwivedi Mar 12 '18 at 17:32
  • Thanks!! Yes, I have 10 files, ~600 MB each. – WomenWhoCode Mar 12 '18 at 17:38
  • @HelloWorld Then merge all the files into one and try `mapreduce/spark` again; you should see a slight improvement. In your case, though, I would suggest a normal Java program. A multithreaded Java program would be much faster, with each thread working on a single file and the results aggregated into the final output (see the sketch below). However, you may keep using `Bigdata` if the volume is likely to increase. Also, `Bigdata` comes with reliability against data loss and recovery from processing failures. – Gyanendra Dwivedi Mar 12 '18 at 17:44
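A minimal sketch of that multithreaded approach, assuming a hypothetical parseFile(Path) helper that reads one CSV file into a List<DataObject>:

    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch: one thread per file; each thread sums its own file, then the
    // partial sums are aggregated. parseFile(...) is a hypothetical helper
    // that reads a single CSV file into a List<DataObject>.
    private double getTotalUnits(List<Path> files) throws Exception
    {
        ExecutorService pool = Executors.newFixedThreadPool(files.size());
        try
        {
            List<Future<Double>> partials = new ArrayList<>();
            for (Path file : files)
            {
                partials.add(pool.submit(() -> parseFile(file)
                        .stream()
                        .mapToDouble(data -> data.getQuantity())
                        .sum()));
            }
            double total = 0.0;
            for (Future<Double> partial : partials)
            {
                total += partial.get(); // aggregate the per-file sums
            }
            return total;
        }
        finally
        {
            pool.shutdown();
        }
    }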