I am new to Spark and MapReduce, and I have a problem running Spark on an AWS Elastic MapReduce (EMR) cluster. The problem is that running on EMR takes a very long time.
For example, I have a few million records in a .csv file, which I read and converted into a JavaRDD. Spark took 104.99 seconds to run simple mapToDouble() and sum() operations on this dataset.
Meanwhile, when I did the same calculation without Spark, using Java 8 and converting the .csv file into a List, it took only 0.5 seconds. (See the code below.)
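For reference, the RDD is built roughly like this (a simplified sketch; the S3 path and the fromCsvLine() parser are placeholder names, not my exact code):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Simplified sketch of the RDD setup; path and parser are placeholders.
SparkConf conf = new SparkConf().setAppName("StatsApp");
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<DataObject> dataCollection = sc
        .textFile("s3://my-bucket/data.csv")        // placeholder input location
        .map(line -> DataObject.fromCsvLine(line)); // placeholder CSV parser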
This is the Spark code (104.99 seconds):
private double getTotalUnits(JavaRDD<DataObject> dataCollection)
{
    if (dataCollection.count() > 0)
    {
        return dataCollection
            .mapToDouble(data -> data.getQuantity())
            .sum();
    }
    else
    {
        return 0.0;
    }
}
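The 104.99 seconds is wall-clock time measured around the call, roughly like this (the exact measurement harness doesn't matter much):

long start = System.currentTimeMillis();
double total = getTotalUnits(dataCollection);
double seconds = (System.currentTimeMillis() - start) / 1000.0;
System.out.println("total = " + total + " in " + seconds + " s");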
And this is the same Java code without using Spark (0.5 seconds):
private double getTotalOps(List<DataObject> dataCollection)
{
    if (dataCollection.size() > 0)
    {
        return dataCollection
            .stream()
            .mapToDouble(data -> data.getPrice() * data.getQuantity())
            .sum();
    }
    else
    {
        return 0.0;
    }
}
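DataObject itself is a plain POJO along these lines (trimmed down here; the real class has more fields). It implements Serializable so that Spark can ship it to the executors:

import java.io.Serializable;

// Trimmed-down stand-in for the real class.
public class DataObject implements Serializable {
    private double price;
    private double quantity;

    public double getPrice()    { return price; }
    public double getQuantity() { return quantity; }
}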
I'm new to EMR and Spark, so I don't know what I should do to fix this problem.
UPDATE: This is a single example of the function. My whole task is to calculate different statistics (sum, mean, median) and perform different transformations on 6 GB of data, which is why I decided to use Spark. The whole app takes about 3 minutes to run on the 6 GB of data using regular Java, and 18 minutes using Spark and MapReduce.
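For example, the sum/mean part in Spark boils down to something like this (a simplified sketch, not my exact code; in Spark, sum and mean both come out of a single stats() pass over a JavaDoubleRDD):

import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.util.StatCounter;

// Sketch of the sum/mean statistics; both come from one pass via stats().
JavaDoubleRDD quantities = dataCollection.mapToDouble(DataObject::getQuantity);
StatCounter stats = quantities.stats();
double sum  = stats.sum();
double mean = stats.mean();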