Slow Spark application - Java

Question

I am trying to create a Spark application that takes a dataset of (lat, long, timestamp) points and increments a grid cell's count for each point that falls inside that cell. The grid consists of 3D cells, with longitude and latitude as the horizontal axes and time as the z-axis.

I have completed the application and it does what it's supposed to, but it takes hours to scan the whole dataset (~9 GB). My cluster consists of 3 nodes with 4 cores and 8 GB of RAM each, and I am currently using 6 executors with 1 core and 2 GB each.

I am guessing that I can optimize the code quite a bit, but is there a big mistake in my code that causes this delay?

    // Create a JavaPairRDD with tuple elements. For each String line of lines we split the string
    // and assign latitude, longitude and timestamp of each line to sdx, sdy and sdt. Then we check if the data point of
    // that line is contained in a cell of the centroids list. If it is, then a new tuple is returned
    // with key the latitude, longitude and timestamp (joined by ",") of that cell and value 1.

    JavaPairRDD<String, Integer> pairs = lines.mapToPair(x -> {


        String sdx = x.split(" ")[2];
        String sdy = x.split(" ")[3];
        String sdt = x.split(" ")[0];

        double dx = Double.parseDouble(sdx);
        double dy = Double.parseDouble(sdy);
        int dt = Integer.parseInt(sdt);

        List<Integer> t = brTime.getValue();
        List<Point2D.Double> p = brCoo.getValue();

        double dist = brDist.getValue();
        int dur = brDuration.getValue();

        for(int timeCounter=0; timeCounter<t.size(); timeCounter++) {
            for ( int cooCounter=0; cooCounter < p.size(); cooCounter++) {

                double cx = p.get(cooCounter).getX();
                double cy = p.get(cooCounter).getY();
                int ct = t.get(timeCounter);

                String scx = Double.toString(cx);
                String scy = Double.toString(cy);
                String sct = Integer.toString(ct);

                if (dx > (cx-dist) && dx <= (cx+dist)) {
                    if (dy > (cy-dist) && dy <= (cy+dist)) {
                        if (dt > (ct-dur) && dt <= (ct+dur)) {

                            return new Tuple2<String, Integer>(scx+","+scy+","+sct,1);
                        }
                    }
                }
            }
        }
        return new Tuple2<String, Integer>("Out Of Bounds",1);
    });

I tried from HDFS and from disk, but it's slow in both cases. I've tried 50 MB and 350 MB parts of the dataset, and they take 300 seconds and 10 minutes respectively. — Giannoulo, Aug 29 '18 at 10:05
I think you should load a considerable part of the file into a map and then distribute it for execution. It's been quite a while since I used Spark, so maybe things have changed. Something like [this](https://stackoverflow.com/questions/42169926/reading-csv-file-in-spark-in-a-distributed-manner) — Skynet, Aug 29 '18 at 10:07
But it's already distributed. It's on HDFS, and when I say from disk I mean it's on the disk of all the nodes at the same path. — Giannoulo, Aug 29 '18 at 10:16
Try using mapPartitions, it is faster; see this example [link](https://github.com/yhemanth/spark-samples/blob/master/src/main/java/com/dsinpractice/spark/samples/core/MapPartitions.java). The other thing to do is to move this part of the code outside the timeCounter loop — M-BNCH, Aug 29 '18 at 10:28

score 0 · Answer 1 · answered Aug 29 '18 at 10:32

One of the biggest factors that may contribute to costs in running a Spark map like this relates to data access outside of the RDD context, which means driver interaction. In your case, there are at least 4 accessors of variables where this occurs: brTime, brCoo, brDist, and brDuration. It also appears that you're doing some line parsing via String#split rather than leveraging built-ins, and the same line is split three times per record. Finally, scx, scy, and sct are all calculated on every loop iteration, though they're only returned if their numeric counterparts pass a series of checks, which means wasted CPU cycles and extra GC.

Without actually reviewing the job plan, it's tough to say whether the above will make performance reach an acceptable level. Check out your history server application logs and see if there are any stages which are eating up your time - once you've identified a culprit there, that's what actually needs optimizing.
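
As a rough illustration of the parsing and string-building points above, the sketch below splits each line once and builds the key string only when a cell actually matches. It is plain Java with no Spark dependency. The class name CellLookup and the method findCellKey are hypothetical, the field positions (timestamp in column 0, coordinates in columns 2 and 3) are taken from the question's code, and checking the time slice before the inner loop is an extra assumption that should not change which cell is returned.

```java
import java.awt.geom.Point2D;
import java.util.Arrays;
import java.util.List;

class CellLookup {

    // Hypothetical helper: returns the "cx,cy,ct" key of the first matching
    // cell, or "Out Of Bounds" if the point falls in no cell.
    static String findCellKey(String line, List<Integer> times,
                              List<Point2D.Double> coords, double dist, int dur) {
        // Split once per record instead of once per field.
        String[] fields = line.split(" ");
        int dt = Integer.parseInt(fields[0]);
        double dx = Double.parseDouble(fields[2]);
        double dy = Double.parseDouble(fields[3]);

        for (int ct : times) {
            // Assumption: test the time slice first so that non-matching
            // slices skip the inner coordinate loop entirely.
            if (dt <= ct - dur || dt > ct + dur) {
                continue;
            }
            for (Point2D.Double c : coords) {
                double cx = c.getX();
                double cy = c.getY();
                if (dx > cx - dist && dx <= cx + dist
                        && dy > cy - dist && dy <= cy + dist) {
                    // The key string is built only for the matching cell.
                    return cx + "," + cy + "," + ct;
                }
            }
        }
        return "Out Of Bounds";
    }

    public static void main(String[] args) {
        List<Integer> times = Arrays.asList(100, 200);
        List<Point2D.Double> coords = Arrays.asList(
                new Point2D.Double(0.0, 0.0), new Point2D.Double(1.0, 2.0));
        System.out.println(findCellKey("100 x 1.0 2.0", times, coords, 0.5, 10));
        System.out.println(findCellKey("500 x 1.0 2.0", times, coords, 0.5, 10));
    }
}
```

Inside the Spark job, the lambda passed to mapToPair could wrap the returned key in a Tuple2 with value 1, in the same way the question's code does. Whether this alone brings the runtime to an acceptable level would still need to be confirmed against the stage timings.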

score 0 · Answer 2 · answered Aug 29 '18 at 11:45

I tried mapPartitionsToPair and also moved the calculations of scx, scy and sct so that they are calculated only if the point passes the conditions. The speed of the application has improved dramatically: it now takes only 17 minutes! I believe that mapPartitionsToPair was the biggest factor. Thanks a lot Mks and bsplosion!
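
For readers who have not used the per-partition style, here is a minimal sketch of the idea. It is not the code from this job: it is plain Java without Spark, and PartitionMapper, mapPartition and timeSliceKey are hypothetical names. The point it shows is that setup happens once per partition and the records are then streamed lazily through an Iterator. Assuming Spark 2.x, where the function passed to mapPartitionsToPair receives an Iterator and returns an Iterator of Tuple2, the same shape should carry over with Tuple2 in place of Map.Entry.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class PartitionMapper {

    // Lazily turns one partition's lines into (key, 1) pairs.
    // keyFor is a hypothetical stand-in for the grid-cell lookup.
    static Iterator<Map.Entry<String, Integer>> mapPartition(
            Iterator<String> lines, Function<String, String> keyFor) {
        return new Iterator<Map.Entry<String, Integer>>() {
            @Override
            public boolean hasNext() {
                return lines.hasNext();
            }

            @Override
            public Map.Entry<String, Integer> next() {
                return new SimpleEntry<>(keyFor.apply(lines.next()), 1);
            }
        };
    }

    // Simplified lookup on the time axis only (assumption, for brevity).
    static String timeSliceKey(String line, List<Integer> times, int dur) {
        int dt = Integer.parseInt(line.split(" ")[0]);
        for (int ct : times) {
            if (dt > ct - dur && dt <= ct + dur) {
                return Integer.toString(ct);
            }
        }
        return "Out Of Bounds";
    }

    public static void main(String[] args) {
        // Per-partition setup: read shared values once, not once per record.
        List<Integer> times = Arrays.asList(100, 200);
        int dur = 10;

        Iterator<String> partition =
                Arrays.asList("100 a 1.0 2.0", "205 b 1.0 2.0").iterator();
        Iterator<Map.Entry<String, Integer>> pairs =
                mapPartition(partition, line -> timeSliceKey(line, times, dur));
        while (pairs.hasNext()) {
            System.out.println(pairs.next());
        }
    }
}
```

Returning a lazy iterator rather than collecting the partition into a list keeps memory use flat for large partitions, which is one reason this style is often suggested.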

Giannoulo


You're welcome :) Can you please mark my answer as the solution to your question? – M-BNCH Aug 29 '18 at 11:49

score 0 · Accepted Answer · answered Aug 29 '18 at 11:52

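Try using mapPartitions, it is faster; see this example [link](https://github.com/yhemanth/spark-samples/blob/master/src/main/java/com/dsinpractice/spark/samples/core/MapPartitions.java). The other thing to do is to move this part of the code outside the timeCounter loop.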

M-BNCH

