
I'm wondering what the runtime in Spark is when sampling an RDD/DF compared with operating on the full RDD/DF. I don't know if it makes a difference, but I'm currently using Java + Spark 1.5.1 + Hadoop 2.6.

JavaRDD<Row> rdd = sc.textFile(HdfsDirectoryPath()).map(new Function<String, Row>() {
        @Override
        public Row call(String line) throws Exception {
            String[] fields = line.split(usedSeparator);
            // Assume the schema has 4 integer columns
            GenericRowWithSchema row = new GenericRowWithSchema(fields, schema);
            return row;
        }
    });

DataFrame df = sqlContext.createDataFrame(rdd, schema);
df.registerTempTable("df");
DataFrame selectdf = sqlContext.sql("SELECT * FROM df");
Row[] res = selectdf.collect();

DataFrame sampleddf = sqlContext.createDataFrame(rdd, schema).sample(false, 0.1); // 10% of the original DS
sampleddf.registerTempTable("sampledf");
DataFrame selectedSampledf = sqlContext.sql("SELECT * FROM sampledf");
res = selectedSampledf.collect();

I would expect the sampling to be optimally close to ~90% faster. But to me it looks like Spark goes through the whole DF, or does a count, which takes nearly the same time as the select on the full DF. Only after the sample is generated does it execute the select.

Am I correct with this assumption, or am I using sampling the wrong way, so that both selects end up with the same runtime?


1 Answer


> I would expect the sampling to be optimally close to ~90% faster.

Well, there are a few reasons why these expectations are unrealistic:

  • without any prior assumptions about the data distribution, obtaining a uniform sample requires a full dataset scan. This is more or less what happens when you use the sample or takeSample methods in Spark
  • SELECT * is a relatively lightweight operation. Depending on the amount of resources you have, the time to process a single partition can be negligible
  • sampling doesn't reduce the number of partitions. If you don't coalesce or repartition, you can end up with a large number of almost empty partitions, which means suboptimal resource usage (see the sketch after this list)
  • while RNGs are usually quite efficient, generating random numbers is not free
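
To make the partition point concrete, here is a minimal sketch against the rdd from the question (Spark 1.5 Java API); the partition counts in the comments are made-up example values:

    // sample() keeps the parent's partition count, so the scheduler still
    // launches one task per original partition
    JavaRDD<Row> sampled = rdd.sample(false, 0.1);

    System.out.println(rdd.partitions().size());     // e.g. 100
    System.out.println(sampled.partitions().size()); // still 100, each ~10% full

    // coalesce() shrinks the partition count without a full shuffle
    JavaRDD<Row> compacted = sampled.coalesce(10);
    System.out.println(compacted.partitions().size()); // 10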

There are at least two important benefits of sampling:

  • lower memory usage, including less work for the garbage collector
  • less data to serialize / deserialize and transfer in case of shuffling or collecting (illustrated below)
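
As a rough illustration of the second point, using the df and sampleddf variables from the question:

    // collecting the full DataFrame ships every row to the driver, while
    // collecting the sample ships only ~10% of the rows, so there is less
    // serialization work and less driver memory pressure
    Row[] full = df.collect();
    Row[] sampledRows = sampleddf.collect();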

If you want to get the most out of sampling, it makes sense to sample, coalesce, and cache.
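
A minimal sketch of that pattern with the DataFrame API from the question (Spark 1.5); the partition count 10 is an arbitrary example value, not a tuned recommendation:

    DataFrame sampleddf = sqlContext.createDataFrame(rdd, schema)
            .sample(false, 0.1) // take ~10% without replacement
            .coalesce(10)       // compact the nearly empty partitions
            .cache();           // keep the small sample in memory for reuse

    sampleddf.registerTempTable("sampledf");
    // subsequent queries against "sampledf" run over the cached, compacted sample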

  • Great, thanks for the hint. I really need to try out coalesce then, because I also filter the same rdd several times, and this means I end up with rdds of the same size, if I understood you right? Caching is a bit problematic when I have more data than memory. One further question: why is the "distribution assumption" or size not collected when I read the file into the rdd? – user5490570 Nov 07 '15 at 13:44
  • What I mean by distribution is a matter of statistical properties. If you know something about it, you can sample in a smarter way, especially if randomness is not a hard requirement. For example, see [BlinkDB](http://blinkdb.org/) – zero323 Nov 12 '15 at 04:01