
I'm wondering what the runtime in Spark is when sampling an RDD/DF compared with operating on the full RDD/DF. I don't know if it makes a difference, but I'm currently using Java + Spark 1.5.1 + Hadoop 2.6.

JavaRDD<Row> rdd = sc.textFile(HdfsDirectoryPath()).map(new Function<String, Row>() {
        @Override
        public Row call(String line) throws Exception {
            String[] fields = line.split(usedSeparator);
            // Assume the schema has 4 integer columns
            GenericRowWithSchema row = new GenericRowWithSchema(fields, schema);
            return row;
        }
    });

DataFrame df = sqlContext.createDataFrame(rdd, schema);
df.registerTempTable("df");
DataFrame selectdf = sqlContext.sql("SELECT * FROM df");
Row[] res = selectdf.collect();

DataFrame sampleddf = sqlContext.createDataFrame(rdd, schema).sample(false, 0.1); // 10% of the original DS
sampleddf.registerTempTable("sampledf");
DataFrame selectedSampledf = sqlContext.sql("SELECT * FROM sampledf");
res = selectedSampledf.collect();

I would expect the sampling to be optimally close to ~90% faster. But to me it looks like Spark goes through the whole DF, or does a count, which takes nearly the same time as the select on the full DF. Only after the sample is generated does it execute the select.

Am I correct with this assumption, or am I using sampling the wrong way, so that both selects end up with the same runtime?


1 Answer


> I would expect the sampling to be optimally close to ~90% faster.

Well, there are a few reasons why these expectations are unrealistic:

  • without any prior assumptions about the data distribution, obtaining a uniform sample requires a full dataset scan. This is more or less what happens when you use the sample or takeSample methods in Spark
  • SELECT * is a relatively lightweight operation. Depending on the amount of resources you have, the time to process a single partition can be negligible
  • sampling doesn't reduce the number of partitions. If you don't coalesce or repartition, you can end up with a large number of almost empty partitions, which means suboptimal resource usage (see the sketch after this list)
  • while RNGs are usually quite efficient, generating random numbers is not free
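
To make the partition point concrete, here is a minimal sketch against the rdd from the question (Spark 1.5 Java API); the partition counts in the comments are made-up example values:

    // sample() keeps the parent's partition count, so the scheduler still
    // launches one task per original partition
    JavaRDD<Row> sampled = rdd.sample(false, 0.1);

    System.out.println(rdd.partitions().size());     // e.g. 100
    System.out.println(sampled.partitions().size()); // still 100, each ~10% full

    // coalesce() shrinks the partition count without a full shuffle
    JavaRDD<Row> compacted = sampled.coalesce(10);
    System.out.println(compacted.partitions().size()); // 10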

There are at least two important benefits of sampling:

  • lower memory usage, including less work for the garbage collector
  • less data to serialize / deserialize and transfer in case of shuffling or collecting (illustrated below)
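
As a rough illustration of the second point, using the df and sampleddf variables from the question:

    // collecting the full DataFrame ships every row to the driver, while
    // collecting the sample ships only ~10% of the rows, so there is less
    // serialization work and less driver memory pressure
    Row[] full = df.collect();
    Row[] sampledRows = sampleddf.collect();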

If you want to get the most out of sampling, it makes sense to sample, coalesce, and cache.
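
A minimal sketch of that pattern with the DataFrame API from the question (Spark 1.5); the partition count 10 is an arbitrary example value, not a tuned recommendation:

    DataFrame sampleddf = sqlContext.createDataFrame(rdd, schema)
            .sample(false, 0.1) // take ~10% without replacement
            .coalesce(10)       // compact the nearly empty partitions
            .cache();           // keep the small sample in memory for reuse

    sampleddf.registerTempTable("sampledf");
    // subsequent queries against "sampledf" run over the cached, compacted sample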

  • Great, thanks for the hint. I really need to try out coalesce then, because I also filter the same rdd several times, and this means I end up with rdds of the same size, if I understood you right? Caching is a bit problematic when I have more data than memory. One further question: why is the "distribution assumption" or size not collected when I read the file into the rdd? – user5490570 Nov 07 '15 at 13:44
  • What I mean by distribution is a matter of statistical properties. If you know something about it, you can sample in a smarter way, especially if randomness is not a hard requirement. For example, see [BlinkDB](http://blinkdb.org/) – zero323 Nov 12 '15 at 04:01