
I'm using Spark to find the median of a large dataset (around 300 PB). What's the best way to optimize this? (BTW, the result doesn't have to be strictly accurate.)

Yongcong Luo
1 Answer


You could approach this problem in two ways:

1- Use the meanApprox(long timeout, double confidence) method, which returns an approximate mean within the given timeout and confidence. (Note that this gives you an approximate mean rather than a median.)
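
For illustration, here is a minimal PySpark sketch of that first approach; the app name, input path, and timeout value are assumptions for the example, not something from the question:

from pyspark import SparkContext

sc = SparkContext(appName="approx-mean")                 # assumed app name
rdd = sc.textFile("hdfs:///path/to/data").map(float)     # hypothetical input path

# meanApprox returns a BoundedFloat: the value itself is the approximate mean,
# and .low / .high give the bounds of the confidence interval.
# The timeout is in milliseconds.
result = rdd.meanApprox(timeout=60000, confidence=0.95)
print(result, result.low, result.high)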

2- Use the sample(boolean withReplacement, double fraction, long seed) method to take a sample of the RDD and compute the statistic you need on that sample, for example:

fraction = 0.0001                                  # fraction of the data to sample
sampledRDD = rdd.sample(False, fraction, seed=42)
approxMean = sampledRDD.mean()
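
Since the question actually asks for a median, you could equally compute the median of that sample; here is a rough sketch (the sort-based middle-element trick is my own illustration, not part of the original answer, and it assumes the sampled values are numeric):

# Approximate median: sort the sample and pick the middle element,
# staying distributed so the sample never has to fit on the driver.
n = sampledRDD.count()
middle = n // 2
indexed = sampledRDD.sortBy(lambda x: x).zipWithIndex().map(lambda vi: (vi[1], vi[0]))
approxMedian = indexed.lookup(middle)[0]

In both cases the accuracy improves as you allow a larger fraction (or a longer timeout), at the cost of more computation.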

I hope this helps you solve your problem. For more details you can visit https://spark.apache.org/docs.