I'm using Spark to find the median of a large dataset (around 300 PB). What's the best way to optimize this? (By the way, the result doesn't have to be strictly accurate.)
- you mean, median? – Jay Vignesh Jul 15 '20 at 06:50
- hope this helps you - https://stackoverflow.com/questions/31432843/how-to-find-median-and-quantiles-using-spark – dsk Jul 15 '20 at 08:00
1 Answer
You could solve this problem in two ways:

1. Use the meanApprox(long timeout, double confidence) function, which returns an approximate mean within the given timeout and confidence.

2. Use the sample(boolean withReplacement, double fraction, long seed) method and compute the statistic on the sample, for example:

fraction, seed = 0.001, 42  # placeholder values: sample roughly 0.1% of the data
sampledRDD = rdd.sample(False, fraction, seed)  # sample without replacement
approxMean = sampledRDD.mean()  # estimate from the sample

I hope this helps you solve your problem. For more detail, see https://spark.apache.org/docs.
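Since the goal here is a median rather than a mean, a closely related option is DataFrame.approxQuantile, which estimates quantiles within a chosen relative error without fully sorting the data. Below is a minimal sketch, assuming the data can be read as a DataFrame with a numeric column named "value"; the path, sample fraction, and error tolerance are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("approx-median").getOrCreate()

# Placeholder path; assumes a numeric column named "value".
df = spark.read.parquet("/path/to/dataset")

# Optionally sample first to cut I/O on a very large dataset.
sampled = df.sample(withReplacement=False, fraction=0.001, seed=42)

# approxQuantile(col, probabilities, relativeError): 0.5 is the median;
# a larger relativeError is cheaper, and 0.0 forces the exact quantile.
approx_median = sampled.approxQuantile("value", [0.5], 0.01)[0]
print(approx_median)

approxQuantile uses a variant of the Greenwald-Khanna algorithm, so the error bound is on the rank of the returned value rather than its magnitude, which fits a "doesn't have to be strictly accurate" requirement.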

Vahid Shahrivari
- Does finding the median of a sample give you a number similar to the median of the whole dataset? How do you ensure they are similar? – Yongcong Luo Jul 16 '20 at 23:42
- For "mean", "count", and "sum" we could provide a confidence bound relative to the sample size, but for the median it depends on whether your data is sorted or not. – Vahid Shahrivari Jul 18 '20 at 06:48