I'm using Spark to find the median of a large dataset (around 300 PB). What's the best way to optimize this? (By the way, the result doesn't have to be strictly accurate.)
- you mean, median? – Jay Vignesh Jul 15 '20 at 06:50
- hope this helps you - https://stackoverflow.com/questions/31432843/how-to-find-median-and-quantiles-using-spark – dsk Jul 15 '20 at 08:00
1 Answer
You could solve this problem in two ways:

1. Use the meanApprox(long timeout, double confidence) function, which returns an approximate mean within the given timeout and confidence.

2. Use the sample(boolean withReplacement, double fraction, long seed) method and compute the statistic on the sample, for example:

fraction, seed = 0.001, 42  # placeholder values: sample roughly 0.1% of the data
sampledRDD = rdd.sample(False, fraction, seed)  # sample without replacement
approxMean = sampledRDD.mean()  # estimate from the sample

I hope this helps you solve your problem. For more detail, see https://spark.apache.org/docs.
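Since the goal here is a median rather than a mean, a closely related option is DataFrame.approxQuantile, which estimates quantiles within a chosen relative error without fully sorting the data. Below is a minimal sketch, assuming the data can be read as a DataFrame with a numeric column named "value"; the path, sample fraction, and error tolerance are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("approx-median").getOrCreate()

# Placeholder path; assumes a numeric column named "value".
df = spark.read.parquet("/path/to/dataset")

# Optionally sample first to cut I/O on a very large dataset.
sampled = df.sample(withReplacement=False, fraction=0.001, seed=42)

# approxQuantile(col, probabilities, relativeError): 0.5 is the median;
# a larger relativeError is cheaper, and 0.0 forces the exact quantile.
approx_median = sampled.approxQuantile("value", [0.5], 0.01)[0]
print(approx_median)

approxQuantile uses a variant of the Greenwald-Khanna algorithm, so the error bound is on the rank of the returned value rather than its magnitude, which fits a "doesn't have to be strictly accurate" requirement.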

Vahid Shahrivari
- Does finding the median of a sample give you a number similar to the median of the whole dataset? How do you ensure they are similar? – Yongcong Luo Jul 16 '20 at 23:42
- For "mean", "count", and "sum" we could provide a confidence bound relative to the sample size, but for the median it depends on whether your data is sorted or not. – Vahid Shahrivari Jul 18 '20 at 06:48