
I have a large grouped dataset in Spark, and I need to return the percentiles from 0.01 to 0.99 for each group.

I have been using online resources to look at different methods of doing this, from operations on RDDs: How to compute percentiles in Apache Spark

To SQLContext functionality: Calculate quantile on grouped data in spark Dataframe. My question is: does anyone have an opinion on what the most efficient approach is?
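For concreteness, the SQLContext route I have been experimenting with looks roughly like the sketch below (table and column names are made up, and it assumes a Hive-enabled context so the percentile_approx UDAF is available):

```scala
// Placeholder names; assumes a HiveContext / Hive-enabled SQLContext so
// percentile_approx is available. Passing an array of percentiles returns
// all 99 values per group from a single aggregation pass.
df.registerTempTable("my_table") // df holds the grouped data

val percentiles = (1 to 99).map(_ / 100.0).mkString(", ")

val result = sqlContext.sql(s"""
  SELECT group_col,
         percentile_approx(value_col, array($percentiles)) AS pcts
  FROM my_table
  GROUP BY group_col
""")
```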

Also, as a bonus, SQLContext has functions for both percentile_approx and percentile. There isn't much documentation available online for 'percentile': is this just a non-approximated 'percentile_approx' function?


1 Answer


DataFrames will be more efficient in general. Read this post for details on the reasons: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html.

There are a few benchmarks out there as well. For example, this one claims that "the new DataFrame API is faster than the RDD API for simple grouping and aggregations".
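As an untested sketch (column names are placeholders), the same grouped percentile computation stays entirely in the DataFrame API if you call the UDAF through expr, so the optimizer sees the whole plan:

```scala
import org.apache.spark.sql.functions.expr

// Placeholder column names. expr() invokes the same percentile_approx
// UDAF as the SQL version, but without dropping to raw SQL strings.
val percentiles = (1 to 99).map(_ / 100.0).mkString(", ")

val result = df
  .groupBy("group_col")
  .agg(expr(s"percentile_approx(value_col, array($percentiles))").as("pcts"))
```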

You can look up Hive's documentation to figure out the difference between percentile and percentile_approx. In short, percentile computes an exact result but only accepts integral (integer-typed) columns, while percentile_approx works on any numeric column and returns an approximation whose accuracy can be tuned with an optional third argument.
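For illustration, here is a side-by-side sketch of the two calls (placeholder table and column names; the 10000 is the default accuracy parameter from the Hive docs):

```scala
// percentile(int_col, 0.5): exact median, integral column required.
// percentile_approx(dbl_col, 0.5, B): approximate, any numeric column;
// a larger B improves accuracy at the cost of more memory.
val compared = sqlContext.sql("""
  SELECT group_col,
         percentile(int_col, 0.5)               AS exact_median,
         percentile_approx(dbl_col, 0.5)        AS approx_median,
         percentile_approx(dbl_col, 0.5, 10000) AS approx_median_tuned
  FROM my_table
  GROUP BY group_col
""")
```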
