
I have the following Spark dataframe :

+--------+--------------+
|agent_id|payment_amount|
+--------+--------------+
|       a|          1000|
|       b|          1100|
|       a|          1100|
|       a|          1200|
|       b|          1200|
|       b|          1250|
|       a|         10000|
|       b|          9000|
+--------+--------------+

My desired output would be something like:

agent_id   95_quantile
  a          whatever is the 95th quantile of agent a's payments
  b          whatever is the 95th quantile of agent b's payments

For each agent_id group I need to calculate the 0.95 quantile, so I take the following approach:

test_df.groupby('agent_id').approxQuantile('payment_amount', 0.95)

but I get the following error:

'GroupedData' object has no attribute 'approxQuantile'

I need to have the .95 quantile (percentile) in a new column so it can later be used for filtering purposes.

I am using Spark 2.0.0
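
For reference, the exact per-group 0.95 quantile on this small sample can be sanity-checked outside Spark with the Python standard library. This is just an illustrative sketch of the desired result (using linear interpolation between closest ranks), not the Spark API:

```python
# Sanity check outside Spark: exact 0.95 quantile per agent on the
# sample data above, via the stdlib statistics module.
from statistics import quantiles

payments = {
    "a": [1000, 1100, 1200, 10000],
    "b": [1100, 1200, 1250, 9000],
}

# quantiles(..., n=20, method="inclusive") returns the 19 cut points at
# 5%, 10%, ..., 95%; the last one is the 0.95 quantile.
q95 = {agent: quantiles(vals, n=20, method="inclusive")[-1]
       for agent, vals in payments.items()}
print(q95)  # {'a': 8680.0, 'b': 7837.5}
```

Note these exact values will differ from whatever percentile_approx returns, since that function is approximate by design.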


1 Answer


One solution would be to use percentile_approx:

>>> test_df.registerTempTable("df")
>>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount,0.95) as approxQuantile from df group by agent_id")

>>> df2.show()
# +--------+-----------------+
# |agent_id|   approxQuantile|
# +--------+-----------------+
# |       a|8239.999999999998|
# |       b|7449.999999999998|
# +--------+-----------------+ 

Note 1: This solution was tested with Spark 1.6.2 and requires a HiveContext.

Note 2: approxQuantile isn't available in pyspark for Spark < 2.0.

Note 3: percentile_approx returns an approximate pth percentile of a numeric column (including floating point types) in the group. Its optional third argument B controls the approximation accuracy (default 10,000); when the number of distinct values in the column is smaller than B, this gives an exact percentile value.

EDIT: As of Spark 2+, a HiveContext is no longer required.

  • Thanks, I am going to test it. Please correct me if I am wrong: the reason I get that error is that approxQuantile is not an aggregate function? – chessosapiens Sep 22 '16 at 10:06
  • approxQuantile is a stat function; indeed, it's not an aggregate function. – eliasah Sep 22 '16 at 10:08
  • Thanks. 1. Is there any way to apply stat functions to a group of data? 2. Is it possible to create a Python wrapper of HiveContext? – chessosapiens Sep 22 '16 at 10:13
  • I'm not sure, I need to test first. HiveContext should be available in pyspark if I'm not mistaken; you just need the right build. – eliasah Sep 22 '16 at 10:16
  • Sure I did, thanks. Just the last question: by using HiveContext and a temporary table, is our solution still scalable? – chessosapiens Sep 22 '16 at 10:32
  • Of course, this doesn't mess with scalability. HiveContext is just a superset of SQLContext with some extra features, UDFs and UDAFs. – eliasah Sep 22 '16 at 11:32
  • getting `java.lang.NoSuchMethodError: sun.nio.ch.DirectBuffer.cleaner()Lsun/misc/Cleaner;` error whenever using `percentile_approx` method. – nim_10 Jun 09 '19 at 16:14
  • 1
    @Nabid check if your packages version are compatible (spark packages version must be the same) – eliasah Jun 10 '19 at 13:30