bucketing a spark dataframe- pyspark

Question

I have a spark dataframe with column (age). I need to write a pyspark script to bucket the dataframe as a range of 10years of age( for ex age 11-20,age 21-30 ,...) and find the count of each age span entries .Need guidance on how to get through this

for ex :

I have the following dataframe

+-----+
|age  |  
+-----+
|   21|      
|   23|     
|   35|     
|   39|    
+-----+

after bucketing (expected)

+-----+------+
|age  | count|
+-----+------+
|21-30|    2 |    
|31-40|    2 |      
+-----+------+

Have you made any attempt? Please read this post on [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples) and try to provide a [mcve]. — pault, Mar 27 '18 at 15:30

ernest_k · Accepted Answer · 2018-03-28T09:47:41.210

0

An easy way to run such a calculation would be to compute the histogram on the underlying RDD.

Given known age ranges (fortunately, this is easy to put together - here, using 1, 11, 21, etc.), it's fairly easy to produce the histogram:

hist = df.rdd\
  .map(lambda l: l['age'])\
  .histogram([1, 11, 21,31,41,51,61,71,81,91])

This will return a tuple with "age ranges" and their respective observation count, as:

([1, 11, 21, 31, 41, 51, 61, 71, 81, 91],
  [10, 10, 10, 10, 10, 10, 10, 10, 11])

Then you can convert that back to a data frame using:

#Use zip to link age_ranges to their counts
countTuples = zip(hist[0], hist[1])
#make a list from that
ageList = list(map(lambda l: Row(age_range=l[0], count=l[1]), countTuples))
sc.parallelize(ageList).toDF()

For more info, check the histogram function's documentation in the RDD API

edited Mar 28 '18 at 09:47

answered Mar 27 '18 at 18:02

ernest_k

44,416
5
53
99

Hey @ernest kiwele thanks for your reply. I tried your method and it's giving me the desired output . Bt the answer is in the form of a tuple . I needed it as a datframe so that I can insert it into a hive table – Nightwing Mar 28 '18 at 09:28
Edited. It's just about creating a DF from an rdd of the results. – ernest_k Mar 28 '18 at 09:49
Using the given code ,I could save it into hive . Just out of curiosity,wanted to know how I could find the range count in the case where I don't know the max value of age (given age span range =10) – Nightwing Mar 28 '18 at 11:07
That's application logic. You might need to compute min/max to determine the ranges. The histogram method can also take just a number, and it will create the ranges itself... (`rdd.histogram(10)`) – ernest_k Mar 28 '18 at 12:02
After using rdd.housing(10) I get the output as Age_range Count 19.0 149 24.6 262 30.2 177 Which is wrong – Nightwing Mar 28 '18 at 12:50
Yes, it's not going to necessarily match your expected range (hence we passed in the pre-defined range) because it calculates dynamically; but it's an alternative to having to know static ranges beforehand. – ernest_k Mar 28 '18 at 12:58
Thanks for all your help :) – Nightwing Mar 28 '18 at 14:22

bucketing a spark dataframe- pyspark

1 Answers1

Linked