0

I have a spark dataframe with column (age). I need to write a pyspark script to bucket the dataframe as a range of 10years of age( for ex age 11-20,age 21-30 ,...) and find the count of each age span entries .Need guidance on how to get through this

for ex :

I have the following dataframe

+-----+
|age  |  
+-----+
|   21|      
|   23|     
|   35|     
|   39|    
+-----+

after bucketing (expected)

+-----+------+
|age  | count|
+-----+------+
|21-30|    2 |    
|31-40|    2 |      
+-----+------+
Nightwing
  • 37
  • 6
  • Have you made any attempt? Please read this post on [how to create good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples) and try to provide a [mcve]. – pault Mar 27 '18 at 15:30

1 Answers1

0

An easy way to run such a calculation would be to compute the histogram on the underlying RDD.

Given known age ranges (fortunately, this is easy to put together - here, using 1, 11, 21, etc.), it's fairly easy to produce the histogram:

hist = df.rdd\
  .map(lambda l: l['age'])\
  .histogram([1, 11, 21,31,41,51,61,71,81,91])

This will return a tuple with "age ranges" and their respective observation count, as:

([1, 11, 21, 31, 41, 51, 61, 71, 81, 91],
  [10, 10, 10, 10, 10, 10, 10, 10, 11])

Then you can convert that back to a data frame using:

#Use zip to link age_ranges to their counts
countTuples = zip(hist[0], hist[1])
#make a list from that
ageList = list(map(lambda l: Row(age_range=l[0], count=l[1]), countTuples))
sc.parallelize(ageList).toDF()

For more info, check the histogram function's documentation in the RDD API

ernest_k
  • 44,416
  • 5
  • 53
  • 99
  • Hey @ernest kiwele thanks for your reply. I tried your method and it's giving me the desired output . Bt the answer is in the form of a tuple . I needed it as a datframe so that I can insert it into a hive table – Nightwing Mar 28 '18 at 09:28
  • Edited. It's just about creating a DF from an rdd of the results. – ernest_k Mar 28 '18 at 09:49
  • Using the given code ,I could save it into hive . Just out of curiosity,wanted to know how I could find the range count in the case where I don't know the max value of age (given age span range =10) – Nightwing Mar 28 '18 at 11:07
  • That's application logic. You might need to compute min/max to determine the ranges. The histogram method can also take just a number, and it will create the ranges itself... (`rdd.histogram(10)`) – ernest_k Mar 28 '18 at 12:02
  • After using rdd.housing(10) I get the output as Age_range Count 19.0 149 24.6 262 30.2 177 Which is wrong – Nightwing Mar 28 '18 at 12:50
  • Yes, it's not going to necessarily match your expected range (hence we passed in the pre-defined range) because it calculates dynamically; but it's an alternative to having to know static ranges beforehand. – ernest_k Mar 28 '18 at 12:58
  • Thanks for all your help :) – Nightwing Mar 28 '18 at 14:22