I'm new to PySpark and am struggling with the following task. I have tried a few approaches, but none of them worked properly. The data is as follows:

id|numb_of_count|
1|3|
2|5|
3|6|
4|2|
5|0|
6|15|
7|8|
8|99|

I want to achieve the following result:

id|numb_of_count|banding|
1|3|3-5|
2|5|3-5| 
3|6|6-10|
4|2|2|
5|0|0|
6|15|+11|
7|8|6-10|
8|99|+11|

What is the most efficient way to achieve this, given that I have a large dataset?

default_settings
  • Seems like you want a [series of `if`/`else`](https://stackoverflow.com/a/39048475/5858851) statements. – pault Jul 09 '18 at 14:15
  • Hi @pault, could you suggest some example code? – default_settings Jul 09 '18 at 15:07
  • You'll have to fill in the logic for the conditions yourself, but you need something like `df.withColumn('banding', when(col('numb_of_count') == 0, "0").when(condition).when(condition).otherwise("+11"))` – pault Jul 09 '18 at 15:11

1 Answer

In PySpark, when/otherwise is the equivalent of if/else. If df is your DataFrame, then:

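from pyspark.sql.functions import col, when

# Cast the small counts to string so every branch returns the same type.
new_df = df.withColumn(
    'banding',
    when(col('numb_of_count') < 3, col('numb_of_count').cast('string'))
    .when(col('numb_of_count') <= 5, '3-5')
    .when(col('numb_of_count') <= 10, '6-10')
    .otherwise('+11')
)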

df.withColumn

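df.withColumn returns a new DataFrame with an added column. The first argument is the name of the new column and the second is the column expression that computes it. See the PySpark documentation for DataFrame.withColumn for more.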

when/otherwise

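Analogous to if/else: the conditions are checked in order, the first one that matches determines the value, and otherwise supplies the default when none match. See the PySpark documentation for pyspark.sql.functions.when for more.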

This is an excellent answer to learn more about when/otherwise.
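
To make the branch order concrete, here is a minimal plain-Python sketch of the same chain. The function name `banding` is hypothetical and exists only for illustration. It is meant for reasoning about which branch fires, not for applying row by row on a large dataset, where the built-in when/otherwise expression is the better fit. Note that the `<= 5` branch needs no lower bound because anything below 3 has already been handled.

```python
def banding(numb_of_count: int) -> str:
    """Plain-Python mirror of the when/otherwise chain: the first matching branch wins."""
    if numb_of_count < 3:
        # Small counts keep their own value, as a string (0 -> "0", 2 -> "2").
        return str(numb_of_count)
    elif numb_of_count <= 5:
        # Only reached when numb_of_count >= 3, so this covers 3 to 5.
        return "3-5"
    elif numb_of_count <= 10:
        # Only reached when numb_of_count >= 6, so this covers 6 to 10.
        return "6-10"
    else:
        # Everything from 11 upwards.
        return "+11"


# The numb_of_count values from the question, in id order.
sample = [3, 5, 6, 2, 0, 15, 8, 99]
bands = [banding(n) for n in sample]
print(bands)
```

The printed bands line up with the banding column the question asks for.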

Rahul Chawla