
I'm an amateur coder and very new to PySpark. I need to convert my Python code to PySpark to run over millions of rows, and I'm stumped on simple things. I need to run a few operations within blocks of data. I have done it with the code below in Python, but I cannot figure out how to do the same in PySpark. Can someone please help?

Python code:

import pandas as pd

new_df = pd.DataFrame()
for blocking_id, df in old_df.groupby(by=["blockingId"]):
    # keep rows whose country appears more than once within this block
    nons = df[df.groupby("country")["country"].transform("size") > 1]
    new_df = pd.concat([new_df, nons], axis=0)

A sample block of input could be:

Name    blockingId  country
name_1  block_1     country_1
name_2  block_1     country_2
name_3  block_2     country_2
name_4  block_2     country_2

The above code groups the data into blocks and, within each block, removes the rows whose country appears only once in that block.
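
For illustration, a minimal PySpark sketch of this particular filter using a window function (assuming old_df is already a Spark DataFrame with the columns shown above; this covers only this single filter, not the more general per-block iteration):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# count how many rows share the same (blockingId, country) pair
w = Window.partitionBy("blockingId", "country")

new_df = (
    old_df
    .withColumn("country_count", F.count("*").over(w))
    .filter(F.col("country_count") > 1)   # keep countries repeated within the block
    .drop("country_count")
)

On the sample above this keeps name_3 and name_4 (country_2 appears twice in block_2) and drops both rows of block_1, where each country appears only once.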

I know about window functions in PySpark, but I don't understand how to apply arbitrary operations within the blocks (there are many more operations to do within each block) and append the output to another dataframe. Please help.

    Your python/pandas code should be: `old_df[old_df.duplicated(['blockingId', 'country'], keep=False)]` – Corralien Feb 14 '23 at 11:41
  • @Corralien the above code will delete the countries which have been repeated more than once. I was trying to achieve the opposite: retain the rows in blocks with countries repeated more than once. – Megha_991 Feb 14 '23 at 11:48
  • @Corralien never mind, I thought you were doing drop duplicates, but I see it now. Thank you. But I need to know how to iterate within grouped blocks for other functions too. – Megha_991 Feb 14 '23 at 11:49
  • You want the same output but with PySpark? – Corralien Feb 14 '23 at 11:51
  • Yes, but more importantly, how to loop over grouped blocks. – Megha_991 Feb 14 '23 at 11:51
  • You should read https://stackoverflow.com/questions/40006395/applying-udfs-on-groupeddata-in-pyspark-with-functioning-python-example. – Corralien Feb 14 '23 at 11:56
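
Following the linked question, a minimal sketch of the grouped-map approach with groupBy().applyInPandas (assuming Spark 3.0+ with pyarrow installed, and that each block fits in memory on an executor); the per-block logic stays plain pandas, so further per-block operations can be added inside the function:

import pandas as pd

# Spark hands each blockingId group to this function as a pandas DataFrame
def keep_repeated_countries(pdf: pd.DataFrame) -> pd.DataFrame:
    # same filter as the original loop body; add other per-block steps here
    return pdf[pdf.groupby("country")["country"].transform("size") > 1]

new_df = old_df.groupBy("blockingId").applyInPandas(
    keep_repeated_countries,
    schema=old_df.schema,   # output columns match the input columns
)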
