I'm an amateur coder and very new to PySpark. I need to convert my Python code to PySpark so it can run over millions of rows, and I'm stumped on simple things. I need to iterate a few operations within blocks of data. I have done this with the code below in Python, but I cannot work out how to do the same in PySpark. Can someone please help?
Python code:

```python
import pandas as pd

new_df = pd.DataFrame()
for blocking_id, df in old_df.groupby(by=['blockingId']):
    # keep only rows whose country appears more than once within this block
    nons = df[df.groupby('country')['country'].transform('size') > 1]
    new_df = pd.concat([new_df, nons], axis=0)
```
A sample block of input looks like this:
Name | blockingId | Country |
---|---|---|
name_1 | block_1 | country_1 |
name_2 | block_1 | country_2 |
name_3 | block_2 | country_2 |
name_4 | block_2 | country_2 |
The above code groups the data by block and, within each block, removes the rows whose country appears only once (keeping only countries that occur more than once in that block). For the sample above, only the two block_2 rows (name_3 and name_4) would be kept, since country_2 appears twice within that block.
I know about window functions in PySpark, but I don't understand how to apply operations within the blocks (there are many more operations to do within each block) and append the output to another dataframe. Please help.
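From what I've read about window functions, something like the sketch below might be the equivalent of my pandas snippet. I'm assuming here that `old_df` is already a Spark DataFrame and that the column names are `blockingId` and `Country` as in my sample, but I don't know if this is the right approach or how to chain many more per-block operations onto it:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("block_filter").getOrCreate()

# old_df is assumed to be a Spark DataFrame with columns Name, blockingId, Country
w = Window.partitionBy("blockingId", "Country")

new_df = (
    old_df
    # count how many rows share this block + country combination
    .withColumn("country_count", F.count("*").over(w))
    # keep only countries that appear more than once within the block
    .filter(F.col("country_count") > 1)
    .drop("country_count")
)
```

Is this the right direction, or is there a better pattern in PySpark for applying many different operations per block?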