I'm an amateur coder and very new to PySpark. I need to convert my Python code to PySpark so it can run over millions of rows, and I'm stumped on simple things. I need to iterate a few operations within blocks of data. I have done this with the code below in Python, but I cannot work out how to do the same in PySpark. Can someone please help?
Python code:

```python
import pandas as pd

new_df = pd.DataFrame()
for blocking_id, df in old_df.groupby(by=['blockingId']):
    # keep only rows whose country appears more than once within this block
    nons = df[df.groupby('country')['country'].transform('size') > 1]
    new_df = pd.concat([new_df, nons], axis=0)
```
A sample block of input looks like this:
Name | blockingId | Country |
---|---|---|
name_1 | block_1 | country_1 |
name_2 | block_1 | country_2 |
name_3 | block_2 | country_2 |
name_4 | block_2 | country_2 |
The above code groups the data by block and, within each block, removes the rows whose country appears only once (keeping only countries that occur more than once in that block). For the sample above, only the two block_2 rows (name_3 and name_4) would be kept, since country_2 appears twice within that block.
I know about window functions in PySpark, but I don't understand how to apply operations within the blocks (there are many more operations to do within each block) and append the output to another dataframe. Please help.
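From what I've read about window functions, something like the sketch below might be the equivalent of my pandas snippet. I'm assuming here that `old_df` is already a Spark DataFrame and that the column names are `blockingId` and `Country` as in my sample, but I don't know if this is the right approach or how to chain many more per-block operations onto it:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("block_filter").getOrCreate()

# old_df is assumed to be a Spark DataFrame with columns Name, blockingId, Country
w = Window.partitionBy("blockingId", "Country")

new_df = (
    old_df
    # count how many rows share this block + country combination
    .withColumn("country_count", F.count("*").over(w))
    # keep only countries that appear more than once within the block
    .filter(F.col("country_count") > 1)
    .drop("country_count")
)
```

Is this the right direction, or is there a better pattern in PySpark for applying many different operations per block?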