I have a PySpark DataFrame with millions of records and hundreds of columns, for example:
clm1, clm2, clm3
code1,xyz,123
code2,abc,345
code1,qwe,456
I want to split it into multiple DataFrames based on clm1, i.e. one DataFrame for clm1=code1, another for clm1=code2, and so on, then process each one and write the results to separate files. I want to run these operations in parallel to speed up the overall process. I am using the code below:
import multiprocessing

# one worker object per distinct clm1 value
S1 = myclass("code1")
S2 = myclass("code2")

# run processdata for each code in its own process,
# passing the full DataFrame to each worker
t1 = multiprocessing.Process(target=S1.processdata, args=(df,))
t2 = multiprocessing.Process(target=S2.processdata, args=(df,))
t1.start()
t2.start()
t1.join()
t2.join()
but I am getting the error below:
Method __getstate__([]) does not exist
If I use threading.Thread instead of multiprocessing.Process it works fine, but that doesn't seem to reduce the overall time.
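To make the intended grouping concrete, here is the split sketched in plain Python on the three sample rows (no Spark involved; `rows` is just the example data from the question, and the dict of lists stands in for the per-code DataFrames):

```python
from collections import defaultdict

# the sample rows from the example above: (clm1, clm2, clm3)
rows = [
    ("code1", "xyz", 123),
    ("code2", "abc", 345),
    ("code1", "qwe", 456),
]

# bucket rows by clm1 -- one group per distinct code,
# analogous to one filtered DataFrame per code
groups = defaultdict(list)
for clm1, clm2, clm3 in rows:
    groups[clm1].append((clm2, clm3))

for code, members in sorted(groups.items()):
    print(code, members)
# prints:
# code1 [('xyz', 123), ('qwe', 456)]
# code2 [('abc', 345)]
```

Each group would then be processed and written out independently; the question is how to do that concurrently without the pickling error above.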