I am trying to use pandas' "apply" inside code that Spark distributes to the executors, but "apply" does not seem to run at all. Can "apply" be used inside a function that is shipped to the executors via sc.parallelize on an RDD?
Code:
import pandas as pd
from pyspark import SparkContext

def testApply(k):
    # Build a small frame for one key; the scalar k broadcasts against the 5-element list
    return pd.DataFrame({'col1': k, 'col2': [k * 2] * 5})

def testExec(x):
    df = pd.DataFrame({'col1': range(0, 10)})
    ddf = pd.DataFrame(columns=['col1', 'col2'])
    # In my case the line below doesn't get executed at all
    res = df.apply(lambda row: testApply(row.col1) if row.col1 % 2 == 0 else pd.DataFrame(), axis=1)
    return res

list1 = [1, 2, 3, 4]
sc = SparkContext.getOrCreate()
testRdd = sc.parallelize(list1)
output = testRdd.map(lambda x: testExec(x)).collect()
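
For reference, the same "apply" logic behaves as expected when run locally on the driver, without parallelize. Below is a minimal sketch of that check, assuming the column is named col1 as in the example above:

import pandas as pd

def testApply(k):
    return pd.DataFrame({'col1': k, 'col2': [k * 2] * 5})

# Same frame and apply call, executed directly on the driver
df = pd.DataFrame({'col1': range(0, 10)})
res = df.apply(lambda row: testApply(row.col1) if row.col1 % 2 == 0 else pd.DataFrame(), axis=1)
print(len(res))        # 10: apply yields one entry per row of df
print(res[0].shape)    # (5, 2): even rows hold the 5-row frames built by testApply

So the pandas call itself appears to work in isolation; the question is why the same call does not seem to run once testExec is executed inside the Spark executors.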