I am having difficulty converting an RDD of the following structure to a DataFrame in Spark using Python:
df1=[['usr1',('itm1',2),('itm3',3)], ['usr2',('itm2',3), ('itm3',5),('itm22',6)]]
After converting, my dataframe should look like the following:
       usr1  usr2
itm1    2.0   NaN
itm2    NaN   3.0
itm22   NaN   6.0
itm3    3.0   5.0
I was initially thinking of converting the above RDD structure to the following:
dat2={'usr1': {'itm1': 2, 'itm3': 3}, 'usr2': {'itm2': 3, 'itm3': 5, 'itm22':6}}
Then use Python's pandas module, pand=pd.DataFrame(dat2), and then convert the pandas DataFrame back to a Spark DataFrame using spark_df = context.createDataFrame(pand). However, I believe that by doing this I am converting an RDD to a non-RDD object and then converting it back to an RDD, which is not correct. Can someone please help me out with this problem?
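For reference, here is a minimal sketch of the intermediate conversion I have in mind, written in plain Python on a local list rather than on the actual RDD. The names rows, to_nested_dict and nested are only illustrative, and the sketch assumes every row has the user as its first element followed by (item, value) tuples:

```python
# Illustrative sample data with the same shape as the rows shown above.
rows = [
    ['usr1', ('itm1', 2), ('itm3', 3)],
    ['usr2', ('itm2', 3), ('itm3', 5), ('itm22', 6)],
]


def to_nested_dict(rows):
    """Map each user (first element of a row) to a dict built from
    the remaining (item, value) tuples of that row."""
    return {row[0]: dict(row[1:]) for row in rows}


nested = to_nested_dict(rows)
print(nested)
```

The resulting nested dict is what I would pass to pd.DataFrame, which is exactly the step that pulls the data out of Spark and onto the driver.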