
I am having difficulty converting an RDD of the following structure to a DataFrame in Spark using Python.

df1 = [['usr1', ('itm1', 2), ('itm3', 3)], ['usr2', ('itm2', 3), ('itm3', 5), ('itm22', 6)]]

After converting, my dataframe should look like the following:

       usr1  usr2
itm1    2.0   NaN
itm2    NaN   3.0
itm22   NaN   6.0
itm3    3.0   5.0

I was initially thinking of converting the above RDD structure to the following:

dat2 = {'usr1': {'itm1': 2, 'itm3': 3}, 'usr2': {'itm2': 3, 'itm3': 5, 'itm22': 6}}

Then use Python's pandas module, pand = pd.DataFrame(dat2), and then convert the pandas DataFrame back to a Spark DataFrame using spark_df = context.createDataFrame(pand). However, I believe that by doing this I am converting an RDD to a non-RDD object and then converting back to an RDD, which is not correct. Can someone please help me out with this problem?
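For reference, the round trip I had in mind looks roughly like this (just a sketch of the idea; context is my SQLContext, and note the item index would be dropped on the way back to Spark):

import pandas as pd

dat2 = {'usr1': {'itm1': 2, 'itm3': 3}, 'usr2': {'itm2': 3, 'itm3': 5, 'itm22': 6}}
pand = pd.DataFrame(dat2)                 # items become the index, users the columns
spark_df = context.createDataFrame(pand)  # the item index is not carried over as a column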

Rkz
  • How is that different, excluding column choice, [from your previous question](http://stackoverflow.com/q/37514344/1560062)? – zero323 May 31 '16 at 19:02
  • Please note that in my previous question I was more concerned with handling duplicate "itms" for the same user (see "if there is one more count field like in the above tuple i.e., ('itm1',3) how to incorporate (or add) this value 3 into the final result of the contingency table (or entity-item matrix)"). Since the answer that was given is still not clear (at least from my perspective), if I am able to get a solution to this question, I can close down on the answer to my previous question. – Rkz May 31 '16 at 19:25

1 Answer


With data like this:

rdd = sc.parallelize([
    ['usr1',('itm1',2),('itm3',3)], ['usr2',('itm2',3), ('itm3',5),('itm22',6)]
])

flatten the records:

def to_record(kvs):
    user, *vs = kvs  # For Python 2.x use standard indexing / slicing
    for item, value in vs:
        yield user, item, value

records = rdd.flatMap(to_record)
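A quick sanity check on the flattened records (the output below is just what the sample data above should produce):

records.take(2)
## [('usr1', 'itm1', 2), ('usr1', 'itm3', 3)]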

convert to DataFrame:

df = records.toDF(["user", "item", "value"])
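The schema is inferred from the tuples, with Python ints coming through as longs:

df.printSchema()
## root
##  |-- user: string (nullable = true)
##  |-- item: string (nullable = true)
##  |-- value: long (nullable = true)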

pivot:

result = df.groupBy("item").pivot("user").sum()

result.show()
## +-----+----+----+
## | item|usr1|usr2|
## +-----+----+----+
## | itm1|   2|null|
## | itm2|null|   3|
## | itm3|   3|   5|
## |itm22|null|   6|
## +-----+----+----+
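As a side note, if the set of users is known up front, you can pass it to pivot so Spark doesn't need an extra job to compute the distinct pivot values (just a variant, not required here):

result = df.groupBy("item").pivot("user", ["usr1", "usr2"]).sum("value")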

Note: Spark DataFrames are designed to handle long and relatively thin data. If you want to generate a wide contingency table, DataFrames won't be useful, especially if the data is dense and you want to keep a separate column per feature.
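If the pivoted result is small enough to collect, you can still get the exact layout from the question by pulling it into pandas on the driver (a sketch, assuming result from above fits in driver memory):

pdf = result.toPandas().set_index("item").sort_index()
## items as the index, one column per user; nulls show up as NaN/None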

zero323