
Currently I am trying to convert an RDD to a contingency table in order to use the pyspark.ml.clustering.KMeans module, which takes a DataFrame as input.

When I do myrdd.take(K) (where K is some number), the structure looks as follows:

[[u'user1',('itm1',3),...,('itm2',1)], [u'user2',('itm1',7),..., ('itm2',4)],...,[u'usern',('itm2',2),...,('itm3',10)]]

Each list contains an entity as its first element, followed by (item, count) tuples for every item that entity liked.

Now, my objective is to convert the above into a Spark DataFrame that resembles the following contingency table:

+----------+------+----+-----+
|entity    |itm1  |itm2|itm3 |
+----------+------+----+-----+
|    user1 |     3|   1|    0|
|    user2 |     7|   4|    0|
|    usern |     0|   2|   10|
+----------+------+----+-----+

I have used the df.stat.crosstab method as described in the following link:

Statistical and Mathematical Functions with DataFrames in Apache Spark - 4. Cross Tabulation (Contingency Table)

and it comes close to what I want.

But each tuple also carries a count value, e.g. the 3 in ('itm1',3). How do I incorporate (or add) that value into the final contingency table (the entity-item matrix)?

Of course, I could take the long route: convert the above RDD into a matrix, write it out as a CSV file, and then read it back as a DataFrame.

Is there a simpler way to do this using the DataFrame API?

    Possible duplicate of [Pivot Spark Dataframe](http://stackoverflow.com/questions/30244910/pivot-spark-dataframe) – zero323 May 30 '16 at 10:16
  • I don't agree with @zero323 on this question being a "direct" duplicate but the [link provided](http://stackoverflow.com/a/35676755/3415409) supplies an alternative way to do what you are seeking. – eliasah May 30 '16 at 11:53
  • The answer to this question can be found in a more recent question of mine: [Convert RDD to Dataframe](http://stackoverflow.com/questions/37552052/convert-a-rdd-of-tuples-of-varying-sizes-to-a-dataframe-in-spark). Although that question was originally about converting an RDD structure to a DataFrame, the final part of its answer, which uses pivot, groupBy, and sum, also solves this question (a sketch of that approach follows these comments). If anyone feels it's a duplicate, I will close the current question. – Rkz May 31 '16 at 20:40
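For reference, a minimal sketch of that groupBy/pivot/sum approach, assuming the RDD records have the [entity, (item, count), ...] shape shown above and an active SparkSession (the variable names are illustrative):

# Flatten [entity, (item, count), ...] records into (entity, item, count) rows
flat = myrdd.flatMap(lambda rec: [(rec[0], item, count) for item, count in rec[1:]])
df = flat.toDF(["entity", "item", "count"])

# Pivot into an entity-item contingency table, summing the counts;
# entities that never touched an item get null, so fill those cells with 0
table = df.groupBy("entity").pivot("item").sum("count").na.fill(0)
table.show()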

1 Answer


Convert the RDD to a PySpark DataFrame using the createDataFrame() method.
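For example, a minimal sketch of that conversion, assuming an active SparkSession and rows already flattened into (entity, item, count) triples (the names here are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical flattened rows: (entity, item, count)
rows = [("user1", "itm1", 3), ("user1", "itm2", 1), ("user2", "itm1", 7)]
df = spark.createDataFrame(rows, ["entity", "item", "count"])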

Then use the show method after calling the crosstab method. Refer to the following example:

cf = train_predictions.crosstab("prediction", "label_col")

To display it in tabular format:

cf.show()

Output:

+--------------------+----+----+
|prediction_label_col| 0.0| 1.0|
+--------------------+----+----+
|                 1.0| 752|1723|
|                 0.0|1830| 759|
+--------------------+----+----+
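Note that crosstab counts how often each (prediction, label_col) pair occurs; it does not sum a separate count column. To fold existing counts into the cells, as the question asks, use the groupBy/pivot/sum approach sketched under the question's comments above.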
Sayali Sonawane