0

I have the following dataframe in spark.

T1 T2 T3 love 0.3 0.2 0.5 like 0.1 0.5 0.4 Sleep 0.1 0.1 0.8 Dance 0.3 0.4 0.3 I want to transpose the above matrix as love like sleep Dance T1 0.3 0.1 0.1 0.3 T2 0.2 0.5 0.1 0.4 T3 0.5 0.4 0.8 0.3 However, unlike Python, Scala does not seem to have an easy way to transpose.

What should I do?

Please Help me!

himanshuIIITian
  • 5,985
  • 6
  • 50
  • 70
Yoo
  • 3
  • 3

1 Answers1

0

I don't know Scala but will give a PySpark solution that can easily be converted to Scala:

First you melt the DF and then use pivot to get the form you want. The pivot operation is built-in, but we need to code the melt by hand. The idea is to create a "long" DF where the "T" column values are in a single column and there is an additional column telling us which of the "T" columns the value represents. An easy way to achieve this is using the underlying RDD and an application of flatMap. In what follows, I have taken the liberty of renaming the categorical in your original DF as "category":

# these are the T columns that we want to "melt"
old_columns=['T1', 'T2', 'T3']

# these columns will provide us with the columns of the tranposed DF
# I have kept this general-so you could have multiple categories
new_columns=['category']

# function to transform each RDD element to multiple elements
# for flatmapping
def reshape_row(row):
    melted_rows=[]
    for oldcol in old_columns:
        melted_rows.append([row[newcol] for newcol\
                            in new_columns]+[oldcol,row[oldcol]])
    return(melted_rows)

# melt the DF
df_melted=df.rdd.flatMap(reshape_row)\
.toDF(schema=new_columns+['T_category', 'T_value'])\

df_melted.show()

+--------+----------+-------+
|category|T_category|T_value|
+--------+----------+-------+
|    love|        T1|    0.3|
|    love|        T2|    0.2|
|    love|        T3|    0.5|
|    like|        T1|    0.1|
|    like|        T2|    0.5|
|    like|        T3|    0.4|
|   Sleep|        T1|    0.1|
|   Sleep|        T2|    0.1|
|   Sleep|        T3|    0.8|
|   Dance|        T1|    0.3|
|   Dance|        T2|    0.4|
|   Dance|        T3|    0.3|
+--------+----------+-------+

Now we can pivot to get the answer

df_melted.groupby("T_category")\
    .pivot("category")\
    .agg(f.max("T_value"))\
    .show()

+----------+-----+-----+----+----+
|T_category|Dance|Sleep|like|love|
+----------+-----+-----+----+----+
|        T3|  0.3|  0.8| 0.4| 0.5|
|        T1|  0.3|  0.1| 0.1| 0.3|
|        T2|  0.4|  0.1| 0.5| 0.2|
+----------+-----+-----+----+----+
ags29
  • 2,621
  • 1
  • 8
  • 14