
I have a pyspark data frame with multiple columns as follows:

name    col1    col2    col3
A        1        6       7
B        2        7       6
C        3        8       5
D        4        9       4
E        5        8       3

I want to create a new dataframe in pyspark by combining the column names and column values of col1, col2, col3 into two new columns, say, new_col and new_col_val, spread across rows:

name    new_col    new_col_val
A       col1       1
B       col1       2
C       col1       3
D       col1       4
E       col1       5
A       col2       6
B       col2       7
C       col2       8
D       col2       9
E       col2       8
A       col3       7
B       col3       6
C       col3       5
D       col3       4
E       col3       3

I did the same in R using the following code:

df1 <- gather(df,new_col,new_col_val,-name)

I was thinking of creating 3 separate dataframes, each holding one column from the original dataframe, and then appending them together, but my data has more than 2500k rows and around 60 columns, so creating multiple dataframes would be the worst idea. Can anyone please tell me how I can do this in pyspark?

user3642360

  • is `melt` an option in pyspark? – acushner Feb 06 '20 at 18:08
  • there is no function like melt in pyspark, but from this link https://stackoverflow.com/questions/41670103/how-to-melt-spark-dataframe I could solve the problem. Thanks for giving me the idea to use melt :) – user3642360 Feb 06 '20 at 18:28
  • you can refer to the answer to a similar question: https://codereview.stackexchange.com/questions/200391/pyspark-code-that-turns-columns-into-rows – Prabhanj Feb 07 '20 at 07:16

3 Answers


You can use unionAll to convert the columns to rows, with lit supplying the column name, as shown below. (In Spark 2.0+, union is the non-deprecated alias for unionAll.)

from pyspark.sql.functions import lit

# Start with the first value column: keep the name column, add the column
# name as a literal, and alias the values to new_col_val.
df2 = df.select(df.columns[0], lit(df.columns[1]).alias('new_col'),
                df[df.columns[1]].alias('new_col_val'))

# Union in each remaining column the same way.
for i in df.columns[2:]:
    df2 = df2.unionAll(df.select(df.columns[0], lit(i), df[i]))
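For the roughly 60 columns mentioned in the question, the same approach can be written more compactly with functools.reduce; this is a sketch equivalent to the loop above and produces the same output shown below:

from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import lit

# Build one three-column projection per value column, then union them all.
df2 = reduce(
    DataFrame.unionAll,
    [df.select(df.columns[0], lit(c).alias('new_col'),
               df[c].alias('new_col_val'))
     for c in df.columns[1:]]
)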

Output:

+----+-------+-----------+
|name|new_col|new_col_val|
+----+-------+-----------+
|   A|   col1|          1|
|   B|   col1|          2|
|   C|   col1|          3|
|   D|   col1|          4|
|   E|   col1|          5|
|   A|   col2|          6|
|   B|   col2|          7|
|   C|   col2|          8|
|   D|   col2|          9|
|   E|   col2|          8|
|   A|   col3|          7|
|   B|   col3|          6|
|   C|   col3|          5|
|   D|   col3|          4|
|   E|   col3|          3|
+----+-------+-----------+

Note: All the value columns must be of the same datatype, since unionAll matches columns by position without casting.

To check whether the columns are of the same datatype,

# df.dtypes is a list of (column_name, dtype) pairs; [1:] skips the name column
if len(set(map(lambda x: x[-1], df.dtypes[1:]))) != 1:
    raise AssertionError("All columns must be of the same datatype")
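If the value columns are numeric but of mixed types (say, int and double), a possible workaround is to cast them to a common type before unioning; a minimal sketch, assuming every value column can safely be cast to double:

from pyspark.sql.functions import col

# Cast each value column to double so the positional type-matching
# of unionAll succeeds (assumes all value columns are numeric).
df_uniform = df.select(
    df.columns[0],
    *[col(c).cast("double").alias(c) for c in df.columns[1:]]
)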
noufel13

Basically you are trying to unpivot the columns, and in Spark you can do this with the SQL stack function:

from pyspark.sql.functions import expr

# stack(3, ...) emits 3 rows per input row, one per (column name, value) pair
df.select(
    "name",
    expr("stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (new_col, new_col_val)")
).show()
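Since the question mentions around 60 columns, writing the stack arguments out by hand is impractical. A sketch that builds the same expression programmatically, assuming name is the only key column:

from pyspark.sql.functions import expr

# Build "stack(k, 'c1', c1, 'c2', c2, ...)" over every non-key column.
value_cols = [c for c in df.columns if c != "name"]
stack_expr = "stack({}, {}) as (new_col, new_col_val)".format(
    len(value_cols),
    ", ".join("'{0}', `{0}`".format(c) for c in value_cols),
)
df.select("name", expr(stack_expr)).show()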
code.gsoni

It is relatively simple to do with a melt function. Note that melt is not built into PySpark's DataFrame API (a native unpivot/melt only arrived in Spark 3.4), so you need a helper such as the one from the Stack Overflow question linked in the comments above.
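A minimal sketch of such a helper, adapted from the implementation in the linked question (the default output column names here are chosen to match this question):

from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars,
         var_name='new_col', value_name='new_col_val'):
    # One struct per value column, pairing the column's name with its value.
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars
    ))
    # explode turns the array into one row per (column name, value) pair.
    tmp = df.withColumn('_vars_and_vals', explode(vars_and_vals))
    cols = list(id_vars) + [
        col('_vars_and_vals')[x].alias(x) for x in (var_name, value_name)
    ]
    return tmp.select(*cols)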

# sdf is the Spark DataFrame; here it is created from a pandas DataFrame pdf
sdf = spark.createDataFrame(pdf)
melt(sdf, id_vars=['name'], value_vars=['col1', 'col2', 'col3']).show()
Jay Kakadiya