
I have a pyspark data frame with multiple columns as follows:

name    col1    col2    col3
A        1        6       7
B        2        7       6
C        3        8       5
D        4        9       4
E        5        8       3

I want to create a new dataframe in pyspark by combining the column names and column values of col1, col2, col3 into two new columns, say, new_col and new_col_val, spread across rows:

name    new_col    new_col_val
A       col1       1
B       col1       2
C       col1       3
D       col1       4
E       col1       5
A       col2       6
B       col2       7
C       col2       8
D       col2       9
E       col2       8
A       col3       7
B       col3       6
C       col3       5
D       col3       4
E       col3       3

I did the same in R using the following code:

df1 <- gather(df,new_col,new_col_val,-name)

I was thinking of creating 3 separate dataframes, each holding one column from the original dataframe, and then appending them together, but my data has more than 2500k rows and around 60 columns, so creating multiple dataframes would be the worst idea. Can anyone please tell me how I can do this in pyspark?

user3642360

  • is `melt` an option in pyspark? – acushner Feb 06 '20 at 18:08
  • there is no function like melt in pyspark, but from this link https://stackoverflow.com/questions/41670103/how-to-melt-spark-dataframe I could solve the problem. Thanks for giving me the idea to use melt :) – user3642360 Feb 06 '20 at 18:28
  • you can refer to the answer to a similar question: https://codereview.stackexchange.com/questions/200391/pyspark-code-that-turns-columns-into-rows – Prabhanj Feb 07 '20 at 07:16

3 Answers


You can use unionAll to convert the columns to rows, with lit supplying the column name, as shown below. (In Spark 2.0+, union is the non-deprecated alias for unionAll.)

from pyspark.sql.functions import lit

# Start with the first value column: keep the name column, add the column
# name as a literal, and alias the values to new_col_val.
df2 = df.select(df.columns[0], lit(df.columns[1]).alias('new_col'),
                df[df.columns[1]].alias('new_col_val'))

# Union in each remaining column the same way.
for i in df.columns[2:]:
    df2 = df2.unionAll(df.select(df.columns[0], lit(i), df[i]))
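For the roughly 60 columns mentioned in the question, the same approach can be written more compactly with functools.reduce; this is a sketch equivalent to the loop above and produces the same output shown below:

from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import lit

# Build one three-column projection per value column, then union them all.
df2 = reduce(
    DataFrame.unionAll,
    [df.select(df.columns[0], lit(c).alias('new_col'),
               df[c].alias('new_col_val'))
     for c in df.columns[1:]]
)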

Output:

+----+-------+-----------+
|name|new_col|new_col_val|
+----+-------+-----------+
|   A|   col1|          1|
|   B|   col1|          2|
|   C|   col1|          3|
|   D|   col1|          4|
|   E|   col1|          5|
|   A|   col2|          6|
|   B|   col2|          7|
|   C|   col2|          8|
|   D|   col2|          9|
|   E|   col2|          8|
|   A|   col3|          7|
|   B|   col3|          6|
|   C|   col3|          5|
|   D|   col3|          4|
|   E|   col3|          3|
+----+-------+-----------+

Note: All the value columns must be of the same datatype, since unionAll matches columns by position without casting.

To check whether the columns are of the same datatype,

# df.dtypes is a list of (column_name, dtype) pairs; [1:] skips the name column
if len(set(map(lambda x: x[-1], df.dtypes[1:]))) != 1:
    raise AssertionError("All columns must be of the same datatype")
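If the value columns are numeric but of mixed types (say, int and double), a possible workaround is to cast them to a common type before unioning; a minimal sketch, assuming every value column can safely be cast to double:

from pyspark.sql.functions import col

# Cast each value column to double so the positional type-matching
# of unionAll succeeds (assumes all value columns are numeric).
df_uniform = df.select(
    df.columns[0],
    *[col(c).cast("double").alias(c) for c in df.columns[1:]]
)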
noufel13

Basically you are trying to unpivot the columns, and in Spark you can do this with the SQL stack function:

from pyspark.sql.functions import expr

# stack(3, ...) emits 3 rows per input row, one per (column name, value) pair
df.select(
    "name",
    expr("stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (new_col, new_col_val)")
).show()
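Since the question mentions around 60 columns, writing the stack arguments out by hand is impractical. A sketch that builds the same expression programmatically, assuming name is the only key column:

from pyspark.sql.functions import expr

# Build "stack(k, 'c1', c1, 'c2', c2, ...)" over every non-key column.
value_cols = [c for c in df.columns if c != "name"]
stack_expr = "stack({}, {}) as (new_col, new_col_val)".format(
    len(value_cols),
    ", ".join("'{0}', `{0}`".format(c) for c in value_cols),
)
df.select("name", expr(stack_expr)).show()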
code.gsoni

It is relatively simple to do with a melt function. Note that melt is not built into PySpark's DataFrame API (a native unpivot/melt only arrived in Spark 3.4), so you need a helper such as the one from the Stack Overflow question linked in the comments above.
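A minimal sketch of such a helper, adapted from the implementation in the linked question (the default output column names here are chosen to match this question):

from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars,
         var_name='new_col', value_name='new_col_val'):
    # One struct per value column, pairing the column's name with its value.
    vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars
    ))
    # explode turns the array into one row per (column name, value) pair.
    tmp = df.withColumn('_vars_and_vals', explode(vars_and_vals))
    cols = list(id_vars) + [
        col('_vars_and_vals')[x].alias(x) for x in (var_name, value_name)
    ]
    return tmp.select(*cols)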

# sdf is the Spark DataFrame; here it is created from a pandas DataFrame pdf
sdf = spark.createDataFrame(pdf)
melt(sdf, id_vars=['name'], value_vars=['col1', 'col2', 'col3']).show()
Jay Kakadiya