
I am trying to change the structure of a PySpark DataFrame.

It is essentially an unpivot (melt) operation, a kind of transpose.

Can anyone help?

I tried a few methods, like explode() and a custom transpose function, but could not fully achieve this.

Source DF is below-

    num name    x   y   z
    1   abc    10   20  30
    2   def    40   50  60
    3   ghi    70   80  90

I want to change it to the below form:

    num name   variable value
    1   abc    x        10
    1   abc    y        20
    1   abc    z        30
    2   def    x        40
    2   def    y        50
    2   def    z        60
    3   ghi    x        70
    3   ghi    y        80
    3   ghi    z        90
  • This question gives an easy to understand answer in this case: https://stackoverflow.com/questions/42465568/unpivot-in-spark-sql-pyspark – Shaido Aug 07 '19 at 06:17

1 Answer

from pyspark.sql.functions import array, col, explode, lit, struct

def to_long(df, by, transform_cols):
    # Build one (key, val) struct per column to unpivot, collect them in an
    # array, and explode that array into one row per column.
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in transform_cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(
        by + [col("kvs.key").alias("variable"), col("kvs.val").alias("value")]
    )

# Keep num and name as identifier columns; unpivot x, y and z.
to_long(df, ["num", "name"], ["x", "y", "z"]).show()

This solves the issue. In the to_long function, pass the DataFrame, the list of columns to keep as identifiers, and the list of columns you want to unpivot.
