How to Transpose a DataFrame Without Aggregation in Spark Using Python

Question

Here is the input DataFrame:

+-----------+--------+--------+--------+--------+
|COLUMN_NAME| VALUE1 | VALUE2 | VALUE3 | VALUEN |
+-----------+--------+--------+--------+--------+
|col1       | val11  | val21  | val31  | valN1  |
|col2       | val12  | val22  | val32  | valN2  |
|col3       | val13  | val23  | val33  | valN3  |
|col4       | val14  | val24  | val34  | valN4  |
|col5       | val15  | val25  | val35  | valN5  |
+-----------+--------+--------+--------+--------+

I would like to transpose it as shown below:

+------+-------+------+-------+------+
|col1  | col2  |col3  | col4  |col5  |
+------+-------+------+-------+------+
|val11 | val12 |val13 | val14 |val15 |
|val21 | val22 |val23 | val24 |val25 |
|val31 | val32 |val33 | val34 |val35 |
|valN1 | valN2 |valN3 | valN4 |valN5 |
+------+-------+------+-------+------+

Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. — Community, Sep 14 '21 at 08:10

score 0 · Answer 1 · edited Sep 08 '21 at 14:56


Your question isn't very clear, but if your DataFrame isn't too big, you can use the pandas `melt` function:

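df_pandas = df.toPandas()
# Melt every column except the identifier column, however many there are
value_vars = [c for c in df_pandas.columns if c != 'COLUMN_NAME']
df_pandas = df_pandas.melt(id_vars=['COLUMN_NAME'], value_vars=value_vars)
df_spark = spark.createDataFrame(df_pandas)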

If your DataFrame is very large, I'd use the Koalas `melt` function instead, since `toPandas()` collects all the data onto the driver.

Spark SQL also has a `stack` function, which is less intuitive.
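
As a rough illustration of the `stack` route, here is a minimal sketch that unpivots the value columns with `stack` and then pivots on `COLUMN_NAME`. The helper names `build_stack_expr` and `transpose`, and the intermediate column names `source_col` and `value`, are made up for this example. It assumes all value columns share one type (which `stack` requires) and that column names contain no quotes or backticks. Note that Spark's `pivot` always needs an aggregate; `first` is used here only to pick the single value in each cell, so no real aggregation takes place.

```python
def build_stack_expr(value_cols, key_name="source_col", value_name="value"):
    """Build a Spark SQL stack() expression that unpivots value_cols."""
    if not value_cols:
        raise ValueError("value_cols must not be empty")
    # Each column contributes a (label, value) pair, e.g. 'VALUE1', `VALUE1`
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_cols)
    return f"stack({len(value_cols)}, {pairs}) as ({key_name}, {value_name})"


def transpose(df, id_col="COLUMN_NAME"):
    """Transpose a Spark DataFrame whose id_col holds the future column names."""
    from pyspark.sql import functions as F  # lazy import; requires pyspark

    # Works for any number of value columns: everything except id_col
    value_cols = [c for c in df.columns if c != id_col]
    long_df = df.selectExpr(f"`{id_col}`", build_stack_expr(value_cols))
    # pivot() needs an aggregate; first() just picks the one value per cell
    return (
        long_df.groupBy("source_col")
        .pivot(id_col)
        .agg(F.first("value"))
        .orderBy("source_col")
    )
```

Calling `transpose(df).drop("source_col")` on the input above should give one row per original value column, with `col1` to `col5` as the column names. Spark does not guarantee row order, so the sketch sorts by the original column name before that column is dropped.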

edited Sep 08 '21 at 14:56 by Dharman

answered Sep 08 '21 at 14:50 by Assaf Segev

Thanks for your suggestion. The challenge here is passing an arbitrary number of value_vars to the melt command. I was able to use the Scala code from https://stackoverflow.com/a/49403834/10736536 and it worked fine; I would like to convert it to PySpark code. – Abhy Sep 10 '21 at 00:00
The number of value_vars varies. One of my input DataFrames has 5 value_vars and another has 7. That's why I am trying to use a common function that handles any number of value_vars. Hope this helps. Thanks for your help. – Abhy Sep 10 '21 at 12:11
@Abhy why not fetch the columns with `cols = df.columns`, remove the id column, and pass the remaining list to the melt function as the value columns? `cols.remove('COLUMN_NAME')` would leave only the value vars. – Assaf Segev Sep 10 '21 at 19:02
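
The idea in the last comment can be sketched as a PySpark `melt` that works for any number of value columns. This is not a port of the linked Scala answer, which is not reproduced here. It is an illustrative version built on the common array-of-structs plus `explode` pattern; the names `resolve_value_vars`, `melt`, and `_pair` are invented for this example. It assumes the value columns share a compatible type.

```python
def resolve_value_vars(columns, id_vars, value_vars=None):
    """Return the columns to melt: every non-id column unless given explicitly."""
    if value_vars is not None:
        return list(value_vars)
    return [c for c in columns if c not in id_vars]


def melt(df, id_vars, value_vars=None, var_name="variable", value_name="value"):
    """Unpivot a Spark DataFrame from wide to long format."""
    from pyspark.sql import functions as F  # lazy import; requires pyspark

    value_vars = resolve_value_vars(df.columns, id_vars, value_vars)
    # One struct per value column: (column name, column value)
    pairs = F.array(*[
        F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
        for c in value_vars
    ])
    # explode() turns the array into one row per (name, value) pair
    exploded = df.withColumn("_pair", F.explode(pairs))
    return exploded.select(
        *id_vars,
        F.col("_pair")[var_name].alias(var_name),
        F.col("_pair")[value_name].alias(value_name),
    )
```

With this, `melt(df, id_vars=["COLUMN_NAME"])` would handle a DataFrame with 5 value columns and one with 7 in the same way, because the value columns are derived from `df.columns` rather than listed by hand.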
