Here is the input DataFrame:

+-----------+--------+--------+--------+--------+
|COLUMN_NAME| VALUE1 | VALUE2 | VALUE3 | VALUEN |
+-----------+--------+--------+--------+--------+
|col1       | val11  | val21  | val31  | valN1  |
|col2       | val12  | val22  | val32  | valN2  |
|col3       | val13  | val23  | val33  | valN3  |
|col4       | val14  | val24  | val34  | valN4  |
|col5       | val15  | val25  | val35  | valN5  |
+-----------+--------+--------+--------+--------+

I would like to transpose it as shown below:

+------+-------+------+-------+------+
|col1  | col2  |col3  | col4  |col5  |
+------+-------+------+-------+------+
|val11 | val12 |val13 | val14 |val15 |
|val21 | val22 |val23 | val24 |val25 |
|val31 | val32 |val33 | val34 |val35 |
|valN1 | valN2 |valN3 | valN4 |valN5 |
+------+-------+------+-------+------+
Mohana B C
Abhy

1 Answer

Your question isn't very clear, but if your DataFrame isn't too big, you can use the pandas melt function:

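# Collect the Spark DataFrame onto the driver as a pandas DataFrame
df_pandas = df.toPandas()
# Treat every column except the id column as a value column
value_vars = [c for c in df_pandas.columns if c != 'COLUMN_NAME']
df_pandas = df_pandas.melt(id_vars=['COLUMN_NAME'], value_vars=value_vars)
df_spark = spark.createDataFrame(df_pandas)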

If your DataFrame is very large, I'd use the Koalas melt function instead.

Spark also has a stack function, which is less intuitive.
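
For what it's worth, here is a rough sketch of how the stack route could look in PySpark for any number of value columns. The helper names (build_stack_expr, transpose) and the intermediate column names (variable, value) are my own and not part of any library. It assumes all value columns share one data type (cast them first if they don't), and the output row order is not guaranteed.

```python
def build_stack_expr(columns, id_col, var_name="variable", value_name="value"):
    """Build a Spark SQL stack() expression covering every non-id column.

    Hypothetical helper: it only assembles a string, so it works without Spark.
    """
    value_cols = [c for c in columns if c != id_col]
    if not value_cols:
        raise ValueError("no value columns to stack")
    # stack(n, 'name1', col1, 'name2', col2, ...) emits one (name, value) row per pair
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_cols)
    return f"stack({len(value_cols)}, {pairs}) as ({var_name}, {value_name})"


def transpose(df, id_col="COLUMN_NAME"):
    """Unpivot with stack(), then pivot on the id column (assumed approach)."""
    # Imported here so build_stack_expr stays usable without pyspark installed
    from pyspark.sql import functions as F

    long_df = df.selectExpr(f"`{id_col}`", build_stack_expr(df.columns, id_col))
    return (
        long_df.groupBy("variable")   # one output row per original VALUE column
        .pivot(id_col)                # col1, col2, ... become the new columns
        .agg(F.first("value"))
        .drop("variable")
    )
```

Because the expression is derived from df.columns, the same call should handle a DataFrame with 5 value columns or 7 without changes.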

Dharman
Assaf Segev
  • Thanks for your suggestion. The challenge here is passing an arbitrary number of value_vars to the melt command. I was able to use the Scala code from https://stackoverflow.com/a/49403834/10736536 and it worked fine; I would like to convert it to PySpark code. – Abhy Sep 10 '21 at 00:00
  • The number of value_vars varies. One of my input dataframes has 5 value_vars and another dataframe has 7 value_vars. That's why I am trying to use a common function that should handle any number of value_vars. Hope this helps. Thanks for your help. – Abhy Sep 10 '21 at 12:11
  • @Abhy why not fetch the value columns like this: `cols = df.columns`, then remove the id_vars and pass the remaining list to the melt function? `cols.remove('id_vars')` would leave only the value vars. – Assaf Segev Sep 10 '21 at 19:02
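
The suggestion in the last comment can be sketched roughly as below. The names split_value_columns and melt_all are made up for illustration, and the sketch assumes the id column is called COLUMN_NAME as in the question. It takes the full column list and drops the id columns, so the number of value columns never has to be hard-coded.

```python
def split_value_columns(columns, id_vars):
    """Return every column that is not an id column, preserving order.

    Hypothetical helper; raises if an id column is missing from the list.
    """
    id_set = set(id_vars)
    missing = id_set - set(columns)
    if missing:
        raise ValueError(f"id columns not found: {sorted(missing)}")
    return [c for c in columns if c not in id_set]


def melt_all(df_pandas, id_vars=("COLUMN_NAME",)):
    """Melt a pandas DataFrame using all non-id columns as value_vars."""
    id_vars = list(id_vars)
    value_vars = split_value_columns(list(df_pandas.columns), id_vars)
    return df_pandas.melt(id_vars=id_vars, value_vars=value_vars)
```

The same melt_all(df.toPandas()) call would then cover both the 5-column and the 7-column input.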