
I have a dataframe `df` that has the following structure:

+-----+-----+-----+-------+
|  s  |col_1|col_2|col_...|
+-----+-----+-----+-------+
| f1  |  0.0|  0.6|  ...  |
| f2  |  0.6|  0.7|  ...  |
| f3  |  0.5|  0.9|  ...  |
|  ...|  ...|  ...|  ...  |
+-----+-----+-----+-------+

I want to compute the transpose of this dataframe so that it looks like:

+-------+-----+-----+-------+------+
|  s    | f1  | f2  | f3    |   ...|
+-------+-----+-----+-------+------+
|col_1  |  0.0|  0.6|  0.5  |   ...|
|col_2  |  0.6|  0.7|  0.9  |   ...|
|col_...|  ...|  ...|  ...  |   ...|
+-------+-----+-----+-------+------+

I tried these two solutions, but both fail with an error saying that the dataframe does not have the method I used:

Method 1:

for x in df.columns:
    df = df.pivot(x)

Method 2:

df = sc.parallelize([(k,) + tuple(v[0:]) for k, v in df.items()]).toDF()

How can I fix this?

– Mehdi Ben Hamida

2 Answers


If the data is small enough to be transposed (as opposed to pivoted with aggregation), you can just convert it to a Pandas DataFrame:

# example data
df = sc.parallelize([
    ("f1", 0.0, 0.6, 0.5),
    ("f2", 0.6, 0.7, 0.9)]).toDF(["s", "col_1", "col_2", "col_3"])

# collect to the driver and transpose with Pandas
df.toPandas().set_index("s").transpose()
s       f1   f2
col_1  0.0  0.6
col_2  0.6  0.7
col_3  0.5  0.9

If it is too large for this, Spark won't help. A Spark DataFrame distributes data by row (although it locally uses columnar storage), so the size of an individual row is limited by local memory.

– Alper t. Turker
  • You might want to reset the index before converting back to a Spark DataFrame so that the column names are not lost; you can do this with `reset_index()`, e.g. df.toPandas().set_index("s").transpose().reset_index() – lfvv Jun 11 '18 at 20:00
  • How can we convert the result back to a Spark DataFrame? – Aspirant Aug 02 '18 at 11:07
  • 2
    @Aspirant `spark.createDataFrame(result)` – Alper t. Turker Aug 02 '18 at 11:08
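
Putting the answer and the comments together, a minimal round-trip sketch (assuming a SparkSession named `spark`, as in a standard PySpark shell; the rename is needed because `reset_index()` calls the restored column "index"):

# transpose on the driver with Pandas
pdf_t = df.toPandas().set_index("s").transpose()

# move the former column names back into a regular column (per lfvv's
# comment) and rename it to "s" to match the desired layout
pdf_t = pdf_t.reset_index().rename(columns={"index": "s"})

# back to a Spark DataFrame (per the comment above)
df_t = spark.createDataFrame(pdf_t)
df_t.show()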

You can try Koalas by Databricks. Koalas is similar to Pandas but made for distributed processing, and it is available for PySpark (at least from 3.0.0):

# Spark DataFrame -> Koalas DataFrame
kdf = df.to_koalas()
# distributed transpose
kdf_t = kdf.transpose()
# Koalas DataFrame -> Spark DataFrame
df_T = kdf_t.to_spark()

Edit: to use Koalas efficiently you need to define partitions; otherwise there can be serious performance degradation.
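
Since Spark 3.2, Koalas has been merged into PySpark itself as the pandas API on Spark, so the same idea works without the extra package. A minimal sketch, assuming Spark >= 3.3 (where `DataFrame.pandas_api()` is available); note that pandas-on-Spark caps `transpose()` at the `compute.max_rows` option (1000 by default):

import pyspark.pandas as ps

# Spark DataFrame -> pandas-on-Spark DataFrame
psdf = df.pandas_api()

# transpose() raises above compute.max_rows, so raise the cap
# if the frame has more rows than the default of 1000
ps.set_option("compute.max_rows", 2000)

psdf_t = psdf.set_index("s").transpose()

# back to a Spark DataFrame; keep the former column names
# as a column called "s"
df_t = psdf_t.to_spark(index_col="s")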

– s510