
I need to perform a transpose operation on a PySpark dataframe, similar to the pandas transpose. I got the PySpark dataframe below by applying the .summary() operation on my dataframe.

value   col_a       col_b           col_c
count   14.000000   14.000000       14.000000   
mean    9.928571    3189.785714     155210.857143   
std     7.086979    1413.286904     76682.259154    
min     0.000000    0.000000        0.000000    
25%     5.500000    3152.500000     129994.750000   
50%     9.500000    3596.000000     158677.500000   
75%     12.500000   4007.250000     210596.750000   
max     23.000000   4543.000000     256496.000000   
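For context, the summary frame above comes from something like the following (a sketch; df stands in for your original dataframe, and Spark's .summary() may label some metrics slightly differently, e.g. stddev instead of std):

# Compute count, mean, stddev, min, quartiles and max for every numeric column
summary_df = df.summary()
summary_df.show()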

And I want to convert rows into columns and columns into rows, like below:

value   count   mean             std         min 25%        50%      75%         max
col_a   14.0    9.928571        7.086979     0.0  5.50      9.5      12.50      23.0
col_b   14.0    3189.785714     1413.286904  0.0  3152.50   3596.0   4007.25    4543.0
col_c   14.0    155210.857143   76682.259154 0.0  129994.75 158677.5 210596.75  256496.0

Also, the columns before the transpose are not fixed. For the sake of explanation I have taken 3 columns (col_a, col_b, col_c), but in a real scenario it can be up to 10k columns.

In pandas I can achieve the same by doing:

    transformed_df = df.T
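For reference, a minimal pandas sketch of what that transpose produces (the sample frame below is hypothetical, mirroring two of the summary columns; note that value needs to be set as the index before .T so the metric labels become the new column headers):

import pandas as pd

# Hypothetical sample mirroring part of the .summary() output above
summary_pdf = pd.DataFrame({
    'value': ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'],
    'col_a': [14.0, 9.928571, 7.086979, 0.0, 5.5, 9.5, 12.5, 23.0],
    'col_b': [14.0, 3189.785714, 1413.286904, 0.0, 3152.5, 3596.0, 4007.25, 4543.0],
})

# Make the metric labels the index, then transpose so columns become rows
transformed_df = summary_pdf.set_index('value').T
print(transformed_df)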
  • Does this answer your question? https://stackoverflow.com/questions/46453058/dataframe-transpose-with-pyspark-in-apache-spark – mck Apr 20 '21 at 09:01
  • My dataframe is too huge to convert into pandas; it would take too much time and resources to do so. – vishwajeet Mane Apr 20 '21 at 09:08
  • Why are you needing the transpose - I wonder if you can keep the data as is – doctorlove Apr 20 '21 at 09:30
  • I want to store it in a Hive table. As I mentioned, the number of columns before the transpose is around 10K. That's why I want to transpose it first and store it in Hive with 8 columns only. @doctorlove – vishwajeet Mane Apr 20 '21 at 09:35
  • Might be some ideas in here: https://stackoverflow.com/questions/36215755/how-to-transpose-dataframe-in-spark-1-5-no-pivot-operator-available – doctorlove Apr 20 '21 at 09:38
  • 10K columns and 8 rows isn't really that large. I think you can do it in pandas? – mck Apr 20 '21 at 10:54
  • After this, I have to perform many more operations on the converted frame, and Spark-to-pandas is time consuming and will affect driver memory. @mck – vishwajeet Mane Apr 20 '21 at 11:06
  • Have you tried using pivot https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=pivot#pyspark.sql.GroupedData.pivot – NGrech Apr 20 '21 at 11:48
  • Pivot won't work here. @NGrech – vishwajeet Mane Apr 20 '21 at 12:12

1 Answer


Since pivot is not an option according to your comment, my alternative solution is a bit lengthy:

from pyspark.sql import functions as F

cols = ['col_a', 'col_b', 'col_c']
metrics = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

(df
    # pair each column name with its value in the current metric row
    .withColumn('keys', F.array([F.lit(c) for c in cols]))
    .withColumn('values', F.array([F.col(c) for c in cols]))
    .withColumn('maps', F.map_from_arrays('keys', 'values'))
    # explode into one row per (metric, column, value)
    .select(F.col('value').alias('metric'), F.explode('maps').alias('col', 'value'))
    # regroup by column, collecting metric names and values back into a map
    .groupBy('col')
    .agg(
        F.collect_list('metric').alias('keys'),
        F.collect_list('value').alias('values')
    )
    .withColumn('maps', F.map_from_arrays('keys', 'values'))
    # pull each metric out of the map as its own column
    .select(['col'] + [f'maps.{c}' for c in metrics])
    .show(10, False)
)

# Output
+-----+-----+-------------+------------+---+---------+--------+---------+--------+
|col  |count|mean         |std         |min|25%      |50%     |75%      |max     |
+-----+-----+-------------+------------+---+---------+--------+---------+--------+
|col_b|14.0 |3189.785714  |1413.286904 |0.0|3152.5   |3596.0  |4007.25  |4543.0  |
|col_c|14.0 |155210.857143|76682.259154|0.0|129994.75|158677.5|210596.75|256496.0|
|col_a|14.0 |9.928571     |7.086979    |0.0|5.5      |9.5     |12.5     |23.0    |
+-----+-----+-------------+------------+---+---------+--------+---------+--------+
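Since the columns before the transpose aren't fixed in your case, you don't have to hard-code cols or metrics either; a sketch, assuming the metric label column in your summary frame is named value as in your example:

# Every column except the metric label column is a value column
cols = [c for c in df.columns if c != 'value']
# The metric names are the entries of the label column (row order is preserved by collect)
metrics = [row['value'] for row in df.select('value').collect()]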