
I am working on Databricks using Python 2.

I have a PySpark dataframe like:

|Germany|USA|UAE|Turkey|Canada|...
|5      |3  |3  |42    |12    |...

As you can see, it consists of hundreds of columns and only a single row.

I want to flip it in a way such that I get:

Name   | Views
--------------
Germany| 5
USA    | 3 
UAE    | 3
Turkey | 42
Canada | 12

How would I approach this?

Edit: I have hundreds of columns, so I can't write them down. I don't know most of their names; they just exist there. I can't hard-code column names in this process.

Edit 2: Example code:

dicttest = {'Germany': 5, 'USA': 20, 'Turkey': 15}
rdd = sc.parallelize([dicttest]).toDF()  # despite the name, toDF() returns a DataFrame
df = rdd.toPandas().transpose()
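
Continuing this pandas route, the transposed frame could be relabeled into the shape I want (a sketch; pdf and the Name/Views labels are my own choices):

pdf = rdd.toPandas().transpose().reset_index()  # the index holds the original column names
pdf.columns = ['Name', 'Views']
print(pdf)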
  • All those answers depend on the fact that I know my columns. I don't. I have hundreds of them, so I can't write them down. One of the functions in those questions also needed to pass a specific column, but I can't do that. – Madno May 28 '18 at 12:14
  • _I have hundreds of columns so I can't write them down_ - Then iterate over `df.columns`... – Alper t. Turker May 28 '18 at 12:58

2 Answers


This answer might be a bit of overkill, but it does not use Pandas or collect anything to the driver, and it also works when you have multiple rows. We can just pass an empty list of id_vars to the melt function from "How to melt Spark DataFrame?"

A working example would be as follows:

import findspark
findspark.init()
import pyspark as ps
from pyspark.sql import SQLContext, DataFrame
from pyspark.sql.functions import array, col, explode, lit, struct
from typing import Iterable

try:
    sc
except NameError:
    sc = ps.SparkContext()
    sqlContext = SQLContext(sc)

# From https://stackoverflow.com/questions/41670103/how-to-melt-spark-dataframe
def melt(
        df: DataFrame, 
        id_vars: Iterable[str], value_vars: Iterable[str], 
        var_name: str="variable", value_name: str="value") -> DataFrame:
    """Convert :class:`DataFrame` from wide to long format."""

    # Create array<struct<variable: str, value: ...>>
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name)) 
        for c in value_vars))

    # Add to the DataFrame and explode
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))

    cols = id_vars + [
            col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)

# Sample data
df1 = sqlContext.createDataFrame(
    [(0, 1, 2, 3, 4)],
    ("col1", "col2", "col3", "col4", "col5"))
df1.show()
df2 = melt(df1, id_vars=[], value_vars=df1.columns)
df2.show()

Output:

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|   0|   1|   2|   3|   4|
+----+----+----+----+----+

+--------+-----+
|variable|value|
+--------+-----+
|    col1|    0|
|    col2|    1|
|    col3|    2|
|    col4|    3|
|    col5|    4|
+--------+-----+

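To get the exact Name/Views columns from the question, the same call can take var_name and value_name (a sketch, assuming the question's one-row DataFrame is bound to df):

result = melt(df, id_vars=[], value_vars=df.columns,
              var_name="Name", value_name="Views")
result.show()
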
Hope this helps.

  • I am getting a syntax error at the def melt line; does this even work in Python (defining like df: DataFrame, ...)? – Madno May 28 '18 at 13:13
  • I think that feature is [new in Python 3.5](https://stackoverflow.com/questions/32557920/what-are-type-hints-in-python-3-5). Try removing the type hint and see if it works for you. – Florian May 28 '18 at 13:15
  • It didn't work for me. I removed the types of the arguments passed to the function, but then I got a message that melt() was expecting 5 arguments and that I had passed only 3 from my call. – Madno May 28 '18 at 13:25
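
For anyone stuck on Python 2, here is a minimal sketch of the same melt with the Python 3 type annotations stripped; it assumes the same pyspark.sql.functions imports as above and keeps the keyword defaults so a 3-argument call still works:

# Python 2-compatible melt: same logic, no type hints
def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    """Convert a DataFrame from wide to long format."""
    # Build array<struct<variable, value>> over the columns to unpivot
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # Attach the array and explode it into one row per column
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
    cols = list(id_vars) + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)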

You can convert the PySpark DataFrame to a pandas DataFrame and use its transpose function:

%pyspark

dt1 = [[1,2,4,5,6,7]]
dt = sc.parallelize(dt1).toDF() 
dt.show()

Output:

+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6|
+---+---+---+---+---+---+
|  1|  2|  4|  5|  6|  7|
+---+---+---+---+---+---+

dt.toPandas().transpose()

Output:

    0
_1  1
_2  2
_3  4
_4  5
_5  6
_6  7

Other solution

dt2 = [{"1":1,"2":2,"4":4,"5":5,"6":29,"7":8}]
df = sc.parallelize(dt2).toDF() 
df.show()

# Build (name, value) pairs by collecting the single row's value for each column
a = [{"name": i, "value": df.select(i).collect()[0][0]} for i in df.columns]
df1 = sc.parallelize(a).toDF()
df1.show()
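
A variant of the same idea (a sketch) pulls the single row to the driver once with first() instead of issuing one collect per column:

row = df.first()  # fetch the one row once
a = [{"name": c, "value": row[c]} for c in df.columns]
df1 = sc.parallelize(a).toDF()
df1.show()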