
I am working on Databricks using Python 2.

I have a PySpark dataframe like:

|Germany|USA|UAE|Turkey|Canada|...
|5      |3  |3  |42    |12    |...

As you can see, it consists of hundreds of columns and only a single row.

I want to flip it in a way such that I get:

Name   | Views
--------------
Germany| 5
USA    | 3 
UAE    | 3
Turkey | 42
Canada | 12

How would I approach this?

Edit: I have hundreds of columns, so I can't write them down. I don't know most of their names; they just exist there. I can't hard-code column names in this process.

Edit 2: Example code:

dicttest = {'Germany': 5, 'USA': 20, 'Turkey': 15}
rdd = sc.parallelize([dicttest]).toDF()  # despite the name, toDF() returns a DataFrame
df = rdd.toPandas().transpose()
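
Continuing this pandas route, the transposed frame could be relabeled into the shape I want (a sketch; pdf and the Name/Views labels are my own choices):

pdf = rdd.toPandas().transpose().reset_index()  # the index holds the original column names
pdf.columns = ['Name', 'Views']
print(pdf)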
  • All those answers depend on the fact that I know my columns. I don't. I have hundreds of them, so I can't write them down. One of the functions in those questions also needed to pass a specific column, but I can't do that. – Madno May 28 '18 at 12:14
  • _I have hundreds of columns so I can't write them down_ - Then iterate over `df.columns`... – Alper t. Turker May 28 '18 at 12:58

2 Answers


This answer might be a bit of overkill, but it does not use Pandas or collect anything to the driver, and it also works when you have multiple rows. We can just pass an empty list of id_vars to the melt function from "How to melt Spark DataFrame?"

A working example would be as follows:

import findspark
findspark.init()
import pyspark as ps
from pyspark.sql import SQLContext, DataFrame
from pyspark.sql.functions import array, col, explode, lit, struct
from typing import Iterable

try:
    sc
except NameError:
    sc = ps.SparkContext()
    sqlContext = SQLContext(sc)

# From https://stackoverflow.com/questions/41670103/how-to-melt-spark-dataframe
def melt(
        df: DataFrame, 
        id_vars: Iterable[str], value_vars: Iterable[str], 
        var_name: str="variable", value_name: str="value") -> DataFrame:
    """Convert :class:`DataFrame` from wide to long format."""

    # Create array<struct<variable: str, value: ...>>
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name)) 
        for c in value_vars))

    # Add to the DataFrame and explode
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))

    cols = id_vars + [
            col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)

# Sample data
df1 = sqlContext.createDataFrame(
    [(0, 1, 2, 3, 4)],
    ("col1", "col2", "col3", "col4", "col5"))
df1.show()
df2 = melt(df1, id_vars=[], value_vars=df1.columns)
df2.show()

Output:

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|   0|   1|   2|   3|   4|
+----+----+----+----+----+

+--------+-----+
|variable|value|
+--------+-----+
|    col1|    0|
|    col2|    1|
|    col3|    2|
|    col4|    3|
|    col5|    4|
+--------+-----+

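To get the exact Name/Views columns from the question, the same call can take var_name and value_name (a sketch, assuming the question's one-row DataFrame is bound to df):

result = melt(df, id_vars=[], value_vars=df.columns,
              var_name="Name", value_name="Views")
result.show()
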
Hope this helps.

  • I am getting a syntax error at the def melt line; does this even work in Python (defining like df: DataFrame, ...)? – Madno May 28 '18 at 13:13
  • I think that feature is [new in Python 3.5](https://stackoverflow.com/questions/32557920/what-are-type-hints-in-python-3-5). Try removing the type hint and see if it works for you. – Florian May 28 '18 at 13:15
  • It didn't work for me. I removed the types of the arguments passed to the function, but then I got a message that melt() was expecting 5 arguments and that I had passed only 3 from my call. – Madno May 28 '18 at 13:25
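
For anyone stuck on Python 2, here is a minimal sketch of the same melt with the Python 3 type annotations stripped; it assumes the same pyspark.sql.functions imports as above and keeps the keyword defaults so a 3-argument call still works:

# Python 2-compatible melt: same logic, no type hints
def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    """Convert a DataFrame from wide to long format."""
    # Build array<struct<variable, value>> over the columns to unpivot
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # Attach the array and explode it into one row per column
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
    cols = list(id_vars) + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)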

You can convert the PySpark DataFrame to a pandas DataFrame and use its transpose function:

%pyspark

dt1 = [[1,2,4,5,6,7]]
dt = sc.parallelize(dt1).toDF() 
dt.show()

Output:

+---+---+---+---+---+---+
| _1| _2| _3| _4| _5| _6|
+---+---+---+---+---+---+
|  1|  2|  4|  5|  6|  7|
+---+---+---+---+---+---+

dt.toPandas().transpose()

Output:

    0
_1  1
_2  2
_3  4
_4  5
_5  6
_6  7

Other solution

dt2 = [{"1":1,"2":2,"4":4,"5":5,"6":29,"7":8}]
df = sc.parallelize(dt2).toDF() 
df.show()

# Build (name, value) pairs by collecting the single row's value for each column
a = [{"name": i, "value": df.select(i).collect()[0][0]} for i in df.columns]
df1 = sc.parallelize(a).toDF()
df1.show()
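
A variant of the same idea (a sketch) pulls the single row to the driver once with first() instead of issuing one collect per column:

row = df.first()  # fetch the one row once
a = [{"name": c, "value": row[c]} for c in df.columns]
df1 = sc.parallelize(a).toDF()
df1.show()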