Convert a pandas dataframe to a PySpark dataframe

Question

I have a script with the below setup.

I am using:

1) Spark dataframes to pull data in
2) Converting to pandas dataframes after initial aggregation
3) Want to convert back to Spark for writing to HDFS

The conversion from Spark --> Pandas was simple, but I am struggling with how to convert a Pandas dataframe back to Spark.

Can you advise?

from pyspark.sql import SparkSession
import pyspark.sql.functions as sqlfunc
from pyspark.sql.types import *
import argparse, sys
from pyspark.sql import *
import pandas as pd

def create_session(appname):
    spark_session = SparkSession\
        .builder\
        .appName(appname)\
        .master('yarn')\
        .config("hive.metastore.uris", "thrift://uds-far-mn1.dab.02.net:9083")\
        .enableHiveSupport()\
        .getOrCreate()
    return spark_session
### START MAIN ###
if __name__ == '__main__':
    spark_session = create_session('testing_files')

I've tried the below - no errors, just no data! To confirm, df6 does have data & is a pandas dataframe

# ascending expects booleans, not the string "true"
df6 = df5.sort_values(['sdsf'], ascending=[True])
sdf = spark_session.createDataFrame(df6)
sdf.show()

Thanks Pault - unfortunately, that solution doesn't work - I've added the attempted & failed code at the bottom. I'm not entirely sure what the issue is — kikee1222, Oct 23 '18 at 18:23

score 56 · Accepted Answer · edited Mar 07 '19 at 03:42

Here we go:

# Spark to Pandas
df_pd = df.toPandas()

# Pandas to Spark
df_sp = spark_session.createDataFrame(df_pd)
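
Since the question reports "no errors, just no data", it may help to check row counts on both sides of the conversion. The following is only a minimal sketch, not a diagnosis of the asker's script: the helper name `pandas_to_spark_checked`, the `run_example` function, the sample data, and the `local[1]` master are all assumptions made for illustration (the question itself runs on yarn).

```python
def pandas_to_spark_checked(spark, pdf):
    """Convert a pandas DataFrame to a Spark DataFrame and verify the row count.

    `spark` is a SparkSession and `pdf` is a pandas DataFrame.
    (Hypothetical helper, not part of PySpark.)
    """
    n_rows = len(pdf)
    if n_rows == 0:
        # Fail loudly instead of silently producing an empty Spark DataFrame.
        raise ValueError("pandas DataFrame is empty; nothing to convert")

    sdf = spark.createDataFrame(pdf)

    # count() is an action, so it forces Spark to actually evaluate the data.
    n_spark = sdf.count()
    if n_spark != n_rows:
        raise RuntimeError(
            f"row count changed: pandas={n_rows}, spark={n_spark}"
        )
    return sdf


def run_example():
    # Imported here so the helper above can be loaded without Spark installed.
    import pandas as pd
    from pyspark.sql import SparkSession

    # Assumption: a local session for a quick test; the question uses yarn.
    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("pandas_to_spark_demo")
        .getOrCreate()
    )

    # Made-up sample data; 'sdsf' is the sort column named in the question.
    pdf = pd.DataFrame({"sdsf": [3, 1, 2], "label": ["c", "a", "b"]})
    pdf = pdf.sort_values(["sdsf"], ascending=True)

    sdf = pandas_to_spark_checked(spark, pdf)
    sdf.show()
    spark.stop()
```

Calling `run_example()` in an environment with both pandas and PySpark installed should print the three sample rows. If the helper raises on the real data, that at least tells you whether the rows were already missing on the pandas side or went missing during the conversion.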

answered Oct 23 '18 at 13:05 · Andrea
edited Mar 07 '19 at 03:42 · Oran

Thanks for your reply. I've edited the post to show trying this - it doesn't error, but it doesn't provide any output – kikee1222 Oct 23 '18 at 18:16

For those who want to read more, [this Medium article](https://medium.com/hashmapinc/5-steps-to-converting-python-jobs-to-pyspark-4b9988ad027a) is useful. – Safwan Mar 25 '21 at 11:59
