I have a script with the below setup.
I am using:

1) Spark dataframes to pull data in
2) Converting to pandas dataframes after the initial aggregation
3) Want to convert back to Spark for writing to HDFS
The conversion from Spark --> pandas was simple, but I am struggling with how to convert a pandas dataframe back to Spark.
Can you advise?
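For context, the Spark --> pandas direction was straightforward via toPandas(). A rough sketch of that step (the query and grouping column here are placeholders, not my real names):

df_spark = spark_session.sql('SELECT * FROM some_db.some_table')  # placeholder query
df5 = df_spark.groupBy('sdsf').count().toPandas()                 # aggregate in Spark, hand off to pandas

Here is the setup: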
from pyspark.sql import SparkSession
import pyspark.sql.functions as sqlfunc
from pyspark.sql.types import *
import argparse, sys
import pandas as pd
def create_session(appname):
    spark_session = SparkSession\
        .builder\
        .appName(appname)\
        .master('yarn')\
        .config("hive.metastore.uris", "thrift://uds-far-mn1.dab.02.net:9083")\
        .enableHiveSupport()\
        .getOrCreate()
    return spark_session
### START MAIN ###
if __name__ == '__main__':
    spark_session = create_session('testing_files')
I've tried the below: no errors, but also no data! To confirm, df6 does have data and is a pandas dataframe.
df6 = df5.sort_values(['sdsf'], ascending=[True])  # ascending expects booleans, not the string "true"
sdf = spark_session.createDataFrame(df6)
sdf.show()
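In case it helps an answer, this is the schema-pinned variant I would try next. It is only a sketch: StringType for 'sdsf' and the HDFS output path are assumptions, and the schema would need one field per column of df6.

from pyspark.sql.types import StructType, StructField, StringType

# Pin the schema instead of letting Spark infer it from the pandas dtypes.
# One StructField per column of df6; StringType for 'sdsf' is an assumption.
schema = StructType([StructField('sdsf', StringType(), True)])
sdf = spark_session.createDataFrame(df6, schema=schema)
sdf.show()

# On Spark 2.3+, Arrow can speed up the pandas <-> Spark conversion:
# spark_session.conf.set('spark.sql.execution.arrow.enabled', 'true')

# End goal: write the result back to HDFS (path is a placeholder):
sdf.write.mode('overwrite').parquet('hdfs:///tmp/testing_files_output')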