When using partitionBy() in PySpark, what approach should I follow to write the CSV output into a single folder rather than multiple folders? Any suggested solution?
Code
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark import SparkConf
appName = "PySpark Teradata Example"
master = "local"
conf = SparkConf() # create the configuration
conf.set("spark.repl.local.jars", "terajdbc4.jar")
conf.set("spark.executor.extraClassPath", "terajdbc4.jar")
conf.set("spark.driver.extraClassPath", "terajdbc4.jar")
spark = SparkSession.builder \
    .config(conf=conf) \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
# input table name
table = "my_table_1"
df = spark.read \
    .format('jdbc') \
    .option('url', 'jdbc:teradata://xxx.xxx.xx.xx') \
    .option('user', 'dbc') \
    .option('password', 'dbc') \
    .option('driver', 'com.teradata.jdbc.TeraDriver') \
    .option('STRICT_NAMES', 'OFF') \
    .option('query', "Select eno, CAST(edata.asJSONText() AS VARCHAR(32000)) as edata from AdventureWorksDW." + table) \
    .load()
df.show()
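# bucket rows into 4 groups (first column modulo 4), used as the partition key below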
df = df.withColumn("id_tmp", F.col(df.columns[0]) % 4).orderBy("id_tmp")
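# partitionBy creates one subdirectory per id_tmp value under the target folder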
df.coalesce(4) \
    .write \
    .option("header", True) \
    .mode("overwrite") \
    .partitionBy("id_tmp") \
    .option("sep", "|") \
    .format("csv") \
    .save("C:\\Data\\" + table + "\\")
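This produces a layout like the following (the part-file names are illustrative; Spark generates its own unique suffixes):

C:\Data\my_table_1\id_tmp=0\part-00000-xxxx.csv
C:\Data\my_table_1\id_tmp=1\part-00000-xxxx.csv
C:\Data\my_table_1\id_tmp=2\part-00000-xxxx.csv
C:\Data\my_table_1\id_tmp=3\part-00000-xxxx.csv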
So I get multiple folders, each containing its own CSV file. How can I write everything into one single folder instead? Also, how can I change the name of the file while writing it to the folder?
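The closest workaround I have found is to drop partitionBy(), coalesce to a single partition, and rename the lone part file afterwards, roughly as in the sketch below (the glob pattern and the target file name are my own choices; the only thing I am relying on is Spark's part-*.csv naming). Is there a more idiomatic way?

import glob
import os
import shutil

out_dir = "C:\\Data\\" + table + "\\"

# single partition => one part file, and no partition subfolders
df.coalesce(1) \
    .write \
    .option("header", True) \
    .option("sep", "|") \
    .mode("overwrite") \
    .format("csv") \
    .save(out_dir)

# Spark picks the file name (part-00000-<uuid>.csv); rename it after the write
part_file = glob.glob(os.path.join(out_dir, "part-*.csv"))[0]
shutil.move(part_file, os.path.join(out_dir, table + ".csv"))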