I am creating a temporary view of a JDBC query in PySpark 2.4. My data source is MS SQL Server 2017.
df = (spark.read.format("jdbc")
      .option("url", "url")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("dbtable", "dbtable")
      .load())
df.createOrReplaceTempView("df_temp")
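(For what it's worth, the same read can also be expressed with the DataFrameReader.jdbc shorthand; url and dbtable are the same placeholders as above:)

df = spark.read.jdbc(
    url="url",
    table="dbtable",
    properties={"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"},
)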
Now I can query the temporary view:
df_new = spark.sql("select * from df_temp where ...")  # standard query
Now I want to write df_new as a CSV file on my local drive. One way (and what I am currently doing) is to call df_new.toPandas() and then save the result with the standard pandas df.to_csv('/path/file.csv') method. Writing with Spark's own CSV writer instead creates a folder named file.csv, and inside this folder a CSV file is generated with a name like part-00000-fd4c62bd-f208-4bd3-ae99-f81338b9ede1-c000.csv.
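To make the two write paths concrete (a sketch; the paths are placeholders):

# Current approach: collect everything to the driver, then write one file with pandas.
# This gives a single, predictably named file, but the full result must fit in driver memory.
df_new.toPandas().to_csv('/path/file.csv', index=False)

# Spark's native writer: this is what produces the folder of part files.
df_new.write.option("header", "true").csv('/path/file.csv')
# -> creates the directory /path/file.csv containing part-00000-<uuid>-c000.csv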
So if I run my .py file on a daily basis (using a scheduler like crontab), this is definitely not a good choice.
Questions:
How can I standardize the name of the .csv file so that on a daily basis I can append new data to the same file?
Is there any easy way to convert df_new to a table or Spark DataFrame so that I can apply df_new.coalesce(1).write.option("header", "true").csv('/path/myfile.csv')?
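In other words, on each daily run I would like to end up with something like the following (a sketch; the path is a placeholder, and mode("append") is my guess at how to accumulate the daily data):

(df_new
    .coalesce(1)               # collapse to a single partition -> one part file
    .write
    .mode("append")            # assumption: append new data on each daily run
    .option("header", "true")
    .csv('/path/myfile.csv'))  # note: Spark still writes this path as a directory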