
I am creating a temporary view of a JDBC query in PySpark 2.4. My data source is MS SQL Server 2017.

    df = (spark.read.format("jdbc")
          .option("url", "url")
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .option("dbtable", "dbtable")
          .load())
    df.createOrReplaceTempView("df_temp")

Now I can query the temporary view I created:

    df_new = spark.sql("select * from df_temp where ...")  # standard query

Now I want to write df_new as a CSV file on my local drive. One way (and what I currently do) is to convert it with df_new.toPandas() and then save it with the standard df.to_csv('/path/file.csv') method. Writing directly from Spark instead creates a folder named file.csv, and inside that folder a CSV file is generated with a name like part-00000-fd4c62bd-f208-4bd3-ae99-f81338b9ede1-c000.csv.
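
For reference, a minimal sketch of the current pandas round-trip (the path is a placeholder); note that pandas' to_csv can itself append across runs with mode='a':

    import os

    # Pull the Spark result down into a pandas DataFrame
    pdf = df_new.toPandas()

    # '/path/file.csv' is a placeholder; mode='a' appends on later runs,
    # and the header is written only on the first run.
    out_path = '/path/file.csv'
    pdf.to_csv(out_path, mode='a', header=not os.path.exists(out_path), index=False)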

So if I run my .py file on a daily basis (using a scheduler like crontab), this is definitely not a good choice.

Questions:

  1. How can I standardize the name of the .csv file so that I can append new data to the same file on a daily basis?

  2. Is there any easy way to convert df_new to a table or Spark DataFrame, so that I can apply df_new.coalesce(1).write.option("header", "true").csv('/path/myfile.csv')?

  • `df_new` is a data frame and therefore you can do `coalesce(1)` and save to the local path. – Nikhil Baby Apr 01 '19 at 06:26
  • I am getting an error: `'NoneType' object has no attribute 'write'`. – pythondumb Apr 01 '19 at 07:31
  • @pythondumb, you can only specify the folder name that you will write the files to; you cannot control the name of the CSV file you are creating. Refer to this link: https://stackoverflow.com/questions/41990086/specifying-the-filename-when-saving-a-dataframe-as-a-csv – Gladiator Apr 01 '19 at 07:34
  • @Gladiator: Assuming I use the Python equivalent techniques (as mentioned here: https://stackoverflow.com/questions/48652050/exporting-spark-dataframe-to-csv-with-header-and-specific-filename), how can I append new data to the same file on a daily basis? – pythondumb Apr 01 '19 at 07:47
  • I am not sure why exactly you want a single file, since it hurts parallelism and is not best practice. However, there are many techniques. You could try: 1. move all current data to a temp location; 2. read from the temp location, coalesce(1), and write to the destination. – Gladiator Apr 01 '19 at 07:59
  • @Gladiator: If I do not have a single file, then how can I append on a daily basis? – pythondumb Apr 01 '19 at 09:06
  • You can use `mode='append'` (see the sketch after these comments). Refer to this link: https://stackoverflow.com/questions/41217724/how-to-append-to-a-csv-file-using-df-write-csv-in-pyspark – Gladiator Apr 01 '19 at 09:54
  • @Gladiator: Thanks!! Will try this as well. – pythondumb Apr 01 '19 at 12:49
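
For reference, a minimal sketch of the mode='append' approach from the comments above (the output directory is a placeholder; each daily run adds new part files under the same directory):

    # Append today's rows to the same output directory; each run adds
    # new part files rather than rewriting existing ones. Note that with
    # header=true, every part file carries its own header line.
    (df_new.write
        .mode("append")
        .option("header", "true")
        .csv("/path/output_dir"))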

1 Answer


Use this to save to the local filesystem:

    # For output in multiple files:
    df.write.option("header", "true").csv("/path/output.csv")

    # For output in a single file:
    df.coalesce(1).write.option("header", "true").csv("/path/output.csv")
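
Note that in both cases Spark creates the given path as a directory of part files. If a single file with a stable name is needed (question 1), one workaround is to write to a temporary directory and then rename the part file; a sketch, assuming a local filesystem and placeholder paths:

    import glob
    import shutil

    tmp_dir = "/path/output_tmp"      # placeholder temp directory
    final_path = "/path/myfile.csv"   # placeholder stable filename

    # coalesce(1) produces exactly one part file inside tmp_dir
    df.coalesce(1).write.mode("overwrite").option("header", "true").csv(tmp_dir)

    # Move the single part file to the stable name and clean up
    part_file = glob.glob(tmp_dir + "/part-*.csv")[0]
    shutil.move(part_file, final_path)
    shutil.rmtree(tmp_dir)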