
The following is my source data:

+-----+----------+
|Name |Date      |
+-----+----------+
|Azure|2018-07-26|
|AWS  |2018-07-27|
|GCP  |2018-07-28|
|GCP  |2018-07-28|
+-----+----------+

I have partitioned the data using the Date column:

udl_file_df_read.write.format("csv").partitionBy("Date").mode("append").save(outputPath)

val events = spark.read.format("com.databricks.spark.csv").option("inferSchema","true").load(outputPath)

events.show()

The output column names are (c0, Date). Why is the original column name missing, and how do I retain the column names?

Note: this is not a duplicate question, for the following reasons: here the columns other than the partition column are renamed to c0, and specifying basePath as a read option doesn't work.
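For reference, the basePath attempt mentioned above presumably looked something like the sketch below (an assumption; the exact call is not in the question). basePath only controls partition discovery, so it recovers the Date column from the directory names but does nothing for the data column names:

// Assumed attempt: basePath helps partition discovery, not header names.
val events = spark.read
  .format("csv")
  .option("basePath", outputPath)
  .load(outputPath)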


2 Answers


You get column names like c0 because the CSV format, as used in the question, doesn't preserve column names: no header row is written or read unless you ask for one.

You can try writing with

udl_file_df_read
  .write
  .option("header", "true")
  ...

and similarly read

spark
  .read
  .option("header", "true")

  • Thanks, but my requirement is that the output format must be CSV. Is there another way to retain the column names when using CSV as the format with partitioning? – prady Jul 31 '18 at 12:37

I was able to retain the schema by setting the header option to true when writing the file; I had earlier thought this option could only be used to read data.

udl_file_df_read.write.option("header", "true").format("csv").partitionBy("Date").mode("append").save(outputPath)
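The same option is also needed when reading the data back, otherwise the header line in each file is treated as an ordinary data row; a sketch reusing the question's read:

val events = spark.read
  .format("csv")
  .option("header", "true")       // without this, the header row comes back as data
  .option("inferSchema", "true")
  .load(outputPath)

events.show()  // shows the original Name and Date column names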
