
My requirement is to write only the header record of a CSV file using a Spark Scala DataFrame. Can anyone help me with this?

val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/" 
val sc = sparkFile.sparkContext
val outDF = csvDF.select("col_01", "col_02", "col_03").schema
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(s"$OHead1")

The above works and creates the header in the CSV with a tab delimiter. Since I am using a Spark session, I create the sparkContext in the second line. csvDF is my dataframe, created before these statements.
Two things are outstanding; can one of you help me?

1. The above working code does not overwrite the output files, so every time I have to delete them manually. I could not find an overwrite option; can you help me?
2. Since I am doing a select and then reading the schema, will that be considered an action and start another lineage for this statement? If so, it would degrade performance.
Revathi P

3 Answers


If you need to output only the header, you can use this code:

df.schema.fieldNames.reduce(_ + "," + _)

It will create a CSV line with the names of the columns.

Vladislav Varslavans
  • I am able to print this output, but I want it written to a CSV file, and as a string it has no option to be written to a file. Also, I am selecting only 3 columns out of 10 from the dataframe for my output. – Revathi P Jun 07 '18 at 18:10
  • If you `select` only 3 columns, you don't need to get them from the dataframe schema. You can just write those names to the file. [Here](https://stackoverflow.com/questions/6879427/scala-write-string-to-file-in-one-statement) are examples of how to write to a text file. – Vladislav Varslavans Jun 08 '18 at 06:57
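Combining the two comments above, a minimal driver-side sketch (the column names and output path here are hypothetical stand-ins) that joins the selected column names with a tab and writes them to a local file. Because `PrintWriter` truncates an existing file, re-running it also overwrites the previous header:

```scala
import java.io.{File, PrintWriter}

// Hypothetical stand-ins for the three selected columns.
val headerCols = Seq("col_01", "col_02", "col_03")
val headerLine = headerCols.mkString("\t")

// PrintWriter truncates an existing file, so re-runs overwrite the old header.
val out = new PrintWriter(new File("/tmp/OHead1.csv"))
try out.println(headerLine)
finally out.close()
```

This avoids a Spark job entirely, which is reasonable when the header is a single line that fits comfortably on the driver.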
I tested, and the solution below did not noticeably affect performance.

val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/" 
val sc = sparkFile.sparkContext
val outDF = csvDF.select("col_01", "col_02", "col_03").schema
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(s"$OHead1")
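On the first open point from the question (`saveAsTextFile` has no overwrite mode), one common workaround is to delete the target path through the Hadoop `FileSystem` API before writing. A sketch, untested here, that reuses the names from the answer above:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val OHead1 = "/xxxxx/xxxx/xxxx/xxx/OHead1/"
val sc = sparkFile.sparkContext

// Remove any previous output so saveAsTextFile can recreate the directory.
// delete(path, true) is recursive and simply returns false if the path is absent.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path(OHead1), true)

// .schema is resolved by the analyzer without running a job, so this select
// is not an action and does not start an extra computation over the data.
val outDF = csvDF.select("col_01", "col_02", "col_03").schema
sc.parallelize(Seq(outDF.fieldNames.mkString("\t"))).coalesce(1).saveAsTextFile(OHead1)
```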
Revathi P

I found a solution to handle this situation: define the columns in the configuration file and write those columns to a file. Here is the snippet.

val Header = prop.getProperty("OUT_HEADER_COLUMNS").replaceAll("\"","").replaceAll(",","\t")
scala.tools.nsc.io.File(s"$HeadOPath").writeAll(s"$Header")
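Note that `scala.tools.nsc.io.File` is a compiler-internal API and may change between Scala versions. A sketch of the same idea using only the JDK, where the property value is a hypothetical stand-in loaded inline instead of from the real configuration file:

```scala
import java.io.StringReader
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import java.util.Properties

// Hypothetical properties content standing in for the real configuration file.
val prop = new Properties()
prop.load(new StringReader("""OUT_HEADER_COLUMNS="col_01","col_02","col_03""""))

// Strip the quotes and turn the comma-separated list into a tab-delimited line.
val header = prop.getProperty("OUT_HEADER_COLUMNS")
  .replaceAll("\"", "")
  .replaceAll(",", "\t")

// Files.write truncates an existing file by default, so re-runs overwrite it.
Files.write(Paths.get("/tmp/header.csv"), header.getBytes(StandardCharsets.UTF_8))
```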
Revathi P
  • Both solutions work. I would recommend the first one (picking the columns from the schema), with no hard-coded values in the property file. – Revathi P Jun 26 '18 at 15:43