
I'm new to Scala/Spark programming and I need to save a DataFrame as an XML file. I get the DataFrame from an HQL (Hive) query.

It is a simple DataFrame (no arrays or other complex types).

I have already done some research and found the spark-xml library, but it seems that this library doesn't work for this problem.

  • I have not used the library before, but a quick look at the [**README**](https://github.com/databricks/spark-xml#features) shows that it can _"write"_ `DataFrames` as **XML** files to either a distributed or a local filesystem. So why do you say it does not work _"for this problem"_? What is your problem? Have you already tried it? Did your code fail to compile? Was the output file not what you expected? Did it fail at runtime? If any of these are true, please edit your question to give a detailed explanation, including concrete error messages. – Luis Miguel Mejía Suárez Mar 10 '19 at 01:53
  • I have a DataFrame and I just want to convert it to an XML file, but the code below creates a folder with 4 files. I just want a single file with the XML extension: `selectedData.write.option("rootTag", "books").option("rowTag", "book").xml("newbooks.xml")` – William Spader Mar 10 '19 at 03:34
  • [This answer](https://stackoverflow.com/questions/31674530/write-single-csv-file-using-spark-csv) gives a good explanation of why it created a folder with four files _(the original question was about a CSV file, but the internals are the same)_. TL;DR: **Spark** is a distributed framework, so writing just _"one file"_ does not make sense in a real _(production)_ use case. However, if this is just for learning or testing, or if the final `DataFrame` is a compiled report that you are sure will be small, you can tell Spark to merge all partitions of your df into one. – Luis Miguel Mejía Suárez Mar 10 '19 at 03:48
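
The suggestion in the last comment, merging all partitions into one before writing, could look roughly like the sketch below. It assumes spark-xml is on the classpath; `writeSingleXml` and `outputDir` are hypothetical names used only for illustration, while the `rootTag`/`rowTag` values are the ones from the question.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: collapse the DataFrame to a single partition so that
// Spark writes one part file instead of one file per partition.
def writeSingleXml(df: DataFrame, outputDir: String): Unit =
  df.coalesce(1)                          // only sensible for small results
    .write
    .format("com.databricks.spark.xml")   // data source provided by spark-xml
    .option("rootTag", "books")
    .option("rowTag", "book")
    .save(outputDir)                      // still created as a directory
```

Even with a single partition, `outputDir` is still a directory; it should simply contain one part file (plus marker files) rather than four.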

1 Answer


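You can use the spark-xml API from Databricks to save a Spark DataFrame as an XML file, something like the following:

val selectedData = df.select("author", "_id")
selectedData.write
    .format("com.databricks.spark.xml")
    .option("rootTag", "books")   // name of the root element wrapping all rows
    .option("rowTag", "book")     // name of the element written for each row
    .save("newbooks.xml")         // Spark creates this path as a directory of part files

This needs the spark-xml dependency in your sbt build:

libraryDependencies += "com.databricks" %% "spark-xml" % "0.4.1"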

KZapagol
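
Since the asker wants one file with an `.xml` extension rather than a folder, one possible follow-up step is to move the single part file out of the output directory after the write. The sketch below is only an illustration under stated assumptions: the output lives on the local filesystem (HDFS would need the Hadoop `FileSystem` API instead), exactly one file whose name starts with `part-` exists, and `SinglePartFile` and `promote` are hypothetical names, not part of Spark or spark-xml.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object SinglePartFile {
  // Moves the only "part-*" file found in outputDir to target and returns target's path.
  // Fails if the directory holds zero or several part files.
  def promote(outputDir: Path, target: Path): Path = {
    val parts = Option(outputDir.toFile.listFiles())
      .getOrElse(Array.empty[java.io.File])
      .filter(f => f.isFile && f.getName.startsWith("part-")) // skips _SUCCESS and hidden .crc files
    require(parts.length == 1, s"expected exactly one part file, found ${parts.length}")
    Files.move(parts.head.toPath, target, StandardCopyOption.REPLACE_EXISTING)
  }
}
```

For example, after writing to a directory such as `newbooks_tmp`, calling `SinglePartFile.promote(Paths.get("newbooks_tmp"), Paths.get("newbooks.xml"))` (with `java.nio.file.Paths` imported) would leave a plain `newbooks.xml` file; the leftover directory can then be deleted.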