2

How to get a single DataFrame from all XML files in an HDFS directory that share the same XML schema, using the Databricks XML parser

Desanth pv
  • 351
  • 1
  • 4
  • 13
  • This is a poorly structured question. You should provide an example of what you have already tried and what isn't working for you. You'll get better answers that way. – Davos Apr 24 '17 at 05:57

3 Answers

4

You can do this using globbing. See the Spark DataFrameReader load method: load can take a single path string, a sequence of paths, or no argument at all for data sources that don't have paths (i.e. not HDFS, S3, or another file system). http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader

val df = sqlContext.read.format("com.databricks.spark.xml")
  .option("inferSchema", "true")
  .option("rowTag", "address") // the element of your XML to be treated as a row
  .load("/path/to/files/*.xml")

load can also take a single string of comma-separated paths (no spaces after the commas):

.load("/path/to/files/File1.xml, /path/to/files/File2.xml")

Or, similar to this answer: Reading multiple files from S3 in Spark by date period

You can also use a sequence of paths

val paths: Seq[String] = ...
val df = sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "address")
  .load(paths: _*)

Note that schema inference is pretty slow for XML. I've not had a lot of success when there are a lot of files involved. Specifying a schema works better. If you can guarantee that your XML files all have the same schema, you could use a small sample of them to infer the schema and then load the rest with it. I don't think that's entirely safe though, because an XML file can still be "valid" even if it is missing some nodes or elements with respect to an XSD.
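A minimal sketch of that sample-then-reuse approach, assuming the same rowTag of "address" and hypothetical file paths:

// Infer the schema once from a single sample file (hypothetical path)
val sampleSchema = sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "address")
  .load("/path/to/files/File1.xml")
  .schema

// Reuse it for the full load so Spark skips inference over every file
val df = sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "address")
  .schema(sampleSchema)
  .load("/path/to/files/*.xml")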

Davos
  • 5,066
  • 42
  • 66
  • If I have multiple XML files which have different root tags, e.g. address, department, jobType etc., then how can I load them in parallel? – GPopat Dec 03 '19 at 16:22
  • @GaurangPopat if you have one xml with root tag `address` and another with root tag `department` then how could they possibly fit into the same schema? If you want to combine them into one table then they sound like different fields to me, or perhaps they are not root tags and you should go higher in the xml path. – Davos Dec 04 '19 at 05:32
  • @Davos Let me explain my situation. I have 50 XML files that have a root tag of address and another 50 XML files that have a root tag of department. All 100 exist in the same folder. I want to process them in the most efficient way, i.e. distributing the load across the cluster while loading/transforming, etc. There is no relation between the Department and Address XMLs. – GPopat Dec 05 '19 at 13:44
  • @GaurangPopat I think you need a new question to answer that, but I still ask you the same question: how do you expect two different schemas to load into the same dataframe? Load two dataframes and then do things with them, as sketched below. You can also just load them as text files and process the schema after they are loaded. – Davos Dec 06 '19 at 01:13
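A minimal sketch of the two-dataframe approach from the comments, assuming a hypothetical folder /data/xml holding both sets of files; spark-xml only emits rows for elements matching the rowTag, so files without that tag contribute nothing:

val addresses = sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "address")
  .load("/data/xml/*.xml") // rows come only from <address> elements

val departments = sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "department")
  .load("/data/xml/*.xml") // same folder, different row tag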
0

I see that you want to read the XML data by reading each file separately and processing it individually. Below is a skeleton of how that would look.

import scala.xml.XML

// wholeTextFiles returns an RDD of (path, content) pairs, one per file
val rdd1 = sc.wholeTextFiles("/data/tmp/test/*")

// parse each file's content into a scala.xml.Elem
val xml = rdd1.map(x => XML.loadString(x._2))
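From there you can pull values out with the standard scala.xml selectors; a hypothetical example assuming each document has a <city> child element:

val cities = xml.map(doc => (doc \ "city").text)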

BalaramRaju
  • 439
  • 2
  • 8
0

Set up your Maven dependency for spark-xml:

https://mvnrepository.com/artifact/com.databricks/spark-xml_2.10/0.2.0
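The coordinates from that page as a Maven dependency (version 0.2.0 as linked; newer releases exist):

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.10</artifactId>
    <version>0.2.0</version>
</dependency>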

Then use the code below in your Spark program to read HDFS XML files and create a single dataframe:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "address") // the row tag of your XML files to treat as a row
  .load("file.xml")

val selectedResult = df.select("city", "zipcode")

selectedResult.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "address") // the root tag of your XML files to treat as the root
  .option("rowTag", "address")
  .save("result.xml")

Find the complete example on GitHub:

https://github.com/databricks/spark-xml/blob/master/README.md

khushbu kanojia
  • 250
  • 1
  • 3