8

I am trying to read XML/nested XML in PySpark using the spark-xml jar.

df = sqlContext.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", "hierachy") \
    .load("test.xml")

When I execute it, the data frame is not created properly:

    +--------------------+
    |                 att|
    +--------------------+
    |[[1,Data,[Wrapped...|
    +--------------------+

The XML format I have is shown below:

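(The original post showed the XML as a screenshot; the reconstruction below is inferred from the schema and output in the accepted answer.)

<hierarchy>
    <att>
        <Order>1</Order>
        <attval>Data</attval>
        <children>
            <att>
                <Order>1</Order>
                <attval>Studyval</attval>
            </att>
            <att>
                <Order>2</Order>
                <attval>Site</attval>
            </att>
        </children>
    </att>
    <att>
        <Order>2</Order>
        <attval>Info</attval>
        <children>
            <att>
                <Order>1</Order>
                <attval>age</attval>
            </att>
            <att>
                <Order>2</Order>
                <attval>gender</attval>
            </att>
        </children>
    </att>
</hierarchy>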

– LUZO

3 Answers

12

hierarchy should be the rootTag and att should be the rowTag:

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "hierarchy") \
    .option("rowTag", "att") \
    .load("test.xml")

and you should get

+-----+------+----------------------------+
|Order|attval|children                    |
+-----+------+----------------------------+
|1    |Data  |[[[1, Studyval], [2, Site]]]|
|2    |Info  |[[[1, age], [2, gender]]]   |
+-----+------+----------------------------+

and the schema

root
 |-- Order: long (nullable = true)
 |-- attval: string (nullable = true)
 |-- children: struct (nullable = true)
 |    |-- att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Order: long (nullable = true)
 |    |    |    |-- attval: string (nullable = true)

You can find more information in the databricks spark-xml documentation.

– Ramesh Maharjan
  • I tried executing it, but I am getting the 'children' column as a WrappedArray: [WrappedArray([1,...| – LUZO May 20 '18 at 06:21
  • Can you share the schema you are getting? – Ramesh Maharjan May 20 '18 at 07:23
  • StructType(List(StructField(Order,LongType,true),StructField(attval,StringType,true),StructField(children,StructType(List(StructField(att,ArrayType(StructType(List(StructField(Order,LongType,true),StructField(attval,StringType,true))),true),true))),true))) – LUZO May 20 '18 at 07:32
  • AFAIK this just won't work. `rowTag` cannot be repeated deeper in the hierarchy. – Alper t. Turker May 20 '18 at 10:26
  • @user9613318 I tested it with the exact data that the OP had posted. I answered only after testing it successfully :) Try it if you think it won't work. – Ramesh Maharjan May 20 '18 at 11:32
  • The exact data is way too small to correctly test parser behavior, especially in `local` mode. Spark XML will seek the file for the first occurrence of the row tag from the split offset and start parsing at that point. So no, it is not deterministic behavior (just as testing on a few records in local mode is not proof of correctness in general). – Alper t. Turker May 20 '18 at 11:34
  • I never thought so deeply. The user has asked a new question anyway: https://stackoverflow.com/questions/50433468/how-to-convert-multiple-row-tag-xml-files-to-dataframe – Ramesh Maharjan May 20 '18 at 11:40
  • It is better to. Spark is a leaky abstraction, and "seems to work" is just not good enough, especially if things can fail silently. – Alper t. Turker May 20 '18 at 13:00
3

Databricks has released a new version of spark-xml to read XML into a Spark DataFrame:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.12</artifactId>
    <version>0.6.0</version>
</dependency>
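
If you are not building with Maven or sbt, the same coordinates can also be pulled in when starting a shell, for example:

spark-shell --packages com.databricks:spark-xml_2.12:0.6.0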

The input XML file I used in this example is available in the GitHub repository.

// the .xml() method on DataFrameReader comes from the spark-xml implicits
import com.databricks.spark.xml._

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "person")
  .xml("persons.xml")

Schema

root
 |-- _id: long (nullable = true)
 |-- dob_month: long (nullable = true)
 |-- dob_year: long (nullable = true)
 |-- firstname: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- salary: struct (nullable = true)
 |    |-- _VALUE: long (nullable = true)
 |    |-- _currency: string (nullable = true)

Outputs:

+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename|         salary|
+---+---------+--------+---------+------+--------+----------+---------------+
|  1|        1|    1980|    James|     M|   Smith|      null|  [10000, Euro]|
|  2|        6|    1990|  Michael|     M|    null|      Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+
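
Note how spark-xml represents the salary element: by default, XML attributes become struct fields prefixed with _ and the element text becomes _VALUE, which is why salary is a struct. A small sketch (in PySpark, matching the question) of pulling those out as plain columns:

from pyspark.sql.functions import col

# _VALUE holds the element text, _currency the XML attribute
df.select(
    col("firstname"),
    col("salary._VALUE").alias("salary_amount"),
    col("salary._currency").alias("salary_currency"),
).show()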

Note that the Spark XML API has some limitations, which are discussed here: Spark-XML API Limitations.

Hope this helps!

– NNK
3

You can use the Databricks jar to parse XML into a dataframe. You can use Maven or sbt to pull in the dependency, or you can use the jar directly with spark-submit:

pyspark --jars /home/sandipan/Downloads/spark_jars/spark-xml_2.11-0.6.0.jar
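
For a non-interactive job you can pass the same jar to spark-submit instead (the script name here is just a placeholder):

spark-submit --jars /home/sandipan/Downloads/spark_jars/spark-xml_2.11-0.6.0.jar parse_sms.py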

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "SmsRecords") \
    .option("rowTag", "sms") \
    .load("/home/sandipan/Downloads/mySMS/Sms/backupinfo.xml")

Schema:

>>> df.printSchema()
root
 |-- address: string (nullable = true)
 |-- body: string (nullable = true)
 |-- date: long (nullable = true)
 |-- type: long (nullable = true)

>>> df.select("address").distinct().count()
530 

Follow this: http://www.thehadoopguy.com/2019/09/how-to-parse-xml-data-to-saprk-dataframe.html