
I am reading an XML file in Scala:

<tag1>
  <tag2 id="0" attr1="abc" ... />
  ..
</tag1>

This was already reported as an issue and closed: https://github.com/databricks/spark-xml/pull/303

However, I am not able to resolve it.

import org.apache.spark.sql.SparkSession
import com.databricks.spark.xml._
import org.apache.spark.sql.types.{StructType, StructField, DoubleType,StringType}
import org.apache.spark.sql.{Row, SaveMode}

object stack {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.getOrCreate()

    val customSchema = StructType(Array(
      StructField("id", DoubleType, nullable = true),
      StructField("attr1", StringType, nullable = true),
      ...
      ...
    ))  
    val df = spark.read
        .option("rowTag", "tag2")
        .format("com.databricks.spark.xml")
        .schema(customSchema)
        .load("dummy.xml")

    import spark.sql
    import spark.implicits._

    df.createOrReplaceTempView("temp1")
    sql("SELECT * from temp1 limit 5").show()
  }
}

However, df.show(5) displays no rows.

The resolution talks about using XmlInputFormat, which I have not tried; if someone can guide me, it would be helpful.

A similar solution works with a nested XML file:

<books>
  <book> .. </book>
  <name> abc </name>
</books>

I want to see the DataFrame with its values. Later, I want to read many XML files and join them in a SQL query.

Mikhail Ionkin
maks

2 Answers


You need to add an _ prefix to the attribute names.

Data (dummy.xml):

<tag1>
    <tag2 id="0" attr1="abc"/>
    <tag2 id="1" attr1="abd" />
    <tag2 id="2" attr1="abd" />
</tag1>

Solution:

package main

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object Main extends App {
  val spark = SparkSession.builder.config("spark.master", "local").getOrCreate()

  val customSchema = StructType(Array(
    StructField("_id", DoubleType, nullable = true),
    StructField("_attr1", StringType, nullable = true)
  ))
  val df = spark.read
    .option("rowTag", "tag2")
    .format("com.databricks.spark.xml")
    .schema(customSchema)
    .load("dummy.xml")
  import spark.sql

  df.createOrReplaceTempView("temp1")
  sql("SELECT * from temp1 limit 5").show()
}

Result:

+---+------+
|_id|_attr1|
+---+------+
|0.0|   abc|
|1.0|   abd|
|2.0|   abd|
+---+------+

How I got it:

  1. Suspected the problem was with the schema, because it works with child elements
  2. Removed (or commented out) the custom schema (// .schema(customSchema))
  3. Printed the schema that Spark resolves (df.printSchema())
  4. Found what was needed
  5. Created a new schema
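
The debugging steps above can be sketched as follows (a minimal sketch, reusing the same dummy.xml and the spark session from the answer's code):

```scala
// Let spark-xml infer the schema instead of supplying one, then inspect it.
val inferred = spark.read
  .option("rowTag", "tag2")
  .format("com.databricks.spark.xml")
  .load("dummy.xml")

// Attributes appear with a leading underscore in the inferred schema,
// e.g. _id and _attr1, which is what the custom schema must match.
inferred.printSchema()
```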

See also: Extracting tag attributes from xml using sparkxml

PS: Sorry for my English

Mikhail Ionkin
  • Hi Mikhail, thanks for the response. This seems a correct approach however still when i do printschema it shows nothing. i am running it like below: SPARK_HOME/bin/spark-submit --master local[*] --packages com.databricks:spark-xml_2.11:0.4.1 target/scala-2.11/stack_2.11-1.0.jar – maks Jul 03 '19 at 03:01
  • @maks If you want, you can mark the answer as "accepted". I don't understand why you need to use SPARK_HOME. I just use IntelliJ Community and click to run the app, after adding the libraries to build.sbt. Sorry for my English. – Mikhail Ionkin Jul 03 '19 at 21:21
  • It is actually $SPARK_HOME, to run spark-submit. I ran it on a local installation on Linux. – maks Jul 04 '19 at 19:43

Thanks, Mikhail, for the guidance; the issue turned out to be very small. Sorry for not providing the actual XML record earlier, since the issue was in the attributes.

<users>
    <row Id="-1" Reputation="1" ..... />
</users>

The attribute names started with capital letters; when I changed them to lowercase, my solution started working (of course, I printed the schema before using it, as Mikhail suggested).
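
Since the schema field names have to match the attribute names exactly, an alternative to renaming the attributes would have been to match their case in the schema. A minimal sketch, assuming a hypothetical users.xml containing rows like the one above:

```scala
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Field names mirror the attributes exactly, including the capital letters,
// with the _ prefix that spark-xml uses for attributes.
val usersSchema = StructType(Array(
  StructField("_Id", DoubleType, nullable = true),
  StructField("_Reputation", StringType, nullable = true)
))

val users = spark.read
  .option("rowTag", "row")
  .format("com.databricks.spark.xml")
  .schema(usersSchema)
  .load("users.xml")
```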

maks