From what I'm seeing, your processable entity should be metadata > person > personalName, since Superman and Batman are not the same hero, right? If so, when you specified metadata as the rowTag, Spark took everything under metadata and transformed it into a row with the following schema:
root
|-- person: struct (nullable = true)
| |-- personalName: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- surname: string (nullable = true)
When you change the rowTag to personalName, you get the following schema:
root
|-- surname: string (nullable = true)
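A minimal sketch of that read, assuming the same test_xml.xml file and sparkSession as in the example below:

```scala
// Pointing the rowTag one level deeper makes each <personalName> its own row,
// so surname becomes a top-level string column.
val names = sparkSession.read.format("com.databricks.spark.xml")
  .option("rowTag", "personalName")
  .load("./test_xml.xml")

names.printSchema()
```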
If your expected schema should still be rooted at metadata, you can operate on the row with Apache Spark SQL array functions, for instance like this:
<metadata>
<person>
<personalName>
<surname>Superman</surname>
</personalName>
<personalName>
<surname>Batman</surname>
</personalName>
</person>
</metadata>
<metadata>
<person>
<personalName>
<surname>XX</surname>
</personalName>
</person>
</metadata>
<metadata>
<person>
<personalName>
<surname>EEEE</surname>
</personalName>
<personalName>
<surname>Batman</surname>
</personalName>
<personalName>
<surname>XXXXX</surname>
</personalName>
</person>
</metadata>
import org.apache.spark.sql.functions
import sparkSession.implicits._  // enables the $"..." column syntax

val df = sparkSession.read.format("com.databricks.spark.xml")
  .option("rowTag", "metadata")
  .load("./test_xml.xml")

// keep only the rows whose personalName array contains the surname "Batman"
df.filter(functions.array_contains($"person.personalName.surname", "Batman")).show(false)
Which should return:
+-----------------------------+
|person |
+-----------------------------+
|[[[Superman], [Batman]]] |
|[[[EEEE], [Batman], [XXXXX]]]|
+-----------------------------+
You can find a pretty good listing of Spark SQL array functions here: Querying Spark SQL DataFrame with complex types
Update
Searching when there is that struct or repeating block so in my case multiple personalNames
For that you can use Apache Spark higher-order and array functions. If you search for "Spark higher-order functions", you will find plenty of blog posts and documentation about them. Above I gave you an example of filtering on an array.
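As another illustration, here is a sketch of a higher-order function applied to the same array. It assumes Spark 2.4+, where the SQL filter(array, lambda) function is available, and the same df as above:

```scala
import org.apache.spark.sql.functions.expr

// filter(array, lambda) is a Spark SQL higher-order function (Spark 2.4+).
// Per row, keep only the personalName entries whose surname starts with "B".
df.select(expr("filter(person.personalName, n -> n.surname LIKE 'B%') AS bNames"))
  .show(false)
```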
How to return everything from the level
It depends on your schema. With rowTag=metadata, Spark returns a row composed of a single field, a struct called person, which contains only the array personalName. If you add more fields inside, they will still be queryable:
<metadata>
<person>
<id>1</id>
<age>30</age>
<personalName>
<surname>Superman</surname>
</personalName>
<personalName>
<surname>Batman</surname>
</personalName>
</person>
</metadata>
<metadata>
<person>
<id>2</id>
<age>40</age>
<personalName>
<surname>XX</surname>
</personalName>
<personalName>
<surname>EEEEE</surname>
</personalName>
</person>
</metadata>
<metadata>
<person>
<id>3</id>
<age>50</age>
<personalName>
<surname>EEEE</surname>
</personalName>
<personalName>
<surname>Batman</surname>
</personalName>
<personalName>
<surname>XXXXX</surname>
</personalName>
</person>
</metadata>
With:
// df read with rowTag=metadata, as above
df.filter("person.id > 0")
  .filter(functions.array_contains($"person.personalName.surname", "Batman"))
  .show(false)
The result will be:
+------------------------------------+
|person |
+------------------------------------+
|[30, 1, [[Superman], [Batman]]] |
|[50, 3, [[EEEE], [Batman], [XXXXX]]]|
+------------------------------------+
Otherwise, if you move the rowTag to person, your schema loses a level, so it becomes easier to write queries:
df.printSchema()
df.filter("id > 0")
.filter(functions.array_contains($"personalName.surname", "Batman"))
.show(false)
For a schema and data like this:
root
|-- age: long (nullable = true)
|-- id: long (nullable = true)
|-- personalName: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- surname: string (nullable = true)
+---+---+---------------------------+
|age|id |personalName |
+---+---+---------------------------+
|30 |1 |[[Superman], [Batman]] |
|50 |3 |[[EEEE], [Batman], [XXXXX]]|
+---+---+---------------------------+
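For completeness, a sketch of the read that produces that flattened schema, assuming the same test_xml.xml file:

```scala
// Moving the rowTag one level down to <person> flattens the schema:
// id, age and personalName become top-level columns.
val df = sparkSession.read.format("com.databricks.spark.xml")
  .option("rowTag", "person")
  .load("./test_xml.xml")

df.printSchema()
```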