
I'm new to Spark and Python - working in a Jupyter notebook.

Say I have the following XML:

<metadata>
    <person>
        <personalName>
            <surname>The Batmaner</surname>
        </personalName>
        <personalName>
            <surname>Batman</surname>
        </personalName>
    </person>
</metadata>

I've loaded this into a DataFrame using:

df = spark.read.format("com.databricks.spark.xml").options(rowTag='metadata').load(path)

Now that I can see the data in df, I want to pull out the whole <person> section where surname == Batman.

Essentially, I'm confused about 2 bits:

  • Searching when there is a struct or repeating block, i.e. in my case multiple personalName elements
  • How to return everything from the <person> level

Any help would be appreciated.

Thanks.

userMod2

1 Answer


From what I'm seeing, your processable entity should be metadata > person > personalName, since Superman and Batman are not the same hero, right? If so: when you specified metadata as the rowTag, Spark took everything under metadata and turned it into a row with the following schema:

root
 |-- person: struct (nullable = true)
 |    |-- personalName: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- surname: string (nullable = true)
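
In PySpark (which you're using in your notebook) you can check this yourself with something along these lines (a sketch reusing your existing spark session and path):

# Read with rowTag=metadata and inspect the inferred schema
df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "metadata") \
    .load(path)
df.printSchema()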

If you change the rowTag to personalName, you will get the following schema instead:

root
 |-- surname: string (nullable = true)

If your rows still need to be read at the metadata level, you should operate on them with Apache Spark SQL array functions. For instance, take this input file (test_xml.xml):

<metadata>
    <person>
        <personalName>
            <surname>Superman</surname>
        </personalName>
        <personalName>
            <surname>Batman</surname>
        </personalName>
    </person>
</metadata>
<metadata>
    <person>
        <personalName>
            <surname>XX</surname>
        </personalName>
    </person>
</metadata>
<metadata>
    <person>
        <personalName>
            <surname>EEEE</surname>
        </personalName>
        <personalName>
            <surname>Batman</surname>
        </personalName>
        <personalName>
            <surname>XXXXX</surname>
        </personalName>
    </person>
</metadata>
Loading and filtering it with:

import org.apache.spark.sql.functions
import sparkSession.implicits._

val df = sparkSession.read.format("com.databricks.spark.xml")
  .option("rowTag", "metadata")
  .load("./test_xml.xml")

df.filter(functions.array_contains($"person.personalName.surname", "Batman")).show(false)

Which should return:

+-----------------------------+
|person                       |
+-----------------------------+
|[[[Superman], [Batman]]]     |
|[[[EEEE], [Batman], [XXXXX]]]|
+-----------------------------+

You can find a pretty good listing of Spark SQL array functions here: Querying Spark SQL DataFrame with complex types
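
Since you're working in PySpark, the same filter would look roughly like this (a sketch, reusing the spark session from your notebook and the same test_xml.xml file):

from pyspark.sql import functions as F

df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "metadata") \
    .load("./test_xml.xml")

# person.personalName.surname resolves to an array of strings,
# so array_contains keeps the rows whose array holds "Batman"
df.filter(F.array_contains("person.personalName.surname", "Batman")) \
    .show(truncate=False)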


Update

Searching when there is a struct or repeating block, i.e. in my case multiple personalName elements

For that you can use Apache Spark's higher-order and array functions. If you search for "Spark higher-order functions", you will find a lot of blog posts and documentation about them. Above, I gave you an example of filtering arrays.
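
For example, assuming Spark 2.4 or later, the SQL higher-order function exists can express the same check as the array_contains filter above; in PySpark that would be roughly:

from pyspark.sql.functions import expr

# exists(array, lambda) returns true when any element matches the predicate,
# here: any personalName struct whose surname is 'Batman'
df.filter(expr("exists(person.personalName, p -> p.surname = 'Batman')")) \
    .show(truncate=False)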

How to return everything from the <person> level

It depends on your schema. With rowTag=metadata, Spark returns a row composed of a single field: a struct called person, which here contains only the personalName array. If you add more fields inside, they will still be queryable:

<metadata>
    <person>
        <id>1</id>
        <age>30</age>
        <personalName>
            <surname>Superman</surname>
        </personalName>
        <personalName>
            <surname>Batman</surname>
        </personalName>
    </person>
</metadata>
<metadata>
    <person>
        <id>2</id>
        <age>40</age>
        <personalName>
            <surname>XX</surname>
        </personalName>
        <personalName>
            <surname>EEEEE</surname>
        </personalName>
    </person>
</metadata>
<metadata>
    <person>
        <id>3</id>
        <age>50</age>
        <personalName>
            <surname>EEEE</surname>
        </personalName>
        <personalName>
            <surname>Batman</surname>
        </personalName>
        <personalName>
            <surname>XXXXX</surname>
        </personalName>
    </person>
</metadata>

With:

df.filter("person.id > 0")
  .filter(functions.array_contains($"person.personalName.surname", "Batman"))
  .show(false)

The result will be:

+------------------------------------+
|person                              |
+------------------------------------+
|[30, 1, [[Superman], [Batman]]]     |
|[50, 3, [[EEEE], [Batman], [XXXXX]]]|
+------------------------------------+

Otherwise, if you move the rowTag down to person, your schema loses a level, so it becomes easier to write queries:

df.printSchema()
df.filter("id > 0")
  .filter(functions.array_contains($"personalName.surname", "Batman"))
  .show(false)

For a schema and data like this:

root
 |-- age: long (nullable = true)
 |-- id: long (nullable = true)
 |-- personalName: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- surname: string (nullable = true)

+---+---+---------------------------+
|age|id |personalName               |
+---+---+---------------------------+
|30 |1  |[[Superman], [Batman]]     |
|50 |3  |[[EEEE], [Batman], [XXXXX]]|
+---+---+---------------------------+
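
In PySpark, the rowTag=person variant would look something like this (again a sketch, assuming the same test_xml.xml file; personalName.surname resolves to an array of strings):

from pyspark.sql import functions as F

df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "person") \
    .load("./test_xml.xml")

df.printSchema()
df.filter("id > 0") \
    .filter(F.array_contains("personalName.surname", "Batman")) \
    .show(truncate=False)
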
Bartosz Konieczny
  • Hi, thanks for your answer. I've updated my question, as using Superman wasn't clear. What I meant to show is that the two names are actually different but refer to the same person. – userMod2 Jul 12 '19 at 08:33
  • Thanks. I added an update. Still, I'm not sure I understand your questions correctly :/ – Bartosz Konieczny Jul 12 '19 at 10:09
  • Thanks, that helped. By the way, when working with a large XML file in the GBs, is `.filter(functions.array_contains($"personalName.surname", "Batman"))` performant? I found it takes a little while. Or is there an alternative? – userMod2 Jul 13 '19 at 00:58
  • Another approach would consist of "unflattening" the array with the explode function and applying a simple filter operation on it. But it's an extra operation on the dataset, so it probably won't change a lot. You can see an example here: https://stackoverflow.com/questions/46794062/spark-sql-to-explode-array-of-structure?rq=1 One thing that can also help is defining the schema manually instead of inferring it; check the `inferSchema` parameter at https://github.com/databricks/spark-xml and how to specify a schema at https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#programmatically-specifying-the-schema – Bartosz Konieczny Jul 13 '19 at 16:48
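
For reference, the explode-based alternative mentioned in the last comment could look roughly like this in PySpark (a sketch; column names assume rowTag=metadata as above):

from pyspark.sql import functions as F

# Explode the personalName array so each struct becomes its own row,
# then filter on the plain surname field and keep the original person struct
exploded = df.select("person", F.explode("person.personalName").alias("name"))
exploded.filter(F.col("name.surname") == "Batman") \
    .select("person") \
    .show(truncate=False)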