
I am learning Spark. I have the XML below, from which I want to read two values and create two different columns:

<appRoot>
    <applist>
      <app type="test">
        <code>8.52544</code>
      </app>
      <app type="dev">
        <code>8.52537</code>
      </app>
    </applist>
</appRoot>

I want:

if type="test", then it should set the value (i.e. 8.52544) in a new column "app_test", and

if type="dev", then it should set the value (i.e. 8.52537) in a new column "app_dev".

I tried the following:

df.select(
  functions.when($"applist.app._type" === "test", $"applist.app.code").as("app_test"),
  functions.when($"applist.app._type" === "dev", $"applist.app.code").as("app_dev"))

but it returns

app_test with value [8.52544, 8.52537]
app_dev with value [8.52544, 8.52537]

How can I differentiate between the types while setting the value of each column?

Update:

val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rootTag", "appRoot")
      .option("rowTag", "applist")
      .load("test.xml")

    df.printSchema()



root
 |-- app: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _type: string (nullable = true)
 |    |    |-- code: double (nullable = true)
  • @vaquarkhan FYI, I have not downvoted your answer and if you check my first version of question, it was always to fetch value from an array type column into two separate columns. – happy Sep 21 '20 at 16:53
  • I am happy you got your answer :) – vaquar khan Sep 21 '20 at 17:06

2 Answers


Using `map_from_entries`, the array can be transformed into a map, using `_type` as keys and `code` as values. The map can then be used to create the two new columns.

val df = spark.read.format("com.databricks.spark.xml")
   .option("rootTag", "appRoot").option("rowTag","applist")
   .load(...)

df.withColumn("map", map_from_entries('app))
   .withColumn("app_test", 'map("test"))
   .withColumn("app_dev", 'map("dev"))
   .drop('map)
   .show(false)

prints (with some additional test data)

+--------------------------------------------+--------+-------+
|app                                         |app_test|app_dev|
+--------------------------------------------+--------+-------+
|[[test, 8.52544], [dev, 8.52537], [x, 1.23]]|8.52544 |8.52537|
|[[test, 1.2], [dev, 3.4], [X, 9.9]]         |1.2     |3.4    |
|[[dev, -5.6], [Z, -12.9]]                   |null    |-5.6   |
+--------------------------------------------+--------+-------+
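For readers who want to try this without the XML file and the spark-xml package, here is a minimal, self-contained sketch of the same `map_from_entries` technique on an in-memory dataframe. It assumes a local `SparkSession` named `spark` and Spark 2.4+ (where `map_from_entries` was introduced); the struct fields are named `_1`/`_2` here because they come from Scala tuples, rather than `_type`/`code` as in the XML schema:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Each row holds an array of (type, code) pairs, mimicking the parsed XML
val df = Seq(
  Seq(("test", 8.52544), ("dev", 8.52537)),
  Seq(("dev", -5.6))
).toDF("app")

df.withColumn("map", map_from_entries($"app"))  // array<struct> -> map
  .withColumn("app_test", $"map"("test"))       // lookup by key; null if absent
  .withColumn("app_dev", $"map"("dev"))
  .drop("map")
  .show(false)
```

Looking up a missing key (like "test" in the second row) yields null rather than an error, which matches the output table above.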
werner
  • What did you specify in the rootTag and rowTag for reading the XML? Since specifying rowTag as applist gets only single row input even if the xml contains multiple app tags. – Shrey Jakhmola Sep 22 '20 at 11:33
  • 1
    @ShreyJakhmola I have added the code for loading the xml. In my example xml file [here](https://pastebin.com/u3ccpxXY), I have added multiple `applist` entries to produce the additional rows. But you can savely ignore these additional lines if you don't need them – werner Sep 22 '20 at 19:47

You can do this as follows.

First, read the data into a dataframe:

val df = spark.read.format("com.databricks.spark.xml")
  .option("rootTag", "appRoot")
  .option("rowTag", "applist")
  .load("/home/shrey/Downloads/mydoc.xml")
df.printSchema()
df.show()

(screenshot showing the printed schema and the dataframe with its `app` array column)

Then, you can select the appropriate columns in a when condition by exploding the array:

df.select(explode($"app").as("app"))
  .select(
    functions.when($"app._type" === "test", $"app.code").as("app_test"),
    functions.when($"app._type" === "dev", $"app.code").as("app_dev"))
  .show(false)

(screenshot showing the exploded rows with the `app_test` and `app_dev` columns)

If you want the output in one row, you can aggregate it using the max() function:

df.select(explode($"app").as("app"))
  .select(
    functions.when($"app._type" === "test", $"app.code").as("app_test"),
    functions.when($"app._type" === "dev", $"app.code").as("app_dev"))
  .agg(max($"app_test").as("app_test"), max($"app_dev").as("app_dev"))
  .show(false)
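The max() aggregation works here because when() without an otherwise() clause produces null for non-matching rows, and max ignores nulls. A minimal, self-contained sketch of this explode + when + agg flow on an in-memory dataframe (assuming a local `SparkSession` named `spark`; struct fields are `_1`/`_2` from Scala tuples instead of `_type`/`code`):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(Seq(("test", 8.52544), ("dev", 8.52537))).toDF("app")

df.select(explode($"app").as("app"))          // one row per array element
  .select(
    when($"app._1" === "test", $"app._2").as("app_test"),  // null on non-match
    when($"app._1" === "dev", $"app._2").as("app_dev"))
  .agg(max($"app_test").as("app_test"),       // collapse to a single row,
       max($"app_dev").as("app_dev"))         // max skips the nulls
  .show(false)
```

Note that if the same type appeared more than once per row, max would silently pick the largest code rather than fail, which may or may not be the behavior you want.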


Shrey Jakhmola
  • please check my updated xml, I have added a root tag. The desired field, i.e. "app", is of type array – happy Sep 21 '20 at 14:19