1

Within my DataFrame object I have a column Foos, as an example

<?xml version="1.0" encoding="utf-8"?> <foos> <foo id="123" X="58" Y="M" /> <foos id="456" X="29" Y="M" /> <foos id="789" X="44" Y="F" /> </foos>

Each <foo> has a foo id, X and Y attribute that I want to create a column for each.

How can I parse the XML such that I can create new columns for each attribute? Does this require a UDF for each attribute, or is it possible to extract all three into separate columns in one function?

So far I receive an error with:

parsed = (lambda x: ET.fromstring(x).find('X').text)
udf = udf(parsed)
parsed_df = df.withColumn("X Column", udf("Foos"))
idclark
  • 948
  • 1
  • 8
  • 27
  • Your XML ET query is wrong. There is no direct child called 'X'. Check https://docs.python.org/3/library/xml.etree.elementtree.html for the correct usage. – mck Nov 06 '20 at 07:33
  • a detailed read here - https://stackoverflow.com/questions/50429315/read-xml-in-spark – dsk Nov 06 '20 at 08:12

1 Answers1

0

As mck suggested the xml doesn't look a correct one , you can install a maven package - com.databricks:spark-xml_2.11:0.10.0 and direcrly read a xml file using spark.read

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", "foos") \
    .load("/FileStore/tables/test.xml")
df.show(truncate=False)

and this is what I am getting as per your xml file provided , you might need to look into the xml file

+--------------+--------------------------------+
|foo           |foos                            |
+--------------+--------------------------------+
|[, 58, M, 123]|[[, 29, M, 456], [, 44, F, 789]]|
+--------------+--------------------------------+
dsk
  • 1,863
  • 2
  • 10
  • 13