Within my DataFrame object I have a column Foos, for example:
<?xml version="1.0" encoding="utf-8"?> <foos> <foo id="123" X="58" Y="M" /> <foo id="456" X="29" Y="M" /> <foo id="789" X="44" Y="F" /> </foos>
Each <foo> has an id, X and Y attribute, and I want to create a column for each of them.
How can I parse the XML such that I can create new columns for each attribute? Does this require a UDF for each attribute, or is it possible to extract all three into separate columns in one function?
So far I receive an error with:
parsed = (lambda x: ET.fromstring(x).find('X').text)
udf = udf(parsed)
parsed_df = df.withColumn("X Column", udf("Foos"))
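
One possible direction, as a minimal sketch rather than a definitive answer: id, X and Y are attributes, not child elements, so find('X') returns None and .text then fails. A single function can instead read all three attributes with .get(). The names parse_foos and sample are hypothetical, and the Spark wiring in the trailing comments assumes a DataFrame df with a string column Foos; that part is not run here.

```python
import xml.etree.ElementTree as ET


def parse_foos(xml_string):
    """Return one (id, X, Y) tuple per child element of the root."""
    if xml_string is None:
        return []
    # Encode to bytes so the XML encoding declaration is handled uniformly.
    root = ET.fromstring(xml_string.encode("utf-8"))
    # Attributes are read with .get(); .find() only looks for child elements.
    return [(child.get("id"), child.get("X"), child.get("Y")) for child in root]


# Sample value shaped like the Foos column in the question.
sample = (
    '<?xml version="1.0" encoding="utf-8"?> <foos> '
    '<foo id="123" X="58" Y="M" /> '
    '<foo id="456" X="29" Y="M" /> '
    '<foo id="789" X="44" Y="F" /> </foos>'
)
rows = parse_foos(sample)

# Sketch of the Spark side (assumes pyspark is installed and df exists):
# from pyspark.sql.functions import udf, explode
# from pyspark.sql.types import ArrayType, StructType, StructField, StringType
# schema = ArrayType(StructType([
#     StructField("id", StringType()),
#     StructField("X", StringType()),
#     StructField("Y", StringType()),
# ]))
# parse_udf = udf(parse_foos, schema)
# parsed_df = (df.withColumn("foo", explode(parse_udf("Foos")))
#                .select("*", "foo.id", "foo.X", "foo.Y"))
```

Returning an array of structs from one UDF and exploding it gives one row per <foo> with the three attributes as separate columns, so a UDF per attribute should not be needed. Binding the result to a name other than udf also avoids shadowing the imported udf function.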