I have xml and need to flat it to one row.
<?xml version="1.0" encoding="ISO-8859-1"?>
<DMessage xmlns:out="http://......">
<Message>
<Header Action="update" UpdatedBy="LUB"/>
<IP>
<Identifiers>
<Id Type="ID">1234</Id>
<Id Type="ID2">5678</Id>
</Identifiers>
<Names>
<Name NameType="Default">
<NameComponent NameComponentType="FullName">uncle</NameComponent>
<NameComponent NameComponentType="ShortName">sam</NameComponent>
</Name>
</Names>
<Parents>
<Parent ParentType="T1">
<Id Type="ID">NONE</Id>
<Id Type="ID2">0</Id>
</Parent>
<Parent ParentType="T2">
<Id Type="ID"></Id>
<Id Type="D2"></Id>
</Parent>
</Parents>
</IP>
</Message>
</DMessage>
I need to have one row with all possible values:
Action UpdatedBy Identifier_ID Identifier_ID2 FullName ShortName Parent_T1_ID Parent_T1_ID2 Parent_T2_ID Parent_T1_ID2
update LUB 1234 5678 uncle sam NONE 0
My setup:
spark.read()
.format("com.databricks.spark.xml")
.option("rootTag", "out:Message")
.load(filePath.toString())
And I have tried to have method to flat it out for all values, but no luck:
private static List<Column> flattenSchema(StructType schema, String prefix) {
List<Column> out = new ArrayList<>();
Arrays.stream(schema.fields()).forEach(f -> {
String colName = prefix == null ? f.name() : prefix + "." + f.name();
if (f.dataType() instanceof StructType) {
flattenSchema(new StructType(new StructField[]{f}), colName);
} else {
out.add(new Column(colName));
}
});
return out;
}
What would the correct and easy way to proceed?