1

I have xml and need to flat it to one row.

<?xml version="1.0" encoding="ISO-8859-1"?>
    <DMessage xmlns:out="http://......">
      <Message>
        <Header Action="update" UpdatedBy="LUB"/>
        <IP>
          <Identifiers>
            <Id Type="ID">1234</Id>
            <Id Type="ID2">5678</Id>
          </Identifiers>
          <Names>
            <Name NameType="Default">
              <NameComponent NameComponentType="FullName">uncle</NameComponent>
              <NameComponent NameComponentType="ShortName">sam</NameComponent>
            </Name>
          </Names>
          <Parents>
            <Parent ParentType="T1">
              <Id Type="ID">NONE</Id>
              <Id Type="ID2">0</Id>
            </Parent>
            <Parent ParentType="T2">
              <Id Type="ID"></Id>
              <Id Type="D2"></Id>
            </Parent>
          </Parents>
        </IP>
      </Message>
    </DMessage>

I need to have one row with all possible values:

Action  UpdatedBy   Identifier_ID   Identifier_ID2  FullName    ShortName   Parent_T1_ID    Parent_T1_ID2   Parent_T2_ID    Parent_T1_ID2
update  LUB         1234            5678            uncle       sam         NONE            0   

        

My setup:

spark.read()
.format("com.databricks.spark.xml")
.option("rootTag", "out:Message")
.load(filePath.toString())

And I have tried to have method to flat it out for all values, but no luck:

private static List<Column> flattenSchema(StructType schema, String prefix) {
    List<Column> out = new ArrayList<>();
    Arrays.stream(schema.fields()).forEach(f -> {
      String colName = prefix == null ? f.name() : prefix + "." + f.name();
      if (f.dataType() instanceof StructType) {
        flattenSchema(new StructType(new StructField[]{f}), colName);
      } else {
        out.add(new Column(colName));
      }
    });
    return out;
  }

What would the correct and easy way to proceed?

babboon
  • 683
  • 3
  • 20
  • 45

0 Answers0