I'm trying to parse an XML document whose structure I don't know in advance. Someone suggested using XPath, but I have had no luck getting this to work.

Example of what I need (from the XML document below):

Return/ReturnData/IRS1095A{a77f40a2-af31-4404-a27d-4c1eaad730c2}/MonthlyPTCInformationGrpPP{69dc9dd5-5415-4ee4-a199-19b2dbb701be}/MonthlyPlanPremiumAmtPP=136

The callback should remember the hierarchy of START_ELEMENT events at each stage, using a stack to hold the element names and build an XPath-like key from them. Each key is then stored with the element's text as a key/value pair.

Code:

import java.io.File;
import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

// `logger` is a logger field defined elsewhere in the class.
public Map<String, String> p(File file) throws Exception {

    Map<String, String> map = new HashMap<String, String>();
    XMLStreamReader xr = XMLInputFactory.newInstance().createXMLStreamReader(new FileInputStream(file));

    String name = "", value = "", attrName = "";

    while (xr.hasNext()) {
        int e = xr.next();

        switch (e) {
            case XMLStreamReader.START_ELEMENT: {
                name = xr.getLocalName();
                final int attributeCount = xr.getAttributeCount();

                // Only the first attribute is inspected, and it is only logged.
                if (attributeCount > 0) {
                    attrName = xr.getAttributeName(0).getLocalPart();
                    final String attributeValue = xr.getAttributeValue(0);
                    logger.debug("Key: " + name + " AttributeName: " + attrName + " Attribute value: " + attributeValue);
                }
                break;
            }

            case XMLStreamReader.CHARACTERS:
                value = xr.getText();
                break;
        }

        // Reuse the map declared at the top; redeclaring it here does not compile.
        // The key is the local element name only, so repeated elements overwrite each other.
        map.put(name, value);

        logger.debug("This is Map: " + map);
    }
    return map;
}
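
For comparison, the stack idea described above could be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the class name `XmlPathFlattener` and the method `flatten` are made up for illustration, and it assumes that the `uuid` attribute is the only thing that distinguishes same-named elements. The resulting keys look like paths but are not valid XPath, since `{}` is not XPath syntax.

```java
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: the class and method names are illustrative only.
class XmlPathFlattener {

    /** Maps a path-like key to the text of every element that directly contains non-blank text. */
    static Map<String, String> flatten(Reader in) throws XMLStreamException {
        Map<String, String> result = new LinkedHashMap<>();
        Deque<String> path = new ArrayDeque<>();         // one segment per currently open element
        Deque<StringBuilder> text = new ArrayDeque<>();  // text collected for each open element

        XMLStreamReader xr = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (xr.hasNext()) {
                switch (xr.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        String segment = xr.getLocalName();
                        // Assumption: a "uuid" attribute, when present, identifies the element.
                        String uuid = xr.getAttributeValue(null, "uuid");
                        if (uuid != null) {
                            segment += "{" + uuid + "}";
                        }
                        path.addLast(segment);
                        text.addLast(new StringBuilder());
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        // Text may arrive in several chunks, so append rather than overwrite.
                        if (!text.isEmpty()) {
                            text.peekLast().append(xr.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT: {
                        String value = text.removeLast().toString().trim();
                        if (!value.isEmpty()) {
                            // The deque iterates from the root to the current element.
                            result.put(String.join("/", path), value);
                        }
                        path.removeLast();
                        break;
                    }
                    default:
                        break;
                }
            }
        } finally {
            xr.close();
        }
        return result;
    }
}
```

Called as `XmlPathFlattener.flatten(new FileReader(file))`, this would produce one entry per leaf element. Elements that share the same path and have no `uuid` on any ancestor would still overwrite each other in the map, so a positional index per segment might be needed for arbitrary documents.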

XML Document:

<Return xmlns="http://www.irs.gov/efile">
  <ReturnData>
    <IRS1095A uuid="a77f40a2-af31-4404-a27d-4c1eaad730c2">
      <MonthlyPTCInformationGrpPP uuid="69dc9dd5-5415-4ee4-a199-19b2dbb701be">
        <MonthlyPlanPremiumAmtPP>136</MonthlyPlanPremiumAmtPP>
        <MonthlyAdvancedPTCAmtPP>125</MonthlyAdvancedPTCAmtPP>
        <MonthCdPP>SEPTEMBER</MonthCdPP>
        <MonthlyPremiumSLCSPAmtPP>250</MonthlyPremiumSLCSPAmtPP>
      </MonthlyPTCInformationGrpPP>
      <MonthlyPTCInformationGrpPP uuid="8495fa61-0e7c-45e3-8f07-9765f4ef2fc3">
        <MonthCdPP>OCTOBER</MonthCdPP>
        <MonthlyPremiumSLCSPAmtPP>250</MonthlyPremiumSLCSPAmtPP>
        <MonthlyAdvancedPTCAmtPP>125</MonthlyAdvancedPTCAmtPP>
        <MonthlyPlanPremiumAmtPP>136</MonthlyPlanPremiumAmtPP>
      </MonthlyPTCInformationGrpPP>
      <MonthlyPTCInformationGrpPP uuid="7de1052f-6107-41da-aea4-e4495018fc80">
        <MonthlyPlanPremiumAmtPP>136</MonthlyPlanPremiumAmtPP>
        <MonthlyAdvancedPTCAmtPP>125</MonthlyAdvancedPTCAmtPP>
        <MonthlyPremiumSLCSPAmtPP>250</MonthlyPremiumSLCSPAmtPP>
        <MonthCdPP>APRIL</MonthCdPP>
      </MonthlyPTCInformationGrpPP>
      <MonthlyPTCInformationGrpPP uuid="634d5af9-51fb-42ee-a90d-5a4f421e6854">
        <MonthlyPlanPremiumAmtPP>136</MonthlyPlanPremiumAmtPP>
        <MonthCdPP>JUNE</MonthCdPP>
        <MonthlyPremiumSLCSPAmtPP>250</MonthlyPremiumSLCSPAmtPP>
        <MonthlyAdvancedPTCAmtPP>125</MonthlyAdvancedPTCAmtPP>
      </MonthlyPTCInformationGrpPP>
      <MonthlyPTCInformationGrpPP uuid="a2f7de3f-650c-4a5e-b26c-30cfd7782d6c">
        <MonthCdPP>MAY</MonthCdPP>
        <MonthlyPlanPremiumAmtPP>136</MonthlyPlanPremiumAmtPP>
        <MonthlyPremiumSLCSPAmtPP>250</MonthlyPremiumSLCSPAmtPP>
        <MonthlyAdvancedPTCAmtPP>125</MonthlyAdvancedPTCAmtPP>
      </MonthlyPTCInformationGrpPP>
      <MonthlyPTCInformationGrpPP uuid="a77f40a2-af31-4404-a27d-4c1eaad730c2">
        <MonthlyPlanPremiumAmtPP>136</MonthlyPlanPremiumAmtPP>
        <MonthlyPremiumSLCSPAmtPP>250</MonthlyPremiumSLCSPAmtPP>
        <MonthlyAdvancedPTCAmtPP>125</MonthlyAdvancedPTCAmtPP>
        <MonthCdPP>JANUARY</MonthCdPP>
      </MonthlyPTCInformationGrpPP>
      <MonthlyPTCInformationGrpPP uuid="01650aee-9d5d-4ce1-9079-ebedea3bf416">
        <MonthlyPlanPremiumAmtPP>136</MonthlyPlanPremiumAmtPP>
        <MonthlyAdvancedPTCAmtPP>125</MonthlyAdvancedPTCAmtPP>
        <MonthlyPremiumSLCSPAmtPP>250</MonthlyPremiumSLCSPAmtPP>
        <MonthCdPP>MARCH</MonthCdPP>
      </MonthlyPTCInformationGrpPP>
      <MonthlyPTCInformationGrpPP uuid="581ba189-222d-4999-aa1a-3b290666ef5f">
        <MonthlyPremiumSLCSPAmtPP>250</MonthlyPremiumSLCSPAmtPP>
        <MonthCdPP>AUGUST</MonthCdPP>
        <MonthlyPlanPremiumAmtPP>136</MonthlyPlanPremiumAmtPP>
        <MonthlyAdvancedPTCAmtPP>125</MonthlyAdvancedPTCAmtPP>
      </MonthlyPTCInformationGrpPP>
      <TotalPremiumSLCSPAmtPP>3000</TotalPremiumSLCSPAmtPP>
      <MonthlyPTCInformationGrpPP uuid="549ff57a-58dc-4365-b05c-e3e520b3e8cb">
        <MonthlyPlanPremiumAmtPP>136</MonthlyPlanPremiumAmtPP>
        <MonthlyAdvancedPTCAmtPP>125</MonthlyAdvancedPTCAmtPP>
        <MonthlyPremiumSLCSPAmtPP>250</MonthlyPremiumSLCSPAmtPP>
        <MonthCdPP>DECEMBER</MonthCdPP>
      </MonthlyPTCInformationGrpPP>
      <MonthlyPTCInformationGrpPP uuid="195836cf-32b3-4316-99d4-6b1eab31e16d">
        <MonthlyPlanPremiumAmtPP>136</MonthlyPlanPremiumAmtPP>
        <MonthCdPP>JULY</MonthCdPP>
        <MonthlyAdvancedPTCAmtPP>125</MonthlyAdvancedPTCAmtPP>
        <MonthlyPremiumSLCSPAmtPP>250</MonthlyPremiumSLCSPAmtPP>
      </MonthlyPTCInformationGrpPP>
      <MonthlyPTCInformationGrpPP uuid="c1289d91-7ce1-41ee-9c8a-f72212e82752">
        <MonthlyPlanPremiumAmtPP>136</MonthlyPlanPremiumAmtPP>
        <MonthlyAdvancedPTCAmtPP>125</MonthlyAdvancedPTCAmtPP>
        <MonthCdPP>FEBRUARY</MonthCdPP>
        <MonthlyPremiumSLCSPAmtPP>250</MonthlyPremiumSLCSPAmtPP>
      </MonthlyPTCInformationGrpPP>
      <TotalAdvancedPTCAmtPP>1500</TotalAdvancedPTCAmtPP>
      <RecipientSSNPP>555-11-2222</RecipientSSNPP>
      <MonthlyPTCInformationGrpPP uuid="50876222-165d-442a-81e0-0b05dc3c30fb">
        <MonthlyAdvancedPTCAmtPP>125</MonthlyAdvancedPTCAmtPP>
        <MonthlyPlanPremiumAmtPP>136</MonthlyPlanPremiumAmtPP>
        <MonthCdPP>NOVEMBER</MonthCdPP>
        <MonthlyPremiumSLCSPAmtPP>250</MonthlyPremiumSLCSPAmtPP>
      </MonthlyPTCInformationGrpPP>
      <TotalPlanPremiumAmtPP>1632</TotalPlanPremiumAmtPP>
    </IRS1095A>
    <IRS1040>
      <IndividualReturnFilingStatusCd>1</IndividualReturnFilingStatusCd>
      <WagesSalariesAndTipsAmt>22000</WagesSalariesAndTipsAmt>
      <TotalExemptionsCnt>1</TotalExemptionsCnt>
      <AdjustedGrossIncomeAmt>22000</AdjustedGrossIncomeAmt>
    </IRS1040>
  </ReturnData>
  <ReturnHeader>
    <SelfSelectPINGrp>
      <PrimaryBirthDt>1970-01-01</PrimaryBirthDt>
    </SelfSelectPINGrp>
    <Filer>
      <PrimarySSN>555-11-2222</PrimarySSN>
      <PrimaryResidentStatesInfoGrpPP>
        <ResidentStateInfoPP uuid="a77f40a2-af31-4404-a27d-4c1eaad730c2">
          <ResidentStateAbbreviationCdPP>CA</ResidentStateAbbreviationCdPP>
        </ResidentStateInfoPP>
      </PrimaryResidentStatesInfoGrpPP>
    </Filer>
  </ReturnHeader>
</Return>
mosawi
  • Huh? XPath is a query expression language to extract data from XML. How can you "parse an XML document whose structure I don't know"? – OldProgrammer Aug 19 '15 at 16:45
  • Parse it in what way? Why don't you know the structure? What do you need to get out of the document? – Dave Newton Aug 19 '15 at 16:50
  • Refer to this: http://stackoverflow.com/questions/2811001/how-to-read-xml-using-xpath-in-java – Akshay jain Aug 19 '15 at 17:00
  • It seems to me like you're trying to generate something that resembles an XPath (`{}` would be invalid) based on the input XML. It also seems that there would be multiple XPath-like strings generated, one for every element that contains text. Is any of this true? – Daniel Haley Aug 19 '15 at 17:02
  • @DaveNewton Parse using a method like StAX, as I am using currently. I need to get each element and its associated value and put them into a key/value pair Map. I don't know the structure in advance, so it could be any XML document, and I can't hardcode node elements such as getElementsByTagName("foo") because I won't know the tag name. – mosawi Aug 19 '15 at 17:13
  • @OldProgrammer Using StAX I am currently parsing the document fine without knowing the structure (when I say structure I mean the names of the tag elements). This way I should be able to throw in any XML document and the output should give me /Return/ReturnData/etc/etc=136 – mosawi Aug 19 '15 at 17:14
  • @akshayjain The method you linked to would require knowing the structure in advance. – mosawi Aug 19 '15 at 17:15
  • @DanielHaley Yes and no. Elements with the same name would be distinguished by their UUID in the XPath, as seen in the example. – mosawi Aug 19 '15 at 17:16
  • OK, but what you are doing has nothing to do with XPath per se. – OldProgrammer Aug 19 '15 at 17:17
  • Yes, that's what I'd like to do. Hence my question. Currently it only gives me the element name and value, for example MonthlyPremiumSLCSPAmtPP=125. – mosawi Aug 19 '15 at 17:18
