
This is driving me crazy. I'm working on a project that requires me to parse XML documents (UBL format).

These documents might have one or more attachments in their body (base64 encoded), and it's my job to get them out. I also need to extract some other fields (some of which work). For brevity I'll rename the namespaces to "A:", "B:", etc.

Example XML (heavily simplified):

<Invoice>
    <A:ID>*someID*</A:ID>
    <B:AdditionalDocumentReference>
        <A:ID>attachmentID</A:ID>
        <B:OrderReference>
            <A:ID>16009896</A:ID>
        </B:OrderReference>
        <A:DocumentType>PDF</A:DocumentType>
        <B:Attachment>
            <A:EmbeddedDocumentBinaryObject mimeCode="application/pdf">
                *base64 encoded string*
            </A:EmbeddedDocumentBinaryObject>
        </B:Attachment>
    </B:AdditionalDocumentReference>
</Invoice>

Problem 1: I can't assume that the root element will be named "Invoice".

To retrieve the attachments I use:

XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList)xPath.evaluate("//AdditionalDocumentReference",
                    doc.getDocumentElement(), XPathConstants.NODESET);

This returns nothing. I have also tried:

.//AdditionalDocumentReference

And

//B:AdditionalDocumentReference

Neither of these works. The only thing that works is:

//Invoice/AdditionalDocumentReference

But as I said above, the root element might be named differently, so that's not really an option.

I have the same issue with getting the ID from my document. I thought that the easiest way would be to use:

//ID

I know the first occurrence of that ID tag is the ID of the document, but this also returns nothing. It only works if I use:

//Invoice/ID

Now for the really weird part. See that order reference tag? I use this one-liner:

xPath.evaluate("//OrderReference/ID", document)

and it works...

What am I doing wrong?

SomeDude
killernerd
  • It looks like your xpath processor does not support the ``//`` operator, I don't know why though. If you want to allow for any root tag name, you could use ``/*/ID``. – f1sh Feb 13 '17 at 15:13
  • This also happened to me once with an XML API response. I assume it could be because of a corrupt XML structure outside the `Invoice` tag. I would go with @f1sh's recommendation – eLRuLL Feb 13 '17 at 15:20
  • Two things to check. First, are you providing your XPath with a `NamespaceContext` that resolves the prefixes to the right namespace URIs? And second, try passing your `doc` object to the evaluate method instead of using `doc.getDocumentElement()`, see how that behaves. – G_H Feb 13 '17 at 15:31
  • @kjhughes This is not a duplicate. If you read the question carefully you'll notice that the namespaces, while they might be related to the problem, are not the issue being asked about. The last example provided works properly so there's some problem with the `//` expression evaluation. – G_H Feb 13 '17 at 16:14
  • @G_H: Really, there's no [mcve] presented, so if you'd prefer, vote to close on those grounds instead. The XML presented is [***not namespace-well-formed***](http://stackoverflow.com/a/25830482/290085), so we cannot trust any subsequent reports of what's "working," especially the last XPath, which could only work for elements in no namespace. This is a namespace problem that cannot be fully diagnosed because of the lack of an MCVE. At least with the duplicate link, OP can see what's needed to solve the problem: Fix the namespace declaration problems in both the XML and the XPath library. – kjhughes Feb 13 '17 at 16:20
  • @kjhughes But now the question is closed and the asker can't correct it to create a well-formed XML, and if this was purely a namespace problem then his last example shouldn't return **anything**, yet it does. But note how there he provided the document rather than using getDocumentElement(), which as I suggested might in fact be the problem. And if it is, that would make this question **not a duplicate** and could make it of use to others encountering the problem. But we can't find out if questions get closed faster than someone can edit them. – G_H Feb 13 '17 at 16:35
  • @G_H: Your statement that asker cannot edit a closed question is false, and the rest of your comment indicates that you've either failed to understand or chosen to ignore my previous response to you. If you truly feel that this is a problem with `//` apart from namespaces, I challenge you to create such an [mcve] and post your own question. Good luck. – kjhughes Feb 13 '17 at 16:58
  • @kjhughes I didn't know about editing closed questions, my apologies. As for creating such an example, I managed to do so with minimal effort by writing a test class and just binding the `A` and `B` prefixes in the asker's XML. I can definitely replicate his results. The solution requires two steps however: setting a namespace context **and** making sure the `DocumentBuilderFactory` used is set to be namespace aware. While this can be found in the linked answer, the inconsistent results with doing so or not make it hardly intuitive. – G_H Feb 13 '17 at 17:23
  • @G_H: You've put a lot of time into this. I would be happy to reopen if you would like to write up an answer that's more catered to this problem than the generic XPath namespaces duplicate link. Thanks for all your effort. – kjhughes Feb 13 '17 at 17:38
  • @kjhughes It's a bit of a borderline case, since the answer is to be found in the provided link, but not directly (it's for example not specific to Java) and it would be easy for the OP to miss the finer points. The Java DOM and XPath implementations also seem to have some quirks that lead to very odd results which would distract from the core issue. I'll leave it up to your discretion. If you re-open I'll answer, but I think OP knows enough by now which is what I believe is most important. – G_H Feb 13 '17 at 17:46
  • @kjhughes What threw me off was that my last query works and a "direct" XPath (`//Invoice/...`) also works, which makes it hard to diagnose what the actual problem is. If it really were a namespace issue, I'd expect that those two wouldn't work either. The XML has been validated, so I know it's well-formed and valid. I could post it here, but it's quite a large document and I'm not sure what good it would do. But I'll check the namespaces solution just to be sure. – killernerd Feb 14 '17 at 08:06
  • @G_H I did try passing in the document object itself, same result. I'll also check the namespaces solution just to make sure. – killernerd Feb 14 '17 at 08:07
  • @killernerd I only got the consistent and correct results by setting the DocumentBuilderFactory to be namespace aware before creating the DocumentBuilder with it, and implementing a `javax.xml.namespace.NamespaceContext` and setting it on the XPath instance (and of course then using the right bound prefixes in the expression). Without those steps the behaviour seems to be unpredictable, or at least inconsistent. – G_H Feb 14 '17 at 09:02
  • @G_H I find this pretty strange, I just want all nodes with a certain name no matter what their namespace might be... In any case, I did find a workaround by using `document.getFirstChild().getNodeName()` and using that value in my XPath. That, together with your suggestion of passing in the doc object directly, seems to work. I'm hesitant to make it the accepted answer though, because it's a bit of a hack, but I can't spend more time on it... (deadlines are a PITA sometimes) – killernerd Feb 17 '17 at 12:36
  • @killernerd If you want to ignore namespaces you can use XPath segments of this form: `//*[local-name()='AdditionalDocumentReference']` This example will select all elements which have a local name AdditionalDocumentReference, regardless of namespace. – G_H Feb 17 '17 at 12:49
  • @G_H It's still weird that some XPaths did return values and some didn't. But in the end I guess this really was a namespace issue and thus a duplicate question. So, uhm, post that last comment as an answer and I'll mark it as such, or lock this question/mark it as a duplicate. Thanks for the help. – killernerd Feb 20 '17 at 07:29
  • @killernerd Just dupe, I guess. I'm a bit too busy at work to formulate it into a full answer :D If someone comes in with the same question they can follow the link and work it out from there along with the comments here. – G_H Feb 20 '17 at 09:51
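
For anyone landing here later, the two approaches from the comments above can be sketched roughly as follows: a namespace-aware `DocumentBuilderFactory` combined with a `NamespaceContext` for prefixed expressions, and `local-name()` for expressions that ignore namespaces. This is only a minimal sketch, not the asker's actual code. The class name `UblXPathSketch`, the namespace URIs `urn:example:a` / `urn:example:b`, and the sample document are assumptions made up for illustration; a real UBL document declares its own URIs, and the prefixes bound in the `NamespaceContext` only need to match those URIs, not the prefixes used in the file.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UblXPathSketch {
    // Placeholder namespace URIs (assumptions); use the URIs your documents declare.
    static final String NS_A = "urn:example:a";
    static final String NS_B = "urn:example:b";

    // Simplified sample modelled on the question; "SGVsbG8=" is base64 for "Hello".
    static final String SAMPLE =
        "<Invoice xmlns:A='" + NS_A + "' xmlns:B='" + NS_B + "'>"
        + "<A:ID>someID</A:ID>"
        + "<B:AdditionalDocumentReference>"
        + "<A:ID>attachmentID</A:ID>"
        + "<B:Attachment>"
        + "<A:EmbeddedDocumentBinaryObject mimeCode='application/pdf'> SGVsbG8= "
        + "</A:EmbeddedDocumentBinaryObject>"
        + "</B:Attachment>"
        + "</B:AdditionalDocumentReference>"
        + "</Invoice>";

    // Maps the prefixes used in XPath expressions to namespace URIs.
    static final NamespaceContext CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            if ("A".equals(prefix)) return NS_A;
            if ("B".equals(prefix)) return NS_B;
            return XMLConstants.NULL_NS_URI;
        }
        // Reverse lookups are not needed for evaluating the expressions below.
        @Override public String getPrefix(String uri) { return null; }
        @Override public Iterator<String> getPrefixes(String uri) {
            return Collections.emptyIterator();
        }
    };

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // the factory is NOT namespace aware by default
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Approach 1: prefixed expression, resolved through the NamespaceContext.
    static int countAttachmentRefs(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(CONTEXT);
            NodeList nodes = (NodeList) xpath.evaluate(
                "//B:AdditionalDocumentReference", doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Approach 2: ignore namespaces (and the root element's name) via local-name().
    static String documentId(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("/*/*[local-name()='ID']", doc);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Decodes the first embedded attachment; the MIME decoder tolerates whitespace.
    static byte[] firstAttachment(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            String encoded = xpath.evaluate(
                "//*[local-name()='EmbeddedDocumentBinaryObject']", doc);
            return Base64.getMimeDecoder().decode(encoded.trim());
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println(countAttachmentRefs(doc));
        System.out.println(documentId(doc));
        System.out.println(new String(firstAttachment(doc), StandardCharsets.UTF_8));
    }
}
```

The `/*/*[local-name()='ID']` form selects the `ID` element directly under the root, whatever the root is called, which covers "Problem 1" without reading the root's name first. Whether the prefixed or the `local-name()` style fits better probably depends on whether elements with the same local name in different namespaces need to be told apart.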

0 Answers