1

I want to try parsing Excel XML Spreadsheet file with MSXML and XPath.

It has a root element of <Workbook xmlns.... xmlns....> and a bunch of next-level nodes <Worksheet ss:Name="xxxx">.

<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:x="urn:schemas-microsoft-com:office:excel"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">

....

 <Worksheet ss:Name="Карточка">

....

 </Worksheet>
 <Worksheet ss:Name="Баланс">
...
...
...
  </Worksheet>
</Workbook>

At a certain step I want to use XPath to get the very names of the worksheets.

NOTE: I do not want the get the names indirectly, that is to select those Worksheet nodes first and then enumerating them manually read their ss:Name child attribute nodes. That I can do, and it is not the topic here.

What I want is to utilize XPath flexibility: to directly fetch those ss:Name nodes without extra indirection layers.

procedure DoParseSheets( FileName: string );
var
  rd: IXMLDocument;
  ns: IDOMNodeList;
  n: IDOMNode;
  sel: IDOMNodeSelect;
  ms:  IXMLDOMDocument2;
  ms1: IXMLDOMDocument;
  i: integer;
  s: string;
begin
  rd := TXMLDocument.Create(nil);

  rd.LoadFromFile( FileName );

  if Supports(rd.DocumentElement.DOMNode,
     IDOMNodeSelect, sel) then
  begin
    ms1 := (rd.DOMDocument as TMSDOMDocument).MSDocument;
    if Supports( ms1, IXMLDOMDocument2, ms) then begin
       ms.setProperty('SelectionNamespaces',
            'xmlns="urn:schemas-microsoft-com:office:spreadsheet" '+
            'xmlns:o="urn:schemas-microsoft-com:office:office" '+
            'xmlns:x="urn:schemas-microsoft-com:office:excel" '+
            'xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"');
       ms.setProperty('SelectionLanguage', 'XPath');
    end;

//    ns := sel.selectNodes('/Workbook/Worksheet/@ss:Name/text()');
//    ns := sel.selectNodes('/Workbook/Worksheet/@Name/text()');
    ns := sel.selectNodes('/Workbook/Worksheet/@ss:Name');
//    ns := sel.selectNodes('/Workbook/Worksheet/@Name');
//    ns := sel.selectNodes('/Workbook/Worksheet');

    for i := 0 to ns.length - 1 do
    begin
      n := ns.item[i];
      s := n.nodeValue;
      ShowMessage(s);
    end;
  end;
end;

When I use the dumbed down '/Workbook/Worksheet' query MSXML correctly return the nodes. But as soon as I add the attribute to the query - MSXML returns empty set.

Other XPath implementations like XMLPad Pro or http://www.freeformatter.com/xpath-tester.html correctly return the list of ss:Name attribute nodes. But MSXML does not.

What would be the XPath query text to help MSXML return the attribute nodes with given names ?

UPD. @koblik suggested a link to MS.Net selector (not MSXML one) and there are two examples there https://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx

  • Example 1: book[@style] - All elements with style attributes, of the current context.
  • Example 2: book/@style - The style attribute for all elements of the current context.

That is the difference I told in the "NOTE" above: I don't need those books, I need the styles. I need attribute-nodes, not element-nodes! And that Example 2 syntax is what MSXML seems to fail at.

UPD.2: One tester shows an interesting error claim: The default (no prefix) Namespace URI for XPath queries is always '' and it cannot be redefined to 'urn:schemas-microsoft-com:office:spreadsheet' I wonder if that claim about no default namespaces in XPath is really part of standard or just MSXML implementation limitation. http://i.imgbox.com/gw9v28ax.png

Then if to delete the default NS the results are how they should be: Variant 1: Variant 2:

I wonder if that claim about no default namespaces in XPath is really part of standard or just MSXML implementation limitation.

UPD.3: Martin Honnen in comments explains that line: See w3.org/TR/xpath/#node-tests for XPath 1.0 (as supported by Microsoft MSXML), it clearly states "A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null". So in XPath 1.0 a path like "/Workbook/Worksheet" selects elements of that name in no namespace.

UPD.4: So the selection works with '/ss:Workbook/ss:Worksheet/@ss:Name' XPath query, returning "ss:Name" attributes nodes directy. In the source XML document both default (no-prefix) and "ss:" namespaces are bound to the same URI. This URI is acknowledged by the XPath engine. But not the default namespace, which can not be redefined in MSXML XPath engine ( implementing 1.0 specs ). So to make it work, the default namespace should be mapped to another explicit prefix ( either already existing one or a newly created ) via URI and then that substitute prefix would be used in the XPath selection string. Since namespaces matching goes via URI not via prefixes it would not matter if prefixes used in the document and in the query match or not, they would be compared via their URIs.

ms.setProperty('SelectionLanguage', 'XPath');
ms.setProperty('SelectionNamespaces',
   'xmlns:AnyPrefix="urn:schemas-microsoft-com:office:spreadsheet"');

and then

ns := sel.selectNodes( 
       '/AnyPrefix:Workbook/AnyPrefix:Worksheet/@AnyPrefix:Name' );

Thanks to Asbjørn and Martin Honnen for explaining those trivial after-the-fact but not obvious a priori relations.

Community
  • 1
  • 1
Arioch 'The
  • 15,799
  • 35
  • 62
  • Something like `*[@Name]` or `*[@ss:Name]`. see: https://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx – kobik Sep 26 '16 at 16:31
  • `[@ss:Name]` works fine for me, as @kobik mentions. – Ken White Sep 26 '16 at 16:31
  • What exactly does it return ? The list of `Worksheet` nodes (wrong!) or the list of `ss:Name` nodes (correct) ? That is the difference here. I made an update and repeated what I told above. – Arioch 'The Sep 26 '16 at 16:46
  • Perhaps if you provided a more useful XML sample and showed the results you're trying to obtain from that sample, it would make it easier to understand what you're after. I get a list of the nodes that have the `@ss:Name` attribute. – Ken White Sep 26 '16 at 17:02
  • I get the same result as @kobik and KenW. I must be missing your point in "or the list of ss:Name nodes (correct)" What could a list of ss:Name nodes contain, apart from repetitions of the literal text 'ss:Name'? – MartynA Sep 26 '16 at 17:11
  • 1
    Given the `SelectionNamespaces` you have set up and the `SelectionLanguage` as XPath, you need the path `ns := sel.selectNodes('/ss:Workbook/ss:Worksheet/@ss:Name');`. – Martin Honnen Sep 26 '16 at 17:35
  • @MartynA the values of those attributes of course. – Arioch 'The Sep 26 '16 at 19:06
  • @KenWhite you get the list of elements (tags). I can get the list of tags, but I want to skip extra step of traversing those elements to get their internal attribute nodes. I want to get list of attributes themselves, not of elements containing those attributes. PS. The elements are nodes, indeed. But attributes are nodes true. So "list of nodes" is a bit ambiguous here. I want list of nodes too, but I want the nodes that are attributes not tags. – Arioch 'The Sep 26 '16 at 19:36
  • 1
    See https://www.w3.org/TR/xpath/#node-tests for XPath 1.0 (as supported by Microsoft MSXML), it clearly states "A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null". So in XPath 1.0 a path like "/Workbook/Worksheet" selects elements of that name in no namespace. – Martin Honnen Sep 26 '16 at 20:58
  • I asked a similar question, but using the `TXMLDocument` (which if I recall, internally uses MSXML)... http://stackoverflow.com/questions/30687619/how-to-use-xpath-on-txmldocument-which-has-namespace-prefixes – Jerry Dodge Sep 26 '16 at 21:38
  • @JerryDodge I use TXMLDocument here too, as u can see in tags. I saw your Q and it seem tangibly similar but it is not the same question. My Q is about getting attribute-nodes rather than tag-nodes, not about namespaces themselves.. – Arioch 'The Sep 27 '16 at 08:20

1 Answers1

3

The issue is that MSXML doesn't support default namespaces when using XPath. To overcome this, you must give the default namespace an explicit prefix, and use this:

ms.setProperty('SelectionNamespaces',
  'xmlns:d="urn:schemas-microsoft-com:office:spreadsheet" '+
  'xmlns:o="urn:schemas-microsoft-com:office:office" '+
  'xmlns:x="urn:schemas-microsoft-com:office:excel" '+
  'xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"');

Note how I added the d prefix to the default namespace. Then you can do the selection like this:

ns := sel.selectNodes('/d:Workbook/d:Worksheet/@ss:Name');

The reason this works is that when parsing the XML data, MSXML associates the namespace to each node. At this stage it does handle the default namespace, so the Workbook elements get associated with the urn:schemas-microsoft-com:office:spreadsheet namespace.

However, note that it does not store the namespace prefixes! Thus you can use your own prefixes for the namespaces when you set SelectionNamespaces.

Now, when doing the XPath selection, if the nodes have a namespace you have to specify namespaces for all elements in the XPath, like my example above. And then you use your own prefixes which you set with SelectionNamespaces.

  • Why use two prefixes `d` and `ss` bound to the same namespace in a single XPath expression? You could simply use `/ss:Workbook/ss:Worksheet/@ss:Name`. – Martin Honnen Sep 26 '16 at 18:38
  • that would mean rehashing all the XML file, which might be large, and numerous. Alternatively there maybe would be way to delete default namespace... The online XPath tests claims this: *The default (no prefix) Namespace URI for XPath queries is always '' and it cannot be redefined to 'urn:schemas-microsoft-com:office:spreadsheet'* - I don't know if that is true specifications-wise, but if it is - then it is a significant oversite... – Arioch 'The Sep 26 '16 at 19:14
  • I may just remove the default namespace, but then there can be files with explicitly prefixed tags.... I'll check your explanation. And if it is correct seems MSXML XPath just is not suited for this... – Arioch 'The Sep 26 '16 at 19:16
  • @Arioch'The You don't have to change the XML file at all. You just need to tell MSXML to use a specific prefix for the default namespace, and use that namespace in your XPath queries. The two changes I presented in my answer are the only two changes I made to your code. At work we've made helper functions to add these prefixes, makes life a bit easier, I'll add the function to my answer. –  Sep 26 '16 at 19:35
  • @MartinHonnen I did that in an effort to make it more explicit that this is the only change that is needed. It's also how his files are structured, so I think it's best to keep it as close to the files as possible. –  Sep 26 '16 at 19:36
  • But how does it make "d" a default namespace ? I thought that default namespace is the one that has no prefix, isn't it so ? So it seems like you did not overridden the default one, but just introduced new non-default "d" namespace. Why in this specific example of yours would "d" be default and "ss" not? Or does any first namespace get treated default by MSXML XPath? Meaning that pulling "xmlns:ss" string at the first position would make it default too? – Arioch 'The Sep 26 '16 at 19:39
  • 1
    @Arioch'The Nothing makes it default, that's why you need to specify it everywhere where you would otherwise _not_ specify it (due to it being the default). –  Sep 26 '16 at 19:41
  • Expanded my answer, hopefully this clarifies things. And I didn't include our code because we only select nodes, so it doesn't handle your @ attribute thing. –  Sep 26 '16 at 19:57
  • I'll check it. I can get the elements nodes list, i'll see if that would allow to get list of attributes nodes. Thanks. – Arioch 'The Sep 26 '16 at 20:13
  • @Arioch'The You can get the attributes. I used your code and XML with my modifications and I get the two names. The only changes I made was I removed the "..." from the XML (of course) and the changes I mention in my post. –  Sep 26 '16 at 20:34