3

So, I am working on a project that scrapes and collects data from many different sources around the internet with many different methods depending on each source's characteristics.

The most recent addition is a web API call which returns the following XML as a response:

<?xml version="1.0"?>
<Publication_MarketDocument xmlns="urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0">
    <mRID>29b526a69b9445a7bb507ba446e3e8f9</mRID>
    <revisionNumber>1</revisionNumber>
    <type>A44</type>
    <sender_MarketParticipant.mRID codingScheme="A01">10X1001A1001A450</sender_MarketParticipant.mRID>
    <sender_MarketParticipant.marketRole.type>A32</sender_MarketParticipant.marketRole.type>
    <receiver_MarketParticipant.mRID codingScheme="A01">10X1001A1001A450</receiver_MarketParticipant.mRID>
    <receiver_MarketParticipant.marketRole.type>A33</receiver_MarketParticipant.marketRole.type>
    <createdDateTime>2019-09-19T11:28:51Z</createdDateTime>
    <period.timeInterval>
        <start>2019-09-18T22:00Z</start>
        <end>2019-09-19T22:00Z</end>
    </period.timeInterval>
    <TimeSeries>
        <mRID>1</mRID>
        <businessType>A62</businessType>
        <in_Domain.mRID codingScheme="A01">10YCS-SERBIATSOV</in_Domain.mRID>
        <out_Domain.mRID codingScheme="A01">10YCS-SERBIATSOV</out_Domain.mRID>
        <currency_Unit.name>EUR</currency_Unit.name>
        <price_Measure_Unit.name>MWH</price_Measure_Unit.name>
        <curveType>A01</curveType>
        <Period>
            <timeInterval>
                <start>2019-09-18T22:00Z</start>
                <end>2019-09-19T22:00Z</end>
            </timeInterval>
            <resolution>PT60M</resolution>
            <Point>
                <position>1</position>
                <price.amount>44.08</price.amount>
            </Point>
            <Point>
                <position>2</position>
                <price.amount>37.14</price.amount>
            </Point>
            <Point>
                <position>3</position>
                <price.amount>32.21</price.amount>
            </Point>
            <Point>
                <position>4</position>
                <price.amount>31.44</price.amount>
            </Point>
            <Point>
                <position>5</position>
                <price.amount>32.48</price.amount>
            </Point>
            <Point>
                <position>6</position>
                <price.amount>45.52</price.amount>
            </Point>
            <Point>
                <position>7</position>
                <price.amount>56.05</price.amount>
            </Point>
            <Point>
                <position>8</position>
                <price.amount>74.96</price.amount>
            </Point>
            <Point>
                <position>9</position>
                <price.amount>74.08</price.amount>
            </Point>
            <Point>
                <position>10</position>
                <price.amount>69.03</price.amount>
            </Point>
            <Point>
                <position>11</position>
                <price.amount>72.89</price.amount>
            </Point>
            <Point>
                <position>12</position>
                <price.amount>68.91</price.amount>
            </Point>
            <Point>
                <position>13</position>
                <price.amount>74.95</price.amount>
            </Point>
            <Point>
                <position>14</position>
                <price.amount>72.91</price.amount>
            </Point>
            <Point>
                <position>15</position>
                <price.amount>75.97</price.amount>
            </Point>
            <Point>
                <position>16</position>
                <price.amount>76.49</price.amount>
            </Point>
            <Point>
                <position>17</position>
                <price.amount>59.08</price.amount>
            </Point>
            <Point>
                <position>18</position>
                <price.amount>60.19</price.amount>
            </Point>
            <Point>
                <position>19</position>
                <price.amount>64.69</price.amount>
            </Point>
            <Point>
                <position>20</position>
                <price.amount>69.18</price.amount>
            </Point>
            <Point>
                <position>21</position>
                <price.amount>64.97</price.amount>
            </Point>
            <Point>
                <position>22</position>
                <price.amount>63.38</price.amount>
            </Point>
            <Point>
                <position>23</position>
                <price.amount>52.92</price.amount>
            </Point>
            <Point>
                <position>24</position>
                <price.amount>48.08</price.amount>
            </Point>
        </Period>
    </TimeSeries>
</Publication_MarketDocument> 

Having dealt successfully with situations like that using Microsoft XML, v6.0 I tried the following:

Dim respXML As New MSXML2.DOMDocument60
respXML.LoadXML (ThisWorkbook.Worksheets("Sheet2").Range("A1")) 'for the sake of the post's simplicity I'm loading the xml from excel
Debug.Print respXML.getElementsByTagName("price.amount").Length

This should be returning 24 but instead it returns 0. Indeed the following:

Debug.Print respXML.getElementsByTagName("price.amount")(1) Is Nothing

returns True, which means that the <price.amount></price.amount> elements are not being found. However, Debug.Print respXML.XML yields the expected results.

I read somewhere that early binding could be causing problems so I tried the following as well:

Dim respXML As Object
Set respXML = CreateObject("MSXML2.DOMDocument.6.0")
respXML.LoadXML (ThisWorkbook.Worksheets("Sheet2").Range("A1"))
Debug.Print respXML.getElementsByTagName("price.amount").Length
Debug.Print respXML.getElementsByTagName("price.amount")(1) Is Nothing

Still the results are the same.

Switching to Microsoft XML, v3.0 resolves the issue completely.

However, I would prefer sticking to v6.0 since it's the one that is more actively being maintained and supported.

Why does this happen? Does it have to do with the XML itself? Does it have to do with my code? Am I missing something? Is there a way to make it work with Microsoft XML, v6.0?

Any input would be appreciated.

Stavros Jon
  • 1,695
  • 2
  • 7
  • 17

2 Answers2

4

To extend @CindyMeister's answer, the issue does appear to be namespace handling between the MSXML versions using getElementsByTagName(). Specifically, your XML maintains an xmlns attribute without colon identified prefix which requires DOM libraries to assign a prefix when parsing content:

<Publication_MarketDocument xmlns="urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0" ...

However, using SelectionNamespaces + SelectNodes to define a temporary alias, such as doc, to default namespace prefix, both libraries print out expected results. And MS docs even advises the latter method (emphasis added):

The getElementsByTagName method simulates the matching of the provided argument against the result of the tagName property of IXMLDOMElement. When executed, it does not recognize or support namespaces. Instead, you should use the selectNodes method, which is faster in some cases and can support more complex searches.

MXSML v3.0 (prints unexpected getElementsByTagName result)

Sub ParseXMLv3()
    Dim respXML As New MSXML2.DOMDocument30

    respXML.Load "C:\Path\To\Input.xml"
    respXML.setProperty "SelectionLanguage", "XPath"
    respXML.setProperty "SelectionNamespaces", "xmlns:doc='urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0'"

    Debug.Print respXML.SelectNodes("//doc:price.amount").Length       ' PRINTS 24
    Debug.Print respXML.SelectNodes("//price.amount").Length           ' PRINTS 0
    Debug.Print respXML.getElementsByTagName("price.amount").Length    ' PRINTS 24

    Set respXML = Nothing
End Sub

MSXML v6.0

Sub ParseXMLv6()
    Dim respXML As New MSXML2.DOMDocument60

    respXML.Load "C:\Path\To\Input.xml"
    respXML.setProperty "SelectionLanguage", "XPath"
    respXML.setProperty "SelectionNamespaces", "xmlns:doc='urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0'"

    Debug.Print respXML.SelectNodes("//doc:price.amount").Length       ' PRINTS 24
    Debug.Print respXML.SelectNodes("//price.amount").Length           ' PRINTS 0
    Debug.Print respXML.getElementsByTagName("price.amount").Length    ' PRINTS 0

    Set respXML = Nothing
End Sub
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • 1
    It's a pure pleasure to read such clearly structured and helpful answers +1). BTW Personally I'd prefer MSXML2 vers 6.0, as vers 3.0 wouldn't be supported automatically on many newer systems. – T.M. Sep 20 '19 at 18:44
  • Ok so apparently this is well documented, it's just me who's clueless. Can you please briefly explain to me what setting the `SelectionNamespaces` property really achieves? I understand that in the initial XML there should have been a prefix included in the `xmlns` attribute. So you basically add a user-defined one (doc)? What if there were multiple namespaces without prefixes? – Stavros Jon Sep 20 '19 at 19:51
  • 1
    The default namespace is a general XML situation and goes well beyond MSXML but includes all other compliant DOM libraries for Java, .NET, PHP, Python, etc. If parsing this exact XML, you will have to run the same process. There is nothing wrong with your XML. It is perfectly well-formed to have a default namespace (i.e., no prefix). However, there can only be one default namespace at a given node level. Adding a default at child levels requires multiple `SelectionNamespaces` where you use the user-defined one closest to your parsed need (i.e., *price.amount*) – Parfait Sep 20 '19 at 20:26
  • @Parfait hm...And how would I avoid hardcoding the `xmlns` attribute value? – Stavros Jon Sep 20 '19 at 20:59
  • 1
    Hmmmm...it is only one line to add `respXML.setProperty "SelectionNamespaces", ...`. – Parfait Sep 20 '19 at 21:28
  • @Parfait I was referring to something like `respXML.setProperty "SelectionNamespaces", "xmlns:doc='" & respXML.namespaces(0) & "'"`. Anyway, thanks again! :) – Stavros Jon Sep 23 '19 at 09:18
3

A quick test, here, indicates that none of the nodes/elements are picked up using DOMDocument60.

I am successful using DOMDocument30, still using the MSXML6 parser. So that could be a workaround for you:

'Using the MSXML6 parser, it's still possible to use what worked in older versions
Dim respXML As Msxml2.DOMDocument30
Set respXML = CreateObject("MSXML2.DOMDocument.3.0")

Research on the Internet turns up two useful links, one on MSDN, the other on VB forums.

The first basically says that security properties were added in MSXML6, meaning some things that worked in MSXML2 no longer will in the newer version. These are documented on Microsoft^s site.

I don't know which one it is (if any of these, but closest appears to be the SelectionNamespace property) but another change appears to be how the parser handles "anonymous" namespaces (VB Forums link). If a namespace is declared in a top-level element, with no prefix, then it's not applied to any child elements - so they aren't "seen".

As the XML code in the question contains a namespace with no prefix, this appears to be the issue. If declaring DOMDocument30 won't work for you, and SelectionNamespace doesn't help, then I think the only recourse would be to change/transform the XML to add a prefix for the namespace and to all the elements.

Cindy Meister
  • 25,071
  • 21
  • 34
  • 43
  • Ok so apparently this is well documented, it's just me who's clueless https://support.microsoft.com/en-in/help/288147/how-to-use-xpath-to-query-against-a-user-defined-default-namespace Anyway, thank you for being the 1st to point me to the right direction +1 – Stavros Jon Sep 20 '19 at 19:34
  • BTW using a reference to MSXML v6.0 and trying to use `Dim respXML As Msxml2.DOMDocument30` will give me the classic `User defined type not defined` error. – Stavros Jon Sep 20 '19 at 19:38
  • @StavrosJon That's very odd, because I did test it before posting and the code in the question is copy/pasted from my test. I'll have another look at it tomorrow to see if I have any "stray" references in my test project. – Cindy Meister Sep 20 '19 at 19:59
  • @StavrosJon It works for me, as posted. I have a reference only to "Microsoft XML, v6.0" and no other MSXML parser library. Don't know why you'd be seeing that error and I would not... – Cindy Meister Sep 23 '19 at 11:58
  • It's the early binding that doesn't work. Maybe what you meant was `Dim respXML As object` – Stavros Jon Sep 23 '19 at 12:16
  • @StavrosJon No, it works for me just fine. I have an active reference to that library. – Cindy Meister Sep 23 '19 at 12:21
  • With a reference only to MSXML v6.0 (and not to MSXML v3.0), what you suggested only works with late binding ``\_(ツ)_/`` – Stavros Jon Sep 23 '19 at 12:28
  • @StavrosJon Except it does for me. No idea why it would be different for you... In any case: I think it would be important to include the link you found in one of the answers. Or you post a third answer with the content of that article. Once we know how that will be incorporated I'll add a "note" to my answer about early-binding not working under all circumstances. – Cindy Meister Sep 23 '19 at 13:05