I must be missing something incredibly obvious, and I have finally given up trying to figure out what is wrong. I'm trying to search a simple piece of XML to find all of the `<Parent>` nodes. I'm using R 3.2.2 and the XML package. Here's the code with the example XML:

library(XML)

example_xml <- paste(
  '<?xml version="1.0"?>',
    '<GetProductCategoriesForASINResponse xmlns="http://mws.amazonservices.com/schema/Products/2011-10-01">',
      '<GetProductCategoriesForASINResult>',
        '<Self>',
          '<ProductCategoryId>11056341</ProductCategoryId>',
          '<ProductCategoryName>Chicken</ProductCategoryName>',
          '<Parent>',
            '<ProductCategoryId>11056281</ProductCategoryId>',
            '<ProductCategoryName>Dog</ProductCategoryName>',
            '<Parent>',
              '<ProductCategoryId>11055991</ProductCategoryId>',
              '<ProductCategoryName>Monkey</ProductCategoryName>',
              '<Parent>',
                '<ProductCategoryId>11055981</ProductCategoryId>',
                '<ProductCategoryName>Frog</ProductCategoryName>',
                '<Parent>',
                  '<ProductCategoryId>3760911</ProductCategoryId>',
                  '<ProductCategoryName>Iguana</ProductCategoryName>',
                '</Parent>',
              '</Parent>',
            '</Parent>',
          '</Parent>',
        '</Self>',
      '</GetProductCategoriesForASINResult>',
    '<ResponseMetadata>',
      '<RequestId>abs123</RequestId>',
    '</ResponseMetadata>',
    '</GetProductCategoriesForASINResponse>',
    sep = ''
)

categories_xml <- xmlTreeParse(example_xml, useInternalNodes = TRUE)
root <- xmlRoot(categories_xml)
category_nodes <- getNodeSet(root, '//Parent')

I would expect `category_nodes` to contain 4 nodes, but instead it contains 0.

hrbrmstr
Matthew Crews
  • In defining the `xpath` you have to take into account the namespace (the second line of the file has the attribute `xmlns=...`). If you manually remove that attribute, you get the desired output with your code. – nicola Nov 21 '15 at 09:38
  • What do I do if I can't just manually remove it? I'm getting this from an API call and I'd rather not manually parse that out of the string. – Matthew Crews Nov 21 '15 at 09:39
  • Try this: `category_nodes <- getNodeSet(root, '//as:Parent', namespaces = c(as="http://mws.amazonservices.com/schema/Products/2011-10-01"))` – bergant Nov 21 '15 at 10:05
  • @bergant you should post an answer. Maybe `getNodeSet(root, '//as:Parent', namespaces = c(as=xmlNamespace(root)))` could be more elegant. – nicola Nov 21 '15 at 10:46
  • Thanks @nicola. `xmlNamespace(root)` is more elegant. Actually I was looking for a duplicate. I think http://stackoverflow.com/questions/24954792/xpath-and-namespace-specification-for-xml-documents-with-an-explicit-default-nam is pretty close? – bergant Nov 21 '15 at 10:58
  • Yes it is. However you can both answer and mark as a dup. – nicola Nov 21 '15 at 11:05
  • @nicola, if you go ahead and submit this answer I will mark it as the correct one. It did solve my problem. I would not consider this a duplicate since this problem is so much more simple than the other question. I looked at that question and still could not figure out what my problem was. – Matthew Crews Nov 21 '15 at 17:21
  • I think that @bergant should post. I'd wait for awhile before posting. Thank you. – nicola Nov 21 '15 at 17:39
  • nicola, ah, you are correct. Sorry I misread who was proposing the answer initially. @bergant, if you post your answer I will mark it as the correct one. – Matthew Crews Nov 21 '15 at 18:01

1 Answer

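You have to qualify the element name with its namespace in the XPath expression. The `xmlns=...` attribute on the root element puts every unprefixed element of the document in that namespace, while an unprefixed name in XPath only matches elements in no namespace: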

getNodeSet(root, '//as:Parent', namespaces = c(as="http://mws.amazonservices.com/schema/Products/2011-10-01"))

As nicola pointed out, you can also read the namespace from the root element instead of hard-coding the URI, which gives you:

getNodeSet(root, '//as:Parent', namespaces = c(as=xmlNamespace(root)))
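
To see the difference end to end, here is a minimal sketch using a trimmed-down copy of the XML from the question (two levels of `<Parent>` instead of four). The prefix `as` is arbitrary. The variable names, and the use of `xmlParse(..., asText = TRUE)` in place of the question's `xmlTreeParse(..., useInternalNodes = TRUE)`, are my own choices for illustration, not part of the original code.

```r
library(XML)

# Trimmed-down version of the response from the question (assumption: two
# <Parent> levels are enough to illustrate the point).
xml_text <- paste0(
  '<?xml version="1.0"?>',
  '<GetProductCategoriesForASINResponse xmlns="http://mws.amazonservices.com/schema/Products/2011-10-01">',
  '<Self><ProductCategoryName>Chicken</ProductCategoryName>',
  '<Parent><ProductCategoryName>Dog</ProductCategoryName>',
  '<Parent><ProductCategoryName>Monkey</ProductCategoryName></Parent>',
  '</Parent></Self>',
  '</GetProductCategoriesForASINResponse>'
)

doc  <- xmlParse(xml_text, asText = TRUE)
root <- xmlRoot(doc)

# Unprefixed name: XPath looks for <Parent> in no namespace, so nothing matches.
no_prefix <- getNodeSet(root, '//Parent')

# Prefixed name: "as" is bound to the document's default namespace URI.
ns <- c(as = "http://mws.amazonservices.com/schema/Products/2011-10-01")
with_prefix <- getNodeSet(root, '//as:Parent', namespaces = ns)

# Every step of the path needs the prefix, not just the first one.
parent_names <- xpathSApply(doc, '//as:Parent/as:ProductCategoryName',
                            xmlValue, namespaces = ns)

length(no_prefix)    # 0
length(with_prefix)  # 2
parent_names         # "Dog" "Monkey"
```

Note that the prefix has to be repeated on every element in the path (`//as:Parent/as:ProductCategoryName`), because the child elements inherit the same default namespace.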
bergant