Incorrectly extracting repeated values from list of XML objects in R

Question

I'm having a problem using lapply and xml_find_first from the xml2 package to pull nodes from a list of XML objects. I'm pulling a few thousand records from the Scopus API. Since I can only get 25 records at a time, I end up with a list of 100+ elements of 25 records each. I know a few of the records have missing values, so my goal is to shuffle things around until I get a list where each record is its own element, then use lapply and xml_find_first so that I get null values where appropriate. The problem is that I end up pulling repeated values, as if everything were still nested in the initial lists.

Here's a reproducible example with a list of 2 elements with 2 records each, with citedby-count missing from the last one:

```{r}
library(xml2)

# Simulate how data come in from Scopus
# Build 2 list elements, 2 entries each
el1 <- read_xml(
"<feed>
  <blah>Bunch of stuff I don't need</blah>
  <blah>Bunch of other stuff I don't need</blah>
  <entry>
    <eid>2-s2.0-1542382496</eid>
    <citedby-count>9385</citedby-count>
  </entry>
  <entry>
    <eid>2-s2.0-0032721879</eid>
    <citedby-count>4040</citedby-count>
  </entry>
</feed>"
)
el2 <- read_xml( # This one's missing citedby-count for last entry
"<feed>
  <blah>Bunch of stuff I don't need</blah>
  <blah>Bunch of other stuff I don't need</blah>
  <entry>
    <eid>2-s2.0-0041751098</eid>
    <citedby-count>3793</citedby-count>
  </entry>
  <entry>
    <eid>2-s2.0-73449149291</eid>
  </entry>
</feed>"
)
# Combine into list
lst <- list(el1,el2)
# Check
lst
```

This gives me:

My goal is to pull out the entries so they are list items. This way, xml_find_first should stick a null value in for the entry where citedby-count is missing.

```{r}
# Pull entry nodes
lst2 <- lapply(lst, xml_find_all, "//entry")
# Unlist
lst2 <- unlist(lst2, recursive=FALSE)
# Check - each entry is its own element
lst2
```

The hangup is when I try to extract a node that I know is missing in some of the entries in a way that will leave a null where it's missing. xml_find_first should do that. But...

```{r}
cbc <- lapply(lst2, xml_find_first, "//citedby-count")
cbc <- lapply(cbc, xml_text)
cbc # Repeats the first values of original nesting
```

So I checked what would happen with xml_find_all:

```{r}
cbc2 <- lapply(lst2, xml_find_all, "//citedby-count")
cbc2 <- lapply(cbc2, xml_text)
cbc2 # Elements contain all values from initial nesting
```

Which makes no sense in comparison with the output of lst2 above. For some reason, pulling the text retains the values from the initial nesting, even though it doesn't show up when looking at the final list of xml objects. I'm stumped.

To solve your problem use a "." before the //. From the xml2 documentation: "# Note the difference between .// and // # // finds anywhere in the document (ignoring the current node) # .// finds anywhere beneath the current node" — Dave2e, May 08 '18 at 17:45
@Dave2e Oh, wow, I remember reading that weeks ago and it completely slipped my mind. Yep, that worked. If you want to submit as an answer, I'll accept it. Thanks! — bcarothers, May 08 '18 at 18:17

score 1 · Accepted Answer · answered May 08 '18 at 18:30

Indeed, as @Dave2e comments, do not use the "anywhere" XPath search with // (shorthand for the descendant-or-self axis) to reach child elements. An expression that starts with // is an absolute path, so the search runs on the entire document no matter which node you call it from.
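A minimal sketch of that behaviour, using a cut-down, made-up feed (the document and the names `doc`, `entries`, and `second` are illustrative, not part of the original data). Both forms are called on the second entry, which has no citedby-count of its own:

```r
library(xml2)

# Hypothetical two-entry feed; only the first entry has a citedby-count
doc <- read_xml(
"<feed>
  <entry>
    <eid>a</eid>
    <citedby-count>1</citedby-count>
  </entry>
  <entry>
    <eid>b</eid>
  </entry>
</feed>"
)

entries <- xml_find_all(doc, "//entry")
second  <- entries[[2]]

# Absolute path: starts at the document root, not at `second`
abs_hits <- xml_find_all(second, "//citedby-count")
# Relative path: starts at `second` and looks only beneath it
rel_hits <- xml_find_all(second, ".//citedby-count")

xml_text(abs_hits)  # "1", taken from the first entry
length(rel_hits)    # 0
```

The absolute form returns a value that belongs to a different entry, which is the same effect as the repeated values in the question.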

How can this be if I do not explicitly call the original document? If you run str() on any of your xml_find lists, you will see that each object carries Rcpp external pointers to both the current node and its document, which are available for recall as needed. In fact, I believe the node pointer displays when calling the list.

```{r}
str(lst2)
# List of 4
#  $ :List of 2
#   ..$ node:<externalptr>
#   ..$ doc :<externalptr>
#   ..- attr(*, "class")= chr "xml_node"
#  $ :List of 2
#   ..$ node:<externalptr>
#   ..$ doc :<externalptr>
#   ..- attr(*, "class")= chr "xml_node"
#  $ :List of 2
#   ..$ node:<externalptr>
#   ..$ doc :<externalptr>
#   ..- attr(*, "class")= chr "xml_node"
#  $ :List of 2
#   ..$ node:<externalptr>
#   ..$ doc :<externalptr>
#   ..- attr(*, "class")= chr "xml_node"

lst2[[1]]$doc
# <pointer: 0x000000000ca7ff90>

typeof(lst2[[1]]$doc)
# [1] "externalptr"
```

Therefore, be careful of context when searching. To retrieve child elements, you can use either the dot prefix, .// (as @Dave2e advises), or no slashes at all; for this data the two are equivalent.

```{r}
cbc2 <- lapply(lst2, xml_find_all, "citedby-count")
cbc2 <- lapply(cbc2, xml_text)
cbc2
# [[1]]
# [1] "9385"

# [[2]]
# [1] "4040"

# [[3]]
# [1] "3793"

# [[4]]
# character(0)
```


```{r}
cbc2 <- lapply(lst2, xml_find_all, ".//citedby-count")
cbc2 <- lapply(cbc2, xml_text)
cbc2
# [[1]]
# [1] "9385"

# [[2]]
# [1] "4040"

# [[3]]
# [1] "3793"

# [[4]]
# character(0)
```
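
Since the original aim was exactly one value per entry, with a placeholder where citedby-count is absent, xml_find_first combined with the relative path may be a closer fit than xml_find_all. The sketch below assumes a cut-down, hypothetical version of the question's data (the names `feeds` and `cbc` and the shortened eid values are illustrative). Note that xml_text gives NA for an entry with no match, rather than NULL or character(0).

```r
library(xml2)

# Hypothetical stand-ins for two pages of results, two entries each;
# the last entry has no citedby-count
feeds <- list(
  read_xml("<feed>
    <entry><eid>e1</eid><citedby-count>9385</citedby-count></entry>
    <entry><eid>e2</eid><citedby-count>4040</citedby-count></entry>
  </feed>"),
  read_xml("<feed>
    <entry><eid>e3</eid><citedby-count>3793</citedby-count></entry>
    <entry><eid>e4</eid></entry>
  </feed>")
)

# For each page: find the entries, then the first citedby-count beneath
# each entry. xml_find_first returns one result per input node, using a
# missing node where nothing matches, and xml_text turns that into NA.
cbc <- unlist(lapply(feeds, function(feed) {
  entries <- xml_find_all(feed, "//entry")
  xml_text(xml_find_first(entries, ".//citedby-count"))
}))

cbc
# [1] "9385" "4040" "3793" NA
```

Because the result has one element per entry, it can be lined up with other per-entry fields (such as eid) extracted the same way.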

Do note that .// searches ALL descendants (i.e., children, grandchildren, etc.) of the current node, whereas a bare name such as citedby-count matches only its direct children. See "What is the difference between .// and //* in XPath?"
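
To illustrate the difference between direct children and all descendants, here is a small sketch with a made-up entry in which citedby-count appears both as a direct child and one level deeper. The `<nested>` wrapper element and the values are hypothetical, invented for this example; the Scopus data in the question has no such nesting.

```r
library(xml2)

# Hypothetical entry: one citedby-count as a child, one as a grandchild
entry <- read_xml(
"<entry>
  <citedby-count>10</citedby-count>
  <nested>
    <citedby-count>20</citedby-count>
  </nested>
</entry>"
)

# Bare name: direct children of the current node only
children_only <- xml_text(xml_find_all(entry, "citedby-count"))
children_only
# [1] "10"

# .// : every descendant of the current node, at any depth
all_descendants <- xml_text(xml_find_all(entry, ".//citedby-count"))
all_descendants
# [1] "10" "20"
```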
