I'm having a problem using `lapply` and `xml_find_first` from the `xml2` package to pull nodes from a list of XML objects. I'm pulling a few thousand records from the Scopus API. Since I can only get 25 records at a time, I end up with a list of 100+ elements of 25 records each. I know a few of the records have missing values, so my goal is to reshape things until I have a list where each record is its own element, then use `lapply` and `xml_find_first` so that I get null values where appropriate. The problem is that I end up pulling repeated values, as if everything were still nested in the original lists.
Here's a reproducible example: a list of 2 elements with 2 records each, with `citedby-count` missing from the last one:
```{r}
library(xml2)
# Simulate how data come in from Scopus
# Build 2 list elements, 2 entries each
el1 <- read_xml(
"<feed>
<blah>Bunch of stuff I don't need</blah>
<blah>Bunch of other stuff I don't need</blah>
<entry>
<eid>2-s2.0-1542382496</eid>
<citedby-count>9385</citedby-count>
</entry>
<entry>
<eid>2-s2.0-0032721879</eid>
<citedby-count>4040</citedby-count>
</entry>
</feed>"
)
el2 <- read_xml( # This one's missing citedby-count for last entry
"<feed>
<blah>Bunch of stuff I don't need</blah>
<blah>Bunch of other stuff I don't need</blah>
<entry>
<eid>2-s2.0-0041751098</eid>
<citedby-count>3793</citedby-count>
</entry>
<entry>
<eid>2-s2.0-73449149291</eid>
</entry>
</feed>"
)
# Combine into list
lst <- list(el1,el2)
# Check
lst
```
This gives me:
My goal is to pull out the entries so that each one is its own list item. That way, `xml_find_first` should put a null value in for the entry where `citedby-count` is missing.
```{r}
# Pull entry nodes
lst2 <- lapply(lst, xml_find_all, "//entry")
# Unlist
lst2 <- unlist(lst2, recursive=FALSE)
# Check - each entry is its own element
lst2
```
The hangup comes when I try to extract a node that I know is missing from some entries, in a way that leaves a null where it's missing. `xml_find_first` should do that. But...
```{r}
cbc <- lapply(lst2, xml_find_first, "//citedby-count")
cbc <- lapply(cbc, xml_text)
cbc # Repeats the first values of original nesting
```
So I checked what would happen with `xml_find_all`:
```{r}
cbc2 <- lapply(lst2, xml_find_all, "//citedby-count")
cbc2 <- lapply(cbc2, xml_text)
cbc2 # Elements contain all values from initial nesting
```
This makes no sense compared with the output of `lst2` above. For some reason, pulling the text retains the values from the initial nesting, even though that nesting doesn't show up when looking at the final list of XML objects. I'm stumped.
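One XPath detail that may be relevant here: a path beginning with `//` is evaluated from the document root regardless of the context node, whereas `.//` is evaluated relative to the current node. The sketch below is only a minimal illustration of that difference, not a confirmed diagnosis of the behavior above. The names `doc`, `entries`, `abs_txt`, and `rel_txt` are made up for the example, and it works on a single nodeset rather than on the unlisted `lst2`.

```{r}
library(xml2)

# Hypothetical two-entry feed; the second entry lacks citedby-count
doc <- read_xml(
"<feed>
<entry>
<eid>2-s2.0-0041751098</eid>
<citedby-count>3793</citedby-count>
</entry>
<entry>
<eid>2-s2.0-73449149291</eid>
</entry>
</feed>"
)

# One node per entry
entries <- xml_find_all(doc, "//entry")

# Absolute path: searched from the document root for every entry,
# so each entry reports the first citedby-count in the whole document
abs_txt <- xml_text(xml_find_first(entries, "//citedby-count"))
abs_txt # "3793" "3793"

# Relative path: searched within each entry only,
# so the entry without a citedby-count yields NA
rel_txt <- xml_text(xml_find_first(entries, ".//citedby-count"))
rel_txt # "3793" NA
```

If that is what is happening, the repeated values would come from the leading `//` in the XPath rather than from how the list was unnested, since each node still belongs to its original document.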