I have a file containing multiple XML declarations. Using the approach from the post "Parseing XML by R always return XML declaration error", I was able to detect them and read each document individually. The data comes from: https://www.google.com/googlebooks/uspto-patents-applications-text.html.
### read xml document with more than one <?xml declaration in R
library(XML)
lines <- readLines("pa020829.xml")
start <- grep('<?xml version="1.0" encoding="UTF-8"?>', lines, fixed=TRUE)
end <- c(start[-1]-1, length(lines))
get.xml <- function(i) {
  txt <- paste(lines[start[i]:end[i]], collapse="\n")
  # print(i)
  xmlTreeParse(txt, asText=TRUE)
  # return(i)
}
docs <- lapply(1:10, get.xml)
> class(docs)
[1] "list"
> class(docs[1])
[1] "list"
> class(docs[[1]])
[1] "XMLDocument" "XMLAbstractDocument"
The list docs contains 10 similar documents: docs[[1]], docs[[2]], and so on. I managed to extract the root of a single document and put its values into a one-row matrix:
root <- xmlRoot(docs[[1]])
d <- rbind(unlist(xmlSApply(root[[1]], function(x) xmlSApply(x, xmlValue))))
However, I need code that automatically retrieves the data from all 10 documents and combines them into a single data frame. I tried the code below, but it only retrieves the data from the first document's root and repeats it once for each document.
d <- lapply(docs, function(x) rbind(unlist(xmlSApply(root, function(x) xmlSApply(x, xmlValue)))))
I guess I need to change the way I call the root in the function.
Any idea on how to create a matrix with the data from all the documents?
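A minimal sketch of the change guessed at above: call xmlRoot() on the function's argument instead of reusing the global root, then stack the per-document results with do.call(rbind, ...). The two toy documents and their element names (app, biblio, num, title) are made up for illustration and are far simpler than the real patent files, and get.row is a hypothetical helper name. The extraction inside it would need to be swapped for whatever expression works on a single real document.

```r
library(XML)

# Two toy documents standing in for the patent applications
# (hypothetical element names, not the real USPTO structure).
txt <- c(
  '<?xml version="1.0" encoding="UTF-8"?><app><biblio><num>001</num><title>Widget</title></biblio></app>',
  '<?xml version="1.0" encoding="UTF-8"?><app><biblio><num>002</num><title>Gadget</title></biblio></app>'
)
docs <- lapply(txt, xmlTreeParse, asText = TRUE)

# Take the root of each document inside the function, so every call
# works on its own document rather than on a global `root`.
get.row <- function(doc) {
  root <- xmlRoot(doc)
  unlist(xmlSApply(root[[1]], xmlValue))
}

rows <- lapply(docs, get.row)

# Stack the per-document vectors into one matrix, then a data frame.
d <- do.call(rbind, rows)
df <- as.data.frame(d, stringsAsFactors = FALSE)
```

This assumes every document yields the same fields in the same order. If the documents differ in length, rbind() recycles the shorter vectors, so the rows would need to be aligned by name before stacking.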