
I have a file containing multiple XML declarations. Using the approach from this post, I was able to detect the declarations and read each document individually: Parseing XML by R always return XML declaration error. The data comes from: https://www.google.com/googlebooks/uspto-patents-applications-text.html.

### read xml document with more than one <?xml declaration in R

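library(XML)

lines   <- readLines("pa020829.xml")
# Locate the first line of each embedded document
start   <- grep('<?xml version="1.0" encoding="UTF-8"?>', lines, fixed = TRUE)
# Each document ends on the line before the next declaration
end     <- c(start[-1] - 1, length(lines))

# Parse the i-th document from its own block of lines
get.xml <- function(i) {
  txt <- paste(lines[start[i]:end[i]], collapse = "\n")
  xmlTreeParse(txt, asText = TRUE)
}
docs <- lapply(1:10, get.xml)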

> class(docs)
[1] "list"
> class(docs[1])
[1] "list"
> class(docs[[1]])
[1] "XMLDocument"         "XMLAbstractDocument"
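
The index arithmetic in the splitting code above can be checked on a small in-memory example. This is only a base-R sketch: the six toy lines are made up and stand in for the contents of the real file, and `chunks` is a hypothetical name for the per-document strings.

```r
# Toy input: two concatenated XML documents held as lines of text
# (hypothetical content, standing in for readLines("pa020829.xml")).
lines <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<doc><id>1</id></doc>',
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<doc>',
  '<id>2</id>',
  '</doc>'
)

# fixed = TRUE matches the declaration literally, so "?" and "." are not
# interpreted as regular-expression metacharacters.
start <- grep('<?xml version="1.0" encoding="UTF-8"?>', lines, fixed = TRUE)

# Each document ends on the line before the next declaration;
# the last document ends on the final line.
end <- c(start[-1] - 1, length(lines))

# One string per document, each suitable for xmlTreeParse(txt, asText = TRUE).
chunks <- mapply(function(s, e) paste(lines[s:e], collapse = "\n"), start, end)
```

Here `start` holds the positions of the two declarations and `end` the last line of each document, so every element of `chunks` contains exactly one declaration.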

The list docs contains 10 similar documents, docs[[1]], docs[[2]], and so on. I managed to extract the root of a single document and insert its data into a matrix:

root <- xmlRoot(docs[[1]])

d <- rbind(unlist(xmlSApply(root[[1]], function(x) xmlSApply(x, xmlValue))))

However, I need to write code that would automatically retrieve the data of all 10 documents and attach them to a single data frame. I tried the code below but it only retrieves the data of the first document's root and attaches it multiple times to the matrix.

d <- lapply(docs, function(x) rbind(unlist(xmlSApply(root, function(x) xmlSApply(x, xmlValue)))))

I guess I need to change the way I call the root in the function.

Any idea on how to create a matrix with the data from all the documents?

makeyourownmaker
Amleto

1 Answer


The following code will return a matrix containing the data from all the documents:

getXmlInternal <- function(x) {
  rbind(unlist(xmlSApply(xmlRoot(x), function(y) xmlSApply(y, xmlValue))))
}

d <- rbind(lapply(docs, function(x) getXmlInternal(x)))

This fixes the xmlRoot issue you mention by running that command on each of the documents supplied by the lapply command. The lapply command is wrapped in a call to rbind to ensure the output is in a matrix as requested.

The getXmlInternal function is included to make the answer a little more readable.
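
The difference between the question's attempt and this fix can be illustrated without any XML at all. The sketch below is a base-R analogy, not the actual patent data: `values`, `first`, `repeated`, and `per_item` are hypothetical names, and `sum` stands in for the extraction step.

```r
# Three stand-in "documents" (plain numeric vectors, purely illustrative).
values <- list(c(1, 2), c(10, 20), c(100, 200))

# Analogous to computing root once from docs[[1]] outside the loop.
first <- values[[1]]

# Mirrors the question's attempt: the function body ignores its argument x
# and always uses `first`, so every result is the same.
repeated <- lapply(values, function(x) sum(first))

# Mirrors the fix: the result is derived from x, the current element.
per_item <- lapply(values, function(x) sum(x))
```

In the first call every element of the result is computed from the first item, which matches the symptom described in the question. In the second call each element is computed from its own item, just as calling xmlRoot(x) inside the function does for each document.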

makeyourownmaker
  • perhaps explain a bit about what it's doing for the OP? perhaps also add some spacing (there are no extra points on SO for one-liners) – hrbrmstr Oct 12 '18 at 21:10
  • I have added some explanation and reformatted the code a little to improve readability. – makeyourownmaker Oct 12 '18 at 21:28
  • Thank you @makeyourownmaker for this. Unfortunately, it returns a 1 row x 10 column matrix with a list inside each column. I need to unlist the characters inside the columns into multiple rows. Can you help me with that? – Amleto Oct 13 '18 at 17:45
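
One possible way to address the last comment, sketched in base R under stated assumptions: passing a list to rbind as a single argument keeps it as one row of list cells, whereas do.call(rbind, ...) passes each element as a separate argument and stacks them as rows. The `rows` object below is made up and stands in for the output of lapply(docs, getXmlInternal). The sketch assumes every document yields the same number of fields.

```r
# Hypothetical stand-ins for the 1-row matrices that
# lapply(docs, getXmlInternal) would produce (field names are invented).
rows <- list(
  rbind(c(id = "1", title = "A")),
  rbind(c(id = "2", title = "B")),
  rbind(c(id = "3", title = "C"))
)

# rbind() on the list itself: a 1 x 3 matrix whose cells are list elements.
wrapped <- rbind(rows)

# do.call() supplies each element as its own argument, so the
# 1-row matrices are stacked into a 3 x 2 character matrix.
stacked <- do.call(rbind, rows)

# Optional conversion to a data frame with one row per document.
df <- as.data.frame(stacked, stringsAsFactors = FALSE)
```

With the objects from this thread, the equivalent call would presumably be do.call(rbind, lapply(docs, getXmlInternal)). If the documents produce differing numbers of fields, rbind stops with an error about mismatched column counts, and the rows would need to be padded or aligned first.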