I'm following the answer to "Parseing XML by R always return XML declaration error":

lines   <- readLines("ipg140722.xml")
start   <- grep('<?xml version="1.0" encoding="UTF-8"?>',lines,fixed=T)
end     <- c(start[-1]-1,length(lines))
library(XML)
get.xml <- function(i) {
  txt <- paste(lines[start[i]:end[i]],collapse="\n")
  # print(i)
  xmlTreeParse(txt,asText=T)
  # return(i)
  }
docs <- lapply(1:5,get.xml)
class(docs[[1]])

The code parses XML files from Google Patents (the file is here), and it appears to work in that I can select individual patents. However, when I run the following:

 sapply(docs, function(doc) xmlValue(doc["//invention-title"][[1]]))
 [1] NA NA NA NA NA

It does not return the invention titles as it does in the answer, but instead gives me five NAs. Any help would be appreciated.

If I run docs[[2]], it outputs the entire contents of the second patent in the list. The relevant information that I want to extract appears as:

<invention-title id="d2e73">Dress/coat</invention-title>

but "Dress/coat" comes back as one of the five NAs.

  • It would help if you could make a minimal [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) example that does not rely on downloading a 100MB xml file. I'm assuming you've extracted one document correctly. What does it look like? Does it use namespaces? – MrFlick May 15 '15 at 14:52

1 Answer

I believe the problem lies in the files provided by Google. The "xml" file in the zip file is not well-formed XML. If you look at the unzipped file, you'll see that it starts, quite properly, with the usual XML declaration on line 1:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v44-2013-05-16.dtd" [ ]>

It then gets on with the data, starting with a us-patent-grant root element, a few hundred lines of content, and closing that element on line 593:

<us-patent-grant lang="EN" dtd-version="v4.4 2013-05-16" file="USD0709266-20140722.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20140707" date-publ="20140722">
  [a few hundred lines omitted]
</us-patent-grant>

If that were the end of the file, you'd have well-formed XML. However, the ipg140722.xml file is actually a series of well-formed XML documents concatenated one after another:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v44-2013-05-16.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.4 2013-05-16" file="USD0709266-20140722.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20140707" date-publ="20140722">
  [a few hundred lines omitted]
</us-patent-grant>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v44-2013-05-16.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.4 2013-05-16" file="USD0709267-20140722.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20140707" date-publ="20140722">
  [a few hundred lines omitted]
</us-patent-grant>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v44-2013-05-16.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.4 2013-05-16" file="USD0709268-20140722.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20140707" date-publ="20140722">
  [a few hundred lines omitted]
</us-patent-grant>
(etc)

The resulting concatenation is not well-formed XML and is presumably why R is choking.

If you look, you'll see a new XML declaration on lines 594, 1041, 1555, etc., all through the file to the end. If you paste lines 1-593, 594-1040 or 1041-1554 into an XML syntax checker, such as the one at http://www.w3schools.com/xml/xml_validator.asp, it will report "No errors found."

But if you paste all of those lines together (lines 1-1554), you'll get an XML parsing error: "junk after document element".
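The same check can be reproduced in R on a small scale. This is only a sketch: doc_a, doc_b and is_well_formed are made-up names for illustration, the two documents are tiny stand-ins rather than real patent data, and it assumes the XML package's xmlParse signals an R error when the text is not well-formed.

```r
library(XML)

# Two tiny stand-in documents (hypothetical data, not taken from the patent file).
doc_a <- '<?xml version="1.0" encoding="UTF-8"?>\n<a>first</a>'
doc_b <- '<?xml version="1.0" encoding="UTF-8"?>\n<b>second</b>'

# Returns TRUE if the text parses as one well-formed document, FALSE otherwise.
is_well_formed <- function(txt) {
  tryCatch({
    xmlParse(txt, asText = TRUE)
    TRUE
  }, error = function(e) FALSE)
}

alone_a  <- is_well_formed(doc_a)                           # each piece on its own
alone_b  <- is_well_formed(doc_b)
together <- is_well_formed(paste(doc_a, doc_b, sep = "\n")) # the concatenation
```

Each piece should parse on its own, while the concatenation should be rejected because a second XML declaration appears after the first root element has closed.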

To process the data as XML, you'll need to split the file so that each portion you need is a well-formed XML document of its own.
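One way to do the splitting, as a minimal sketch: start a new chunk at every line that contains the XML declaration, then parse each chunk separately. split_xml_docs is a hypothetical helper, and the four-line lines vector is a stand-in for readLines("ipg140722.xml"). Parsing with xmlParse and querying with xpathSApply is my assumption about how you'd want to use the chunks, not something taken from the file itself.

```r
library(XML)

# Split a character vector of lines into one string per embedded document.
# A new chunk starts at every line containing the XML declaration.
split_xml_docs <- function(lines) {
  starts <- grep('<?xml version="1.0" encoding="UTF-8"?>', lines, fixed = TRUE)
  if (length(starts) == 0) return(list())
  ends <- c(starts[-1] - 1, length(lines))
  Map(function(s, e) paste(lines[s:e], collapse = "\n"), starts, ends)
}

# Hypothetical stand-in for readLines("ipg140722.xml").
lines <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<us-patent-grant><invention-title id="t1">First</invention-title></us-patent-grant>',
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<us-patent-grant><invention-title id="t2">Second</invention-title></us-patent-grant>'
)

chunks <- split_xml_docs(lines)

# Parse each chunk as its own document, then pull the first title with XPath.
docs   <- lapply(chunks, xmlParse, asText = TRUE)
titles <- sapply(docs, function(doc) xpathSApply(doc, "//invention-title", xmlValue)[1])
```

xmlParse returns an internal document, which is what xpathSApply expects. For the real file you would probably want to parse only the chunks you need rather than all of them at once.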

codingatty