
I am new to XML. I downloaded an XML file called ipg140722.xml from Google's bulk patent download page, http://www.google.com/googlebooks/uspto-patents-grants-text.html . I am using Windows 8.1 and R 3.1.1:

library(XML)
url<- "E:\\clouddownload\\R-download\\ipg140722.xml"
indata<- xmlTreeParse(url)

error: 1: XML declaration allowed only at the start of the document
2: Extra content at the end of the document

  What is the problem?
user3904239
    Surely you can't expect us to help you without seeing the document?? Upload it somewhere and provide a link in your question. – jlhoward Aug 03 '14 at 18:10
  • 1
    I guess its from [here](http://www.google.com/googlebooks/uspto-patents-grants-text.html) and on linux the unziped content `grep -c "xml version" ipg140722.xml` has 6984 XML documents. Again on linux one could use awk to [break these into separate files](http://stackoverflow.com/questions/18472425/awk-split-files-into-smaller-files-on-pattern), but probably it's time to ask what the intention is? – Martin Morgan Aug 03 '14 at 18:31

1 Answer


Note: This post is edited from the original version.

The object lesson here is that just because a file has an xml extension does not mean it is well formed XML.

If @MartinMorgan is correct about the file, Google seems to have taken all the patents approved during the week of 2014-07-22 (last week), converted them to XML, strung them together into a single text file, and given that file an xml extension. Clearly this is not well-formed XML. So the challenge is to deconstruct that file. Here is a way to do it in R.

lines   <- readLines("ipg140722.xml")
start   <- grep('<?xml version="1.0" encoding="UTF-8"?>',lines,fixed=T)
end     <- c(start[-1]-1,length(lines))
library(XML)
get.xml <- function(i) {
  txt <- paste(lines[start[i]:end[i]],collapse="\n")
  # print(i)
  xmlTreeParse(txt,asText=T,useInternalNodes=T)
  # return(i)
}
docs <- lapply(1:10,get.xml)
class(docs[[1]])
# [1] "XMLInternalDocument" "XMLAbstractDocument"

So now docs is a list of parsed XML documents. These can be accessed individually as, e.g., docs[[1]], or collectively using something like the code below, which extracts the invention title from each document.

sapply(docs,function(doc) xmlValue(doc["//invention-title"][[1]]))
#  [1] "Phallus retention harness"                          "Dress/coat"                                        
#  [3] "Shirt"                                              "Shirt"                                             
#  [5] "Sandal"                                             "Shoe"                                              
#  [7] "Footwear"                                           "Flexible athletic shoe sole"                       
#  [9] "Shoe outsole with a surface ornamentation contrast" "Shoe sole"                                         

And no, I did not make up the name of the first patent.

Response to OP's comment

My original post, which detected the start of a new document using:

start   <- grep("xml version",lines,fixed=T)

was too naive: it turns out the phrase "xml version" appears in the text of some of the patents, so this was breaking (some of) the documents prematurely, resulting in malformed XML. The code above fixes that problem. If you uncomment the two lines in the function get.xml(...) and run the code above with

docs <- lapply(1:length(start),get.xml)

you will see that all 6961 documents parse correctly.

But there is another problem: the parsed XML is very large, so if you leave those lines as comments and try to parse the full set, you run out of memory about halfway through (or I did, on an 8GB system). There are two ways to work around this. The first is to do the parsing in blocks (say, 2000 documents at a time). The second is to extract whatever information you need for your CSV file inside get.xml(...) and discard the parsed document at each step.
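The second approach can be sketched as below. This is a minimal sketch, not a drop-in solution: the field names and XPath expressions (invention-title, publication-reference//doc-number) are illustrative assumptions about what you might want in the CSV, and it assumes lines and start have already been built as shown earlier.

```r
library(XML)

# Sketch: extract the needed fields inside the function and discard the
# parsed tree each time, so memory use stays bounded.
# The XPaths below are illustrative assumptions, not a fixed schema.
get.fields <- function(i) {
  txt <- paste(lines[start[i]:end[i]], collapse = "\n")
  doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
  row <- data.frame(
    title      = xpathSApply(doc, "//invention-title", xmlValue)[1],
    doc.number = xpathSApply(doc, "//publication-reference//doc-number",
                             xmlValue)[1],
    stringsAsFactors = FALSE
  )
  free(doc)   # release the internal document's memory before the next one
  row
}

# df <- do.call(rbind, lapply(seq_along(start), get.fields))
# write.csv(df, "patents.csv", row.names = FALSE)
```

Because only a small data frame row survives each iteration, the full set of ~7000 documents never needs to be held in memory at once.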

jlhoward
  • Another thing that puzzles me is that I need to parse all the patents so that I can convert them to a csv file. Your solution parsed 10 patents; how can I know how many patents need to be lapply-ed? Sorry, I am new to XML and R – user3904239 Aug 04 '14 at 03:24
  • Use `docs <- lapply(1:length(start),get.xml)` but be forewarned - this will take a *long* time. There are almost 7000 patents. – jlhoward Aug 04 '14 at 03:36
  • I tried the code you told me. However, it shows an error that seems related to end tags: Error: 1: Premature end of data in tag row line 6264 2: Premature end of data in tag tbody line 6263 3: Premature end of data in tag tgroup line 6256 4: Premature end of data in tag table line 6255 5: Premature end of data in tag tables line 6254 6: Premature end of data in tag p line 6253 7: Premature end of data in tag description line 3634 8: Premature end of data in tag us-patent-grant line 3 – user3904239 Aug 04 '14 at 07:11
  • Just out of curiosity, why `which(grepl(...))` and not `grep(...)`? – Rich Scriven Aug 04 '14 at 17:20
  • @RichardScriven You're right - they both yield identical results. I'm just more used to `grepl(...)`. But `grep(...)` is simpler; I'll change it. – jlhoward Aug 04 '14 at 18:05
  • Probably not much difference anyhow. `.Internal(grep)` probably calls `.Internal(which)` or `match` at some point. – Rich Scriven Aug 04 '14 at 18:24
  • @RichardScriven Actually, when you brought it up I profiled both using `microbenchmark(...)`. The median execution time for both is the same to within 0.1%. – jlhoward Aug 04 '14 at 18:49
  • @jlhoward Thanks, it worked. But I am afraid I have to bother you again with the subsequent problems. I parsed the first 1000 patents using `docs1 <- lapply(1:1000,get.xml)`, but docs1 is a list. I tried using a loop and rbind with xmlRoot, yet an error showed again: `for(i in 1:1000){ rootmode<-xmlRoot(docs1[[i]]); rootdata<-rbind(rootdata,rootmode) }` Error in xmlChildren(x)[[...]] : subscript out of bounds ## How can I get into the xml nodes and extract them as lists? – user3904239 Aug 05 '14 at 19:06
  • @jlhoward Also, I tried to convert these lists into a csv file. Somehow, I succeeded, but the csv file only consisted of two cols and that is not what I want. – user3904239 Aug 05 '14 at 19:10
  • Right, as the answer says, docs1 will be a list of parsed XML documents. To access, e.g., the first document in that list, use `docs1[[1]]`. It's not clear what you mean by "convert to a csv file". IMO you should post a new question with an explanation of what this csv file should look like. Then maybe someone will show you how to do it. – jlhoward Aug 05 '14 at 19:28