I believe the problem lies in the files provided by Google. The "xml" file in the zip file is not well-formed XML. If you look at the unzipped file, you'll see that it starts, quite properly, with the usual XML declaration on line 1, followed by a DOCTYPE:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v44-2013-05-16.dtd" [ ]>
It then gets on with the data, starting with a us-patent-grant root element, a few hundred lines of content, and closing that element on line 593:
<us-patent-grant lang="EN" dtd-version="v4.4 2013-05-16" file="USD0709266-20140722.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20140707" date-publ="20140722">
[a few hundred lines omitted]
</us-patent-grant>
If that were the end of the file, you'd have well-formed XML. However, the ipg140722.xml file is actually a series of well-formed XML files concatenated one after another:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v44-2013-05-16.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.4 2013-05-16" file="USD0709266-20140722.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20140707" date-publ="20140722">
[a few hundred lines omitted]
</us-patent-grant>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v44-2013-05-16.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.4 2013-05-16" file="USD0709267-20140722.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20140707" date-publ="20140722">
[a few hundred lines omitted]
</us-patent-grant>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v44-2013-05-16.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.4 2013-05-16" file="USD0709268-20140722.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20140707" date-publ="20140722">
[a few hundred lines omitted]
</us-patent-grant>
(etc)
The resulting concatenation is not well-formed XML, because an XML document may have only one XML declaration and one root element. That is presumably why R is choking.
If you look, you'll see a new XML declaration on lines 594, 1041, 1555, etc. all through the file to the end. If you paste lines 1-593, 594-1040 or 1041-1554 into an XML syntax checker, such as the one at http://www.w3schools.com/xml/xml_validator.asp , it will report "No errors found."
But paste all of those lines together (lines 1-1554) and you'll get an XML parsing error, "junk after document element".
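The same check can be reproduced locally rather than through a web form. The following is only a minimal sketch in Python (not R), using the standard library's xml.etree.ElementTree; the record string is a tiny made-up stand-in for one patent grant, and the exact error text a given parser reports may differ from the one quoted above.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for one record in the bulk file (hypothetical content,
# not copied from ipg140722.xml).
record = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant lang="EN">\n'
    '  <note>placeholder</note>\n'
    '</us-patent-grant>\n'
)

def is_well_formed(text):
    """Return True if text parses as a single XML document."""
    try:
        # Encode to bytes so the parser honours the encoding declaration.
        ET.fromstring(text.encode("utf-8"))
        return True
    except ET.ParseError:
        return False

single_ok = is_well_formed(record)                 # one document on its own
concatenated_ok = is_well_formed(record + record)  # two documents back to back
print(single_ok, concatenated_ok)  # prints "True False"
```

One record alone parses, while two records back to back do not, which is the situation in the bulk file.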
To process the data as XML, you'll need to find some way to split out the portion you need into a well-formed XML file of its own, for example by cutting the file at each XML declaration.
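One possible way to do the splitting is sketched below in Python (not R). It assumes that every document in the bulk file begins with an XML declaration at the start of a line, as in the listing above; split_documents is a hypothetical helper name, and the in-memory sample is a shortened stand-in for the real file.

```python
import io
import xml.etree.ElementTree as ET

XML_DECL = "<?xml "

def split_documents(lines):
    """Yield each concatenated XML document as one string.

    Assumes every document starts with an XML declaration at the
    beginning of a line.
    """
    buffer = []
    for line in lines:
        # A new declaration marks the start of the next document.
        if line.startswith(XML_DECL) and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)

# Shortened stand-in for the bulk file: two documents back to back.
sample = io.StringIO(
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant file="USD0709266-20140722.XML"></us-patent-grant>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant file="USD0709267-20140722.XML"></us-patent-grant>\n'
)

# Each chunk is now well-formed on its own and can be parsed separately.
files = [
    ET.fromstring(doc.encode("utf-8")).get("file")
    for doc in split_documents(sample)
]
print(files)
```

For the real data, an open file handle (for example open("ipg140722.xml", encoding="utf-8")) could be passed in place of sample. Because the generator reads line by line, only one document is held in memory at a time. Each chunk could also be written to its own file and then read from R.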