parse xml with many roots using BeautifulSoup

Question

I am trying to parse a large xml file downloaded from Google using BS4. However, the file is constructed with many roots so that the xml parser can only parse in the first block.

I load the file using the following command

xml = BeautifulSoup("test.xml", "xml")

The test.xml file looks like below, it has many roots:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" .....>
A LOT of information
</us-patent-grant>

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-24.dtd" [ ]>
<us-patent-grant lang="EN" .....>
A LOT of information
</us-patent-grant>

.......

The html parser can read in the full file. However, a regular such file contains over 10k roots. Reading using html parser is slow and eats all my memory. Is there a way to get around this problem?

Any help is appreciated.

What's your code after that? i.e. how are you trying to retrieve the blocks? — khampson, Nov 21 '14 at 04:00
What do you mean by "the first chunk of the file?" Also, can you provide a sample XML file (via a link or otherwise). — Austin Hartzheim, Nov 21 '14 at 04:01
@khampson, I just print my xml and it only has the first block, instead of the full file. I am suspecting the second line of the tag may be the problem, but I know little of xml format. — Zhen Sun, Nov 21 '14 at 04:01
Oh, OK. So really what you have there is multiple xml files concatenated into one. That's really how they're provided by the Google API? That seems unusual... As @GuyGavriely suggested, *lxml* would be a good choice, since it's a Python wrapper around a C-based parser, which should be much faster. — khampson, Nov 21 '14 at 05:23
@ZhenSun Because you specifically mentioned me, I will note that the "multiple root" issue that Guy Gavriely explains below was the motivation behind my question. Because of that issue, I wasn't able to (in my brief attempt) make `lxml` parse the document either. It might be easier to reformat the document instead. Otherwise, you might consider trying one of [these XML parsers](https://wiki.python.org/moin/PythonXml). — Austin Hartzheim, Nov 21 '14 at 06:13

Guy Gavriely · Answer 1 · 2014-11-21T04:10:04.300

1

a valid xml file has only one root, either add that single root to the file or tell the parser to parse it as "html" (this is the default) for example:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(open("test.xml"), "xml")
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd">
<us-patent-grant lang="EN">
1
</us-patent-grant>
>>> BeautifulSoup(open("test.xml"))
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd">
<html><body><p>]&gt;
<us-patent-grant lang="EN">
1
</us-patent-grant>
<us-patent-grant lang="EN">
2
</us-patent-grant>
</p></body></html>
>>>

edited Nov 21 '14 at 04:10

answered Nov 21 '14 at 04:03

Guy Gavriely

11,228
6
27
42

Thanks! yeah "html" parser can read in the full file. However, a regular file contains about 10k such blocks. Reading using "html" takes for ever and eats all my memory. I am wondering whether a correct way of "xml" parser can improve on that. – Zhen Sun Nov 21 '14 at 04:25
for big files consider using `lxml` http://lxml.de/ or break that file to smaller files or add that single root as suggeted – Guy Gavriely Nov 21 '14 at 04:28
How do I add a single root to the file? Do I need to remove all the other tags in the file? – Zhen Sun Nov 21 '14 at 04:39
if that header line, the one that start with – Guy Gavriely Nov 21 '14 at 04:48
Thanks. Then I don't see an easy way of doing this. Is there a fast way of reading the ``xml`` file as a text file and then find and change? – Zhen Sun Nov 21 '14 at 04:57
using `sed` http://stackoverflow.com/questions/5410757/delete-a-line-containing-a-specific-string-using-sed or you have to use python? – Guy Gavriely Nov 21 '14 at 05:07
I am not familiar with that ``sed``... Thanks for your help! – Zhen Sun Nov 21 '14 at 05:10

parse xml with many roots using BeautifulSoup

1 Answers1