
I am trying to parse a large XML file downloaded from Google using BeautifulSoup 4 (BS4). However, the file contains many root elements, so the XML parser only reads the first block.

I load the file using the following command:

xml = BeautifulSoup(open("test.xml"), "xml")  # pass the file object, not the file name string

The test.xml file looks like the sample below. It has many roots:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" .....>
A LOT of information
</us-patent-grant>

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-24.dtd" [ ]>
<us-patent-grant lang="EN" .....>
A LOT of information
</us-patent-grant>

.......

The HTML parser can read the full file. However, a typical file of this kind contains over 10k roots, and reading it with the HTML parser is slow and uses up all my memory. Is there a way to get around this problem?

Any help is appreciated.

Zhen Sun
  • What's your code after that? i.e. how are you trying to retrieve the blocks? – khampson Nov 21 '14 at 04:00
  • What do you mean by "the first chunk of the file?" Also, can you provide a sample XML file (via a link or otherwise). – Austin Hartzheim Nov 21 '14 at 04:01
  • @khampson, I just print my xml and it only has the first block, instead of the full file. I am suspecting the second line of the tag may be the problem, but I know little of xml format. – Zhen Sun Nov 21 '14 at 04:01
  • Oh, OK. So really what you have there is multiple xml files concatenated into one. That's really how they're provided by the Google API? That seems unusual... As @GuyGavriely suggested, *lxml* would be a good choice, since it's a Python wrapper around a C-based parser, which should be much faster. – khampson Nov 21 '14 at 05:23
  • @ZhenSun Because you specifically mentioned me, I will note that the "multiple root" issue that Guy Gavriely explains below was the motivation behind my question. Because of that issue, I wasn't able to (in my brief attempt) make `lxml` parse the document either. It might be easier to reformat the document instead. Otherwise, you might consider trying one of [these XML parsers](https://wiki.python.org/moin/PythonXml). – Austin Hartzheim Nov 21 '14 at 06:13

1 Answer


A valid XML file has only one root. Either add a single root to the file, or tell BeautifulSoup to parse it as HTML (the default when no parser is specified). For example:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(open("test.xml"), "xml")
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd">
<us-patent-grant lang="EN">
1
</us-patent-grant>
>>> BeautifulSoup(open("test.xml"))
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd">
<html><body><p>]&gt;
<us-patent-grant lang="EN">
1
</us-patent-grant>
<us-patent-grant lang="EN">
2
</us-patent-grant>
</p></body></html>
>>> 
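If the HTML parser is too slow for a file with thousands of blocks, another option is to split the file into its individual documents and parse them one at a time, so that only one block is in memory at once. The following is a minimal sketch, not a tested solution for the real patent files. It assumes every document begins with an `<?xml` declaration at the start of a line, as in the sample, and it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. The names `iter_documents` and `iter_roots` are made up for this example.

```python
import xml.etree.ElementTree as ET


def iter_documents(path):
    """Yield each concatenated XML document in `path` as a bytes object."""
    chunk = []
    with open(path, "rb") as fh:
        for line in fh:
            # Assumption: a new "<?xml" declaration marks the start of the
            # next document, so everything collected so far is one document.
            if line.lstrip().startswith(b"<?xml") and chunk:
                yield b"".join(chunk)
                chunk = []
            chunk.append(line)
    if chunk:
        yield b"".join(chunk)


def iter_roots(path):
    """Parse the documents one at a time and yield each root element."""
    for doc in iter_documents(path):
        doc = doc.strip()
        if doc:  # skip chunks that contain only whitespace
            yield ET.fromstring(doc)


# Hypothetical usage:
# for root in iter_roots("test.xml"):
#     print(root.tag, root.get("lang"))
```

Each chunk could also be passed to `BeautifulSoup(doc, "xml")` instead of `ET.fromstring(doc)` if the BeautifulSoup API is preferred. Note that if the real data uses entities defined only in the external DTD, the parser may reject them, since the DTD is not loaded here.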
Guy Gavriely
  • Thanks! Yes, the "html" parser can read the full file. However, a typical file contains about 10k such blocks. Reading it with "html" takes forever and eats all my memory. I am wondering whether there is a correct way to use the "xml" parser that improves on that. – Zhen Sun Nov 21 '14 at 04:25
  • For big files, consider using `lxml` http://lxml.de/, or break the file into smaller files, or add a single root as suggested. – Guy Gavriely Nov 21 '14 at 04:28
  • How do I add a single root to the file? Do I need to remove all the other tags in the file? – Zhen Sun Nov 21 '14 at 04:39
  • if that header line, the one that start with – Guy Gavriely Nov 21 '14 at 04:48
  • Thanks. Then I don't see an easy way of doing this. Is there a fast way of reading the ``xml`` file as a text file and then find and change? – Zhen Sun Nov 21 '14 at 04:57
  • using `sed` http://stackoverflow.com/questions/5410757/delete-a-line-containing-a-specific-string-using-sed or you have to use python? – Guy Gavriely Nov 21 '14 at 05:07
  • I am not familiar with that ``sed``... Thanks for your help! – Zhen Sun Nov 21 '14 at 05:10
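For readers who would rather stay in Python than use `sed`, the "read it as text, then find and change" idea from the comments could look roughly like the sketch below. It copies the file line by line, drops the repeated `<?xml ...?>` and `<!DOCTYPE ...>` lines, and wraps everything in one new root so that a normal XML parser accepts the result. This rests on assumptions: each of those header lines fits on a single line, as in the sample, and the wrapper tag `patents` and the function names are invented for this example. Dropping the DOCTYPE also means any entities defined in the DTD will no longer resolve.

```python
import xml.etree.ElementTree as ET


def wrap_in_single_root(src_path, dst_path, root_tag="patents"):
    """Copy src to dst, removing per-document headers and adding one root."""
    tag = root_tag.encode("ascii")
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        dst.write(b"<" + tag + b">\n")
        for line in src:
            # Assumption: each declaration and DOCTYPE sits on its own line.
            if line.lstrip().startswith((b"<?xml", b"<!DOCTYPE")):
                continue
            dst.write(line)
        dst.write(b"\n</" + tag + b">\n")


def iter_grants(path, tag="us-patent-grant"):
    """Stream the wrapped file and yield one grant element at a time."""
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == tag:
            yield elem
            elem.clear()  # release the element's children after use


# Hypothetical usage:
# wrap_in_single_root("test.xml", "test_wrapped.xml")
# for grant in iter_grants("test_wrapped.xml"):
#     print(grant.get("lang"))
```

Streaming with `iterparse` avoids building the whole tree at once, which is the main reason to prefer it over a single `parse` call for a file with around 10k blocks.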