I am trying to parse a large XML file downloaded from Google using BeautifulSoup 4 (BS4). However, the file is a concatenation of many XML documents,
each with its own root, so the xml parser only parses the first block.
I load the file using the following command:
from bs4 import BeautifulSoup
xml = BeautifulSoup(open("test.xml"), "xml")  # pass the file object, not the file name string
The test.xml file looks like this; it has many roots:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" .....>
A LOT of information
</us-patent-grant>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-24.dtd" [ ]>
<us-patent-grant lang="EN" .....>
A LOT of information
</us-patent-grant>
.......
The html parser can read the full file. However, a typical file of this kind contains over 10k roots, so reading it with the html
parser is slow and uses up all my memory. Is there a way to get around this problem?
Any help is appreciated.
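
One possible workaround, as a minimal sketch rather than a definitive solution: split the file into one string per document and parse each string on its own, so only one document is held in memory at a time. This assumes every document starts with its own `<?xml ...?>` declaration at the beginning of a line, as in the sample above. The names `iter_documents` and `parse_file` are hypothetical helpers, not part of BS4.

```python
from typing import Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one string per XML document from a concatenated stream of lines.

    Assumption: each document begins with an "<?xml" declaration on its own line.
    """
    chunk = []
    for line in lines:
        # A new declaration marks the start of the next document,
        # so emit whatever has been collected so far.
        if line.lstrip().startswith("<?xml") and chunk:
            yield "".join(chunk)
            chunk = []
        # Skip blank lines that appear before a document has started.
        if chunk or line.strip():
            chunk.append(line)
    if chunk:
        yield "".join(chunk)


def parse_file(path: str):
    """Lazily yield one BeautifulSoup object per document in the file."""
    # Imported here so the splitter above works without bs4 installed.
    # The "xml" parser also requires lxml.
    from bs4 import BeautifulSoup

    with open(path, encoding="utf-8") as handle:
        for text in iter_documents(handle):
            yield BeautifulSoup(text, "xml")
```

Looping with `for soup in parse_file("test.xml"):` then processes one `us-patent-grant` at a time. Because both functions are generators, earlier documents can be garbage-collected as long as no references to them are kept.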