
I'm getting strange behaviour with this:

>>> from bs4 import BeautifulSoup

>>> smallfile = 'small.xml'      # approx. 600 bytes
>>> largerfile = 'larger.xml'    # approx. 2300 bytes
>>> len(BeautifulSoup(open(smallfile, 'r'), ['lxml', 'xml']))
1
>>> len(BeautifulSoup(open(largerfile, 'r'), ['lxml', 'xml']))
0

Contents of small.xml:

<?xml version="1.0" encoding="us-ascii"?>
<Catalog>
<CMoverMissile id="HunterSeekerMissile">
<MotionPhases index="1">
<Driver value="Guidance"/>
<Acceleration value="3200"/>
<MaxSpeed value="2.9531"/>
<Clearance value="0.5"/>
<ClearanceLookahead value="3"/>
<Outro value="-4.5,-4.25"/>
<YawPitchRoll value="MAX"/>
</MotionPhases>
<MotionPhases index="2">
<Driver value="Guidance"/>
<Acceleration value="4"/>
<MaxSpeed value="2.9531"/>
<Clearance value="0.5"/>
<ClearanceLookahead value="3"/>
<Outro value="-2.25,-2"/>
<YawPitchRoll value="MAX"/>
</MotionPhases>
</CMoverMissile>
</Catalog>

largerfile is simply the smaller file, but padded with spaces and newlines (in between the last two tags, in case it's relevant), i.e. the structure and contents of the XML should be identical in both cases.

On rare occasions, processing largerfile actually yields a partial result where only a small portion of the XML has been parsed. I can't seem to reliably recreate the circumstances.

Since BeautifulSoup uses lxml, I tested to see if lxml could handle the files independently. lxml appeared to be able to parse both files.

>>> from lxml import etree
>>> tree = etree.parse(smallfile)
>>> len(etree.tostring(tree))
547
>>> tree = etree.parse(largerfile)
>>> len(etree.tostring(tree))
2294
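Whitespace between the last two tags is legal XML, so padding alone shouldn't change the parsed structure. That can be sanity-checked with the standard library's ElementTree as well (a sketch, not part of the original post; the padded string here mimics larger.xml rather than reading the real file):

```python
import xml.etree.ElementTree as ET

# A minimal stand-in for small.xml.
small = '<Catalog><CMoverMissile id="HunterSeekerMissile"/></Catalog>'
# Pad with spaces and newlines between the last two tags, as in larger.xml.
larger = small.replace('</Catalog>', ' ' * 1700 + '\n\n</Catalog>')

small_root = ET.fromstring(small)
larger_root = ET.fromstring(larger)

# Both parse to the same structure: one child element under <Catalog>.
assert small_root.tag == larger_root.tag == 'Catalog'
assert len(small_root) == len(larger_root) == 1
```

So any well-behaved XML parser should treat the two files identically; the padding only changes the byte count.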

I'm using

  • netbook with 1gb ram
  • windows 7
  • lxml 2.3 (had some trouble installing this, I hope a dodgy installation isn't causing the problem)
  • beautiful soup 4.0.1
  • python 3.2 (I also have python 2.7x installed, but have been using 3.2 for this code)

What could be preventing the larger file from being processed properly? My current suspicion is some weird memory issue, since the file size seems to make a difference, perhaps in conjunction with some bug in how BeautifulSoup 4 interacts with lxml.

Edit: to better illustrate...

>>> smallsoup = BeautifulSoup(open(smallfile, 'r'), ['lxml', 'xml'])
>>> smallsoup
<?xml version="1.0" encoding="utf-8"?>
<Catalog>
<CMoverMissile id="HunterSeekerMissile">
<MotionPhases index="1">
<Driver value="Guidance"/>
<Acceleration value="3200"/>
<MaxSpeed value="2.9531"/>
<Clearance value="0.5"/>
<ClearanceLookahead value="3"/>
<Outro value="-4.5,-4.25"/>
<YawPitchRoll value="MAX"/>
</MotionPhases>
<MotionPhases index="2">
<Driver value="Guidance"/>
<Acceleration value="4"/>
<MaxSpeed value="2.9531"/>
<Clearance value="0.5"/>
<ClearanceLookahead value="3"/>
<Outro value="-2.25,-2"/>
<YawPitchRoll value="MAX"/>
</MotionPhases>
</CMoverMissile>
</Catalog>
>>> largersoup = BeautifulSoup(open(largerfile, 'r'), ['lxml', 'xml'])
>>> largersoup
<?xml version="1.0" encoding="utf-8"?>

>>>

>>> repr(open(largerfile, 'r').read())
'\'<?xml version="1.0" encoding="us-ascii"?>\\n<Catalog>\\n<CMoverMissile id="HunterSeekerMissile">\\n<MotionPhases index="1">\\n<Driver value="Guidance"/>\\n<Acceleration value="3200"/>\\n<MaxSpeed value="2.9531"/>\\n<Clearance value="0.5"/>\\n<ClearanceLookahead value="3"/>\\n<Outro value="-4.5,-4.25"/>\\n<YawPitchRoll value="MAX"/>\\n</MotionPhases>\\n<MotionPhases index="2">\\n<Driver value="Guidance"/>\\n<Acceleration value="4"/>\\n<MaxSpeed value="2.9531"/>\\n<Clearance value="0.5"/>\\n<ClearanceLookahead value="3"/>\\n<Outro value="-2.25,-2"/>\\n<YawPitchRoll value="MAX"/>\\n</MotionPhases>\\n</CMoverMissile>                    </Catalog>\''

note: there are many spaces (which probably won't show up in the browser) between `</CMoverMissile>` and `</Catalog>`

chobok
  • 1200 bytes is still tiny so I doubt that's the problem. – Kien Truong Mar 23 '12 at 10:41
  • can you post the content of the xml files? – alonisser Mar 23 '12 at 10:57
  • my suspicion (which I didn't check because of laziness) is that running len on a BeautifulSoup object doesn't return bytes but the number of nodes or something similar. Certainly this isn't a memory issue – alonisser Mar 23 '12 at 10:58
  • I think what BS4 calls `'lxml'` is the lxml HTML parser, as BeautifulSoup is primarily used for HTML. Try it with just `'xml'` if you only want xml parsed. http://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-xml – Thomas K Mar 23 '12 at 12:52
  • @alonisser posted the contents of the xml (after changing them so that the larger file is the same as the smaller, but with white space padding). – chobok Mar 23 '12 at 13:38
  • @Thomas K BeautifulSoup4 currently uses lxml as its only xml parser. Using just "xml" leads to Beautiful Soup using lxml anyway. Nevertheless I've tried both BeautifulSoup(markup, ["lxml", "xml"]) and BeautifulSoup(markup, "xml"). Both give me the same results. http://www.crummy.com/software/BeautifulSoup/bs4/doc/#parser-installation – chobok Mar 23 '12 at 13:48
  • what is `repr(open('largefile', 'r').read())`? – jfs Mar 23 '12 at 14:28
  • possible duplicate at http://stackoverflow.com/questions/9622474/beautifulsoup-xml-only-printing-first-line – chobok Mar 23 '12 at 14:36
  • @J.F.Sebastian posted the repr() output in the edit – chobok Mar 23 '12 at 14:44
  • I can reproduce it on `bs4 4.0.0b8`. It seems the last line is too long for bs4 (both `cElementTree` and `lxml` parse the bigfile without a problem). [Here's small example that demonstrate the problem](http://ideone.com/v5VeK) – jfs Mar 23 '12 at 15:42
  • @J.F.Sebastian that was really helpful. Turns out the author of BS4 has acknowledged the bug. I'll post this as the answer to the question. – chobok Mar 24 '12 at 03:20

3 Answers


`len(soup)` returns `len(soup.contents)`, i.e. the number of immediate children (in this case a single child, `<Catalog>`).

BeautifulSoup fails to parse `largerfile`, so `len(soup) == 0`.
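In other words, len() here counts child nodes, not characters. The same distinction holds for the standard library's ElementTree (a sketch for illustration, not from the original post):

```python
import xml.etree.ElementTree as ET

doc = '<Catalog><CMoverMissile id="a"/><CMoverMissile id="b"/></Catalog>'
root = ET.fromstring(doc)

# len() on an element counts its immediate children...
assert len(root) == 2
# ...while the string length is something else entirely.
assert len(doc) == 65
```

A result of 0 children therefore means the parse produced an empty document, not a short one.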

jfs
  • my question is really "What could be preventing the larger file from being processed properly?" – chobok Mar 23 '12 at 15:13
  • I think it is processed properly and you are checking the wrong thing, but you can open a more exact new question with the specific problem, because the len isn't the problem – alonisser Mar 23 '12 at 18:56
  • @alonisser as mentioned in the answer I just posted, the author of bs4/BeautifulSoup recognises that there is a bug in bs4/lxml. This solves my question. Would reposting a better formatted question still be useful to you or the Stackoverflow community? – chobok Mar 24 '12 at 06:59

It turns out the problem lies somewhere in BS4/lxml. The author of BS4 (BeautifulSoup) acknowledges the problem (https://groups.google.com/group/beautifulsoup/browse_thread/thread/24a82209aca4c083):

"Apparently BS4+lxml won't parse an XML document that's longer than about 550 bytes. I only tested it with small documents. The BS4 handler code is not even being called, which makes it hard to debug, but it's not a guarantee the problem is on the lxml side."

A slight tweak to J.F.Sebastian's helpful code sample gives the size at which the code fails:

>>> from bs4 import BeautifulSoup
>>> from itertools import count
>>> for n in count():
...     s = "<a>" + " " * n + "</a>"
...     nchildren = len(BeautifulSoup(s, 'xml'))
...     if nchildren != 1:  # broken
...         print(len(s))
...         break
...
1092

The code processes the XML as expected for strings of up to 1091 characters; a string of 1092 characters or more usually fails.

UPDATE: BeautifulSoup 4.0.2 has been released with a workaround:

"This new version works around what appears to be a bug in lxml's XMLParser.feed(), which was preventing BS from parsing XML documents larger than about 512-1024 characters. "

chobok
  • I'll keep updates in comments, until the problem is resolved. Update: BS4 Author: "It looks like a bug in lxml. https://bugs.launchpad.net/lxml/+bug/963936 I've put a workaround in bzr. I'll probably release a 4.0.2 on Monday with the workaround, but I want to see what the lxml developers say. " https://groups.google.com/group/beautifulsoup/browse_thread/thread/24a82209aca4c083 – chobok Mar 25 '12 at 06:50
  • Seems still broken in my case. I have an up-to-date BS4, up-to-date lxml lib and it still fails for larger input. I can see the difference between older lxml versions (pre-2.3.6), where the older ones show only a few lines, and the contemporary versions (tried both 3.0.1 and 2.3.6+), where there is no output at all except for part. How did you guys solve it? My XML is valid but not parsed :( – xaralis Oct 15 '12 at 13:28

After checking, it seems that running `len` on a BeautifulSoup object doesn't return the byte length but some other kind of property (node depth or something else; not quite sure).

alonisser
  • His example isn't running `len()` on the object, but on the `etree.tostring()` of that object, so it actually is number of unicode characters in the serialization (which isn't quite number of bytes, but close enough). – Charles Duffy Mar 23 '12 at 14:09
  • I actually ran len on the BeautifulSoup objects as well. The aim was just to illustrate that the xml hadn't been parsed for the larger file. – chobok Mar 23 '12 at 14:26
  • @CharlesDuffy actually, he is running len on the object and then comparing it to running len on etree.tostring(), which are two very different things. – alonisser Mar 23 '12 at 18:57
  • @alonisser I ran len() on the BS4 objects to test whether the file had been parsed. I expect len(BeautifulSoup(open(xmlfile, 'r'), ['lxml', 'xml'])) to return 1 if BeautifulSoup was successful in parsing the file. In retrospect there were probably better ways to illustrate what I was trying to achieve. – chobok Mar 24 '12 at 06:56