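I'm getting strange behaviour when parsing XML files with BeautifulSoup 4: a small file parses fine, but a slightly larger one comes back empty.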
>>> from bs4 import BeautifulSoup
>>> smallfile = 'small.xml'  # approx. 600 bytes
>>> largerfile = 'larger.xml'  # approx. 2300 bytes
>>> len(BeautifulSoup(open(smallfile, 'r'), ['lxml', 'xml']))
1
>>> len(BeautifulSoup(open(largerfile, 'r'), ['lxml', 'xml']))
0
Contents of small.xml:
<?xml version="1.0" encoding="us-ascii"?>
<Catalog>
<CMoverMissile id="HunterSeekerMissile">
<MotionPhases index="1">
<Driver value="Guidance"/>
<Acceleration value="3200"/>
<MaxSpeed value="2.9531"/>
<Clearance value="0.5"/>
<ClearanceLookahead value="3"/>
<Outro value="-4.5,-4.25"/>
<YawPitchRoll value="MAX"/>
</MotionPhases>
<MotionPhases index="2">
<Driver value="Guidance"/>
<Acceleration value="4"/>
<MaxSpeed value="2.9531"/>
<Clearance value="0.5"/>
<ClearanceLookahead value="3"/>
<Outro value="-2.25,-2"/>
<YawPitchRoll value="MAX"/>
</MotionPhases>
</CMoverMissile>
</Catalog>
larger.xml is simply the smaller file padded with spaces and newlines (in between the last two tags, in case it's relevant), i.e. the structure and contents of the XML should be identical in both cases.
On rare occasions, processing largerfile actually yields a partial result in which only a small portion of the XML has been parsed. I can't reliably recreate the circumstances.
Since BeautifulSoup uses lxml, I tested whether lxml could handle the files on its own. lxml appeared to be able to parse both files.
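For anyone trying to reproduce this, a padded file like larger.xml could be generated along these lines. This is only a minimal sketch: `pad_between_last_two_tags` is a hypothetical helper, the padding size is an assumption, and the inline document is a shortened stand-in for small.xml.

```python
def pad_between_last_two_tags(xml_text, pad_chars=1700):
    """Insert whitespace-only padding just before the final closing tag."""
    cut = xml_text.rindex('</')  # start of the last closing tag
    # Lines of 79 spaces plus a newline, so the padding is spaces and newlines only.
    padding = (' ' * 79 + '\n') * (pad_chars // 80)
    return xml_text[:cut] + padding + xml_text[cut:]

# Shortened stand-in for the contents of small.xml (assumption).
small = (
    '<?xml version="1.0" encoding="us-ascii"?>\n'
    '<Catalog>\n'
    '<CMoverMissile id="HunterSeekerMissile">\n'
    '</CMoverMissile>\n'
    '</Catalog>\n'
)
larger = pad_between_last_two_tags(small)
print(len(small), len(larger))
```

The padded text differs from the original only in whitespace, so any XML parser should see the same elements and attributes in both.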
>>> from lxml import etree
>>> tree = etree.parse(smallfile)
>>> len(etree.tostring(tree))
547
>>> tree = etree.parse(largerfile)
>>> len(etree.tostring(tree))
2294
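Comparing serialized lengths only shows that both files were read. To check that the two trees really have the same structure, something like the following could be used. It is a sketch under assumptions: it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so that it runs without lxml installed, `same_structure` is a hypothetical helper, and the inline documents are shortened versions of the real files.

```python
import xml.etree.ElementTree as ET

def same_structure(a, b):
    """Compare tags, attributes and children, ignoring whitespace-only text."""
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    if (a.text or '').strip() != (b.text or '').strip():
        return False
    if len(a) != len(b):  # number of child elements
        return False
    return all(same_structure(x, y) for x, y in zip(a, b))

# Shortened stand-ins for small.xml and larger.xml (assumption).
small_doc = (
    '<Catalog><CMoverMissile id="HunterSeekerMissile">'
    '<Driver value="Guidance"/></CMoverMissile></Catalog>'
)
# Pad with spaces and newlines between the last two closing tags.
larger_doc = small_doc.replace('</Catalog>', ' \n' * 500 + '</Catalog>')

print(same_structure(ET.fromstring(small_doc), ET.fromstring(larger_doc)))
```

If this prints True for the real files as well, the padding itself is not changing what the parser sees, which would point at how the file is handed to BeautifulSoup rather than at the XML.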
I'm using
- Netbook with 1 GB RAM
- Windows 7
- lxml 2.3 (I had some trouble installing this; I hope a dodgy installation isn't causing the problem)
- Beautiful Soup 4.0.1
- Python 3.2 (I also have Python 2.7.x installed, but have been using 3.2 for this code)
What could be preventing the larger file from being processed properly? My current suspicion is some weird memory issue, since the file size seems to make a difference, perhaps in conjunction with some bug in how BeautifulSoup 4 interacts with lxml.
Edit: to better illustrate...
>>> smallsoup = BeautifulSoup(open(smallfile, 'r'), ['lxml', 'xml'])
>>> smallsoup
<?xml version="1.0" encoding="utf-8"?>
<Catalog>
<CMoverMissile id="HunterSeekerMissile">
<MotionPhases index="1">
<Driver value="Guidance"/>
<Acceleration value="3200"/>
<MaxSpeed value="2.9531"/>
<Clearance value="0.5"/>
<ClearanceLookahead value="3"/>
<Outro value="-4.5,-4.25"/>
<YawPitchRoll value="MAX"/>
</MotionPhases>
<MotionPhases index="2">
<Driver value="Guidance"/>
<Acceleration value="4"/>
<MaxSpeed value="2.9531"/>
<Clearance value="0.5"/>
<ClearanceLookahead value="3"/>
<Outro value="-2.25,-2"/>
<YawPitchRoll value="MAX"/>
</MotionPhases>
</CMoverMissile>
</Catalog>
>>> largersoup = BeautifulSoup(open(largerfile, 'r'), ['lxml', 'xml'])
>>> largersoup
<?xml version="1.0" encoding="utf-8"?>
>>>
>>> repr(open(largerfile, 'r').read())
'\'<?xml version="1.0" encoding="us-ascii"?>\\n<Catalog>\\n<CMoverMissile id="HunterSeekerMissile">\\n<MotionPhases index="1">\\n<Driver value="Guidance"/>\\n<Acceleration value="3200"/>\\n<MaxSpeed value="2.9531"/>\\n<Clearance value="0.5"/>\\n<ClearanceLookahead value="3"/>\\n<Outro value="-4.5,-4.25"/>\\n<YawPitchRoll value="MAX"/>\\n</MotionPhases>\\n<MotionPhases index="2">\\n<Driver value="Guidance"/>\\n<Acceleration value="4"/>\\n<MaxSpeed value="2.9531"/>\\n<Clearance value="0.5"/>\\n<ClearanceLookahead value="3"/>\\n<Outro value="-2.25,-2"/>\\n<YawPitchRoll value="MAX"/>\\n</MotionPhases>\\n</CMoverMissile> </Catalog>\''
Note: there are many spaces and newlines (which probably won't show up in the browser) between </CMoverMissile> and </Catalog> in the output above.