BeautifulSoup 4 fails to read all of the HTML

Question

I need to fetch a web page. When I fetch it with urllib and print the contents, I see the full content length. But after I parse the HTML with bs4, at least 5 div blocks are missing from the parsed result. When I parse the same HTML with the older BeautifulSoup package, I see the full content and the divs are included. I don't know where the mistake is, but as far as I can tell, bs4 drops some of the divs I need on its own. How can I solve this issue? Here is my sample:

import urllib

# Old BeautifulSoup package: this one does not remove any necessary parts. This is okay.

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib.urlopen("http://example").read())


# bs4: this one removes some necessary parts. This is not okay.

from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib.urlopen("http://example").read())

Thank you.

This depends on the (pluggable) parser used; do you have `lxml` installed? — Martijn Pieters, Sep 04 '13 at 08:34
Also see [Beautiful Soup findAll doen't find them all](http://stackoverflow.com/q/16322862) and [BeautifulSoup fails to parse long view state](http://stackoverflow.com/q/18150786) — Martijn Pieters, Sep 04 '13 at 08:37
In summary, I suspect that the `lxml` dependencies on your system are somewhat broken. — Martijn Pieters, Sep 04 '13 at 08:38
@MartijnPieters Should I use html.parser for that purpose? I don't have lxml installed on my system; should I install it? — user2682790, Sep 04 '13 at 08:46
Perhaps; it could be that the `HTMLLib` parser is at fault here. See the last link, where I test 3 different parsers in a loop. `html5lib` might be an idea as well. — Martijn Pieters, Sep 04 '13 at 08:48
@MartijnPieters thank you very much, I will try all the options — user2682790, Sep 04 '13 at 08:49
I had the same problem; take a look at http://stackoverflow.com/questions/15290991/beautifulsoup-not-reading-ill-formed-html — rajpy, Sep 04 '13 at 08:59
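
The suggestion in the comments to try several parsers can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual test: it assumes Python 3 and bs4, the names `count_divs_per_parser` and `SAMPLE_HTML` are made up for this example, and the sample markup stands in for the page you actually fetched. The second argument to `BeautifulSoup` selects the parser explicitly, and parsers that are not installed are reported as `None` instead of raising an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup; substitute the HTML fetched from the real page.
SAMPLE_HTML = """
<html><body>
  <div id="a">one</div>
  <div id="b">two</div>
  <div id="c">three</div>
</body></html>
"""


def count_divs_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the number of <div> tags it finds.

    A parser that is not installed maps to None.
    """
    counts = {}
    for name in parsers:
        try:
            # Name the parser explicitly instead of letting bs4 pick one.
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is unavailable.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("div"))
    return counts


if __name__ == "__main__":
    for name, count in count_divs_per_parser(SAMPLE_HTML).items():
        print(name, "->", "not installed" if count is None else count)
```

If the counts differ between parsers on the real page, the markup is probably malformed in a way that each parser repairs differently, and passing the name of the parser that keeps the divs you need is one possible workaround.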
