
I noticed a really annoying bug: BeautifulSoup4 (package: bs4) often finds fewer tags than the previous version (package: BeautifulSoup).

Here's a reproducible instance of the issue:

import requests
import bs4
import BeautifulSoup

r = requests.get('http://wordpress.org/download/release-archive/')
s4 = bs4.BeautifulSoup(r.text)
s3 = BeautifulSoup.BeautifulSoup(r.text)

print('With BeautifulSoup 4 : {}'.format(len(s4.findAll('a'))))
print('With BeautifulSoup 3 : {}'.format(len(s3.findAll('a'))))

Output:

With BeautifulSoup 4 : 557
With BeautifulSoup 3 : 1701

The difference is not minor as you can see.

Here are the exact versions of the modules in case someone is wondering:

In [20]: bs4.__version__
Out[20]: '4.2.1'

In [21]: BeautifulSoup.__version__
Out[21]: '3.2.1'
halflings
  • I get `1701` for both. Perhaps try using `find_all` for `s4`, as that should be used for `bs4` – TerryA Jul 17 '13 at 11:42
  • 1
    BS4 uses a pluggable parser, and will switch to a 'better' parser if it is installed. If you have `lxml` installed for example results may well differ. Use the [`diagnose()` utility](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#troubleshooting) of BS4 to see why you see so few results. – Martijn Pieters Jul 17 '13 at 11:44
  • @Haidro: `.findAll()` is an alias for `.find_all()`; *the same code* is run either way. – Martijn Pieters Jul 17 '13 at 11:44
  • @MartijnPieters Really? Then why [did they bother](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names) creating `find_all` and the others? – TerryA Jul 17 '13 at 11:46
  • @Haidro: BS4 *renamed* the functions to be PEP 8 compliant, but retained backward compatibility by supplying aliases. A future version will probably drop those aliases. – Martijn Pieters Jul 17 '13 at 11:46
  • @MartijnPieters Ah, having a second look at the [specific section](http://www.python.org/dev/peps/pep-0008/#function-names), it does actually mention backward compatibility :p. Thanks! – TerryA Jul 17 '13 at 11:48
  • @MartijnPieters : Here's what diagnose gives me: 'I noticed that html5lib is not installed. Installing it may help. Found lxml version 3.1.0.0 Trying to parse your markup with html.parser Here's what html.parser did with the markup:' And then the whole HTML code. Nothing noticeable. I'm gonna try to install html5lib and redo the test. EDIT: Oh well... That didn't help. Still '557' links. – halflings Jul 17 '13 at 11:51
  • @halflings: I cannot reproduce your output *at all*; I swapped between the parsers and all give me 1701, except for `html5lib` which gave me 0 for some as yet to be fathomed reason. – Martijn Pieters Jul 17 '13 at 11:53
  • 1
    @halflings: Upgraded from BS 4.2.0 to 4.2.1. Now `html5lib` gives me 1701 as well, but still cannot reproduce your problem. – Martijn Pieters Jul 17 '13 at 11:55
  • 1
    Non repro on BS 4.2.1, 4.2.0, 4.1.3 and 3.2.1 with and without html5lib - all 1701 – Jon Clements Jul 17 '13 at 11:56
  • Well, if that can help here's a screenshot of what I get: http://i.imgur.com/obPCqnr.png – halflings Jul 17 '13 at 11:58
  • @halflings: Did you try [specifying different parsers](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) when creating the soup? `bs4.BeautifulSoup(r.text, 'html.parser')`, etc. – Martijn Pieters Jul 17 '13 at 11:59
  • @halflings: And try to upgrade `lxml` to 3.2.1, the latest release. – Martijn Pieters Jul 17 '13 at 12:00
  • @halflings: As for the `diagnose()` output, BeautifulSoup shows you the pretty-printed tree after parsing; paste the outputs *per parser* into a separate text file, then use `diff` or similar tools to see what differences there are in the output trees. If one parser gives you 1701 links but another gives you only 557 then diff between the outputs of those two to see where the mis-behaving parser fails. – Martijn Pieters Jul 17 '13 at 12:02
  • Hallelujah :-). Specifying a different parser (`html.parser` for instance) solves the problem... I tried upgrading lxml but that didn't help. Could you post that as an answer, @MartijnPieters? (if you find it satisfactory enough) – halflings Jul 17 '13 at 12:03
  • @halflings: There you go; I included details on the lxml dependencies too; it could be you have an older `libxml2` version that has to take the ultimate blame for the parse failure. – Martijn Pieters Jul 17 '13 at 12:09

1 Answer


You have lxml installed, which means that BeautifulSoup 4 will use that parser over the standard-library html.parser option.

You can upgrade lxml to 3.2.1 (which for me returns 1701 results for your test page); lxml itself uses libxml2 and libxslt, which may also be to blame here, so you may have to upgrade those instead or as well. See the lxml requirements page; currently libxml2 2.7.8 or newer is recommended.
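To see which lxml and libxml2 versions are actually in play, you can query lxml's public version tuples. A small sketch (`report_lxml_versions` is just an illustrative helper name, not part of any library):

```python
def report_lxml_versions():
    """Report the lxml/libxml2 versions that bs4's lxml parser would use."""
    try:
        from lxml import etree
    except ImportError:
        # Without lxml installed, bs4 falls back to the stdlib html.parser.
        return "lxml is not installed; bs4 falls back to html.parser"
    return "lxml {}, libxml2 {}".format(
        ".".join(map(str, etree.LXML_VERSION)),
        ".".join(map(str, etree.LIBXML_VERSION)),
    )

print(report_lxml_versions())
```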

Or explicitly specify the other parser when parsing the soup:

s4 = bs4.BeautifulSoup(r.text, 'html.parser')
Martijn Pieters
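As a parser-independent sanity check, you can count `<a>` start tags with the standard library's `html.parser` directly and compare that against whatever BeautifulSoup reports. A Python 3 sketch (`AnchorCounter` is a hypothetical helper, not part of BeautifulSoup):

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Count <a> start tags without any third-party parser."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Tag names are lower-cased by HTMLParser.
        if tag == "a":
            self.count += 1


counter = AnchorCounter()
counter.feed('<p><a href="#x">one</a> <a href="#y">two</a></p>')
print(counter.count)  # → 2
```

If this count disagrees significantly with a BeautifulSoup parser's result (as with the 557 vs. 1701 discrepancy above), that points at the underlying parser rather than at your query.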