
I'm working on a web scraping project and have run into problems with speed. To try to fix it, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this:

soup = bs4.BeautifulSoup(html, 'lxml')

but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?

alecxe
Adam Hammes
  • `lxml` *is* the default in `bs4`, assuming you have `lxml` installed. So unless you happen to be working with BeautifulSoup3... – roippi Jan 06 '15 at 00:57
  • I am using bs4, but I didn't know how to check which parser I was currently using. Thank you! – Adam Hammes Jan 06 '15 at 01:00
  • Related to https://stackoverflow.com/questions/33511544 containing other details. – bufh Jun 30 '20 at 13:57

2 Answers


According to the Specifying the parser to use documentation page:

The first argument to the BeautifulSoup constructor is a string or an open filehandle–the markup you want parsed. The second argument is how you’d like the markup parsed.

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.

In other words, just installing lxml in the same Python environment makes it the default parser.

Note, though, that explicitly stating a parser is considered best practice. There are differences between parsers that can result in subtle errors, which are difficult to debug if you let BeautifulSoup choose the best parser by itself. You would also have to remember that you need lxml installed. If it is not installed, you would not even notice: BeautifulSoup would just pick the next available parser without throwing any errors.

If you still don't want to specify the parser explicitly, at least leave a note in the project's README/documentation for your future self and for others who use your code, and list lxml in your project requirements alongside beautifulsoup4.
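One way to make a missing lxml noticeable is to check for it once at startup. The following is only a sketch, not part of Beautiful Soup: `installed_parsers` and `require_parser` are hypothetical helper names, and the code assumes each parser name matches the name of the module that provides it (true for `lxml`, `html5lib` and `html.parser`).

```python
import importlib.util

# Ranking taken from the documentation quote above.
PARSER_PREFERENCE = ("lxml", "html5lib", "html.parser")


def installed_parsers(preference=PARSER_PREFERENCE):
    """Return the names in `preference` whose modules can be found."""
    return [name for name in preference
            if importlib.util.find_spec(name) is not None]


def require_parser(name):
    """Hypothetical helper: fail early if the wanted parser is missing."""
    if name not in installed_parsers():
        raise RuntimeError(f"parser module {name!r} is not installed")
    return name
```

Calling `PARSER = require_parser('lxml')` once at the top of the program and then passing `PARSER` as the second argument to `bs4.BeautifulSoup` keeps the choice explicit while still being written in only one place.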

Besides: "Explicit is better than implicit."

alecxe
  • Note, with bs4 version 4.5.1, when specifying the 'lxml' parser and not having it installed, bs4 **does** error out: bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library? – glexey Dec 15 '16 at 23:10

Obviously take a look at the accepted answer first. It is pretty good, and as for this technicality:

but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?

If I understood your question correctly, I can think of two approaches that will save you some keystrokes: define a wrapper function, or create a partial function.

# V1 - define a wrapper function - most straight-forward.
import bs4

def bs_parse(html):
    return bs4.BeautifulSoup(html, 'lxml')
# ...
html = ...
bs_parse(html)

Or if you feel like showing off ...

import bs4
from functools import partial
bs_parse = partial(bs4.BeautifulSoup, features='lxml')
# ...
html = ...
bs_parse(html)
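To illustrate what `partial` does here without needing bs4 installed, the sketch below uses a hypothetical stand-in function, `make_soup`, which only records its arguments. It is not the real `BeautifulSoup` constructor; it just shows how the frozen `features` keyword is stored and how it can still be overridden per call.

```python
from functools import partial


def make_soup(markup, features=None):
    """Hypothetical stand-in for bs4.BeautifulSoup; records its arguments."""
    return {"markup": markup, "features": features}


# partial() returns a callable that remembers features='lxml'.
parse = partial(make_soup, features="lxml")

# The wrapped function and the frozen keyword are kept as attributes.
wrapped = parse.func          # the original make_soup
frozen = parse.keywords       # {'features': 'lxml'}

# The frozen keyword is filled in automatically...
result = parse("<p>hi</p>")

# ...but a keyword given at call time takes precedence.
override = parse("<p>hi</p>", features="html.parser")
```

A wrapper function written with `def` behaves the same for the simple call, so the choice between the two is mostly a matter of taste.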
Leonid
  • Can you add explanation to how `partial` works? Are there any advantages to using it over a wrapper function? – rer Aug 01 '17 at 22:52
  • @r3robertson There is good documentation for partial functions here: https://docs.python.org/2/library/functools.html#functools.partial It looks to me like partial is both slower and more complicated under the hood compared to a wrapper, but once it has been implemented, it is rather easy to use. Partial functions are clean, for lack of a better word, from a mathematical point of view. Other languages have this and it is considered a good use of functional programming by some, but you do pay a price in terms of speed and an extra import. I still use partial functions because they are fun. – Leonid Aug 02 '17 at 03:49
  • @r3robertson By the way, partial functions do not have to create extra overhead in certain compiled languages that make performance a priority, but Python is not one of those languages. C++, however, is: https://vittorioromeo.info/index/blog/cpp17_curry.html – Leonid Nov 05 '20 at 18:56