
When you use BeautifulSoup to scrape a certain part of a website, you can use

  • soup.find() and soup.findAll() or
  • soup.select().

Is there a difference between the .find() and the .select() methods (e.g. in performance or flexibility), or are they the same?

max
Dieter
  • 22
    `select()` accepts CSS selectors, `find()` does not – Andrea Corbellini Jun 25 '16 at 12:13
  • See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find and https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors – Andrea Corbellini Jun 25 '16 at 12:15
  • 1
    But I don't really understand the difference between them, because for me they can do the same things, and I would like to know the difference. (Actually, I have a small preference for .select().) – Dieter Jun 25 '16 at 12:40
  • select finds multiple instances and returns a list, find finds the first, so they don't do the same thing. `select_one` would be the equivalent to find. – Padraic Cunningham Jun 25 '16 at 13:00
  • @PadraicCunningham - but you can do select("div:nth-of-type(1)") etc. Also, most of the time I just start from an "id" in the HTML page and go down to the element I want. When I use find/findAll I run into trouble, e.g. if you want to do something like soup.select("div[id=foo] > div > div > div[class=fee] > span > span > a") – Dieter Jun 25 '16 at 13:16
  • I almost always use CSS selectors when chaining tags or using `tag.classname`; if looking for a single element without a class I use find. It comes down to the use case and personal preference. As far as flexibility goes I think you know the answer: `soup.select("div[id=foo] > div > div > div[class=fee] > span > span > a")` would look pretty ugly using chained find/find_all – Padraic Cunningham Jun 25 '16 at 13:17
  • @PadraicCunningham - true, but yeah :) Sometimes when I scrape I have to ask: were the people who made that website drunk? E.g. you start from the div with id "container" and then you literally have a tree of divs and classes with the same name. Or even better, they aren't using an "id" at all :s – Dieter Jun 25 '16 at 14:05
  • The only issue with the CSS selectors in bs4 is the very limited support: `nth-of-type` is the only pseudo-class implemented, and chaining attributes like `a[href][src]` is also not supported, nor are many other parts of CSS selectors. But `a[href*=..]` and `a[href^=]` etc. are very handy. – Padraic Cunningham Jun 25 '16 at 14:08
  • 6
    @PadraicCunningham you have very good points in the comments. Why don't you summarize them into an answer? – alecxe Jun 25 '16 at 17:29
  • @alecxe, I will throw something together in a bit, maybe a few timing comparisons would complete the answer – Padraic Cunningham Jun 25 '16 at 18:43
  • 2
    Please don't use `findAll()` anymore, as it doesn't follow Python's naming conventions. There's a `find_all()` method. – BlackJack Apr 05 '20 at 07:56

1 Answer


To summarise the comments:

  • select finds multiple instances and returns a list, while find returns only the first match, so they don't do the same thing. select_one is the equivalent of find.
  • I almost always use CSS selectors when chaining tags or using tag.classname; if looking for a single element without a class I use find. Essentially it comes down to the use case and personal preference.
  • As far as flexibility goes I think you know the answer: soup.select("div[id=foo] > div > div > div[class=fee] > span > span > a") would look pretty ugly using multiple chained find/find_all calls.
  • The only issue with the CSS selectors in bs4 was, at the time of writing, the very limited support: nth-of-type was the only pseudo-class implemented, and chaining attributes like a[href][src] was not supported, nor were many other parts of CSS selectors. (As a later comment notes, selectors are now handled by soupsieve and are much more capable.) Still, things like a[href*=..], a[href^=], a[href$=] etc. are, I think, much nicer than find("a", href=re.compile(....)), but again that is personal preference.
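As a rough illustration of the points above, here is a minimal sketch. The HTML snippet and the variable names are made up for this example (they are not from the question), and it uses the built-in "html.parser" so that lxml is not required:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, invented purely for this illustration.
HTML = """
<div id="foo">
  <div class="fee">
    <span><a href="https://example.com/a.pdf">first</a></span>
    <span><a href="/local/b.html">second</a></span>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() returns the first match (or None); select_one() is its CSS counterpart.
first_find = soup.find("a")
first_select = soup.select_one("a")

# find_all() and select() both return every match.
all_find = soup.find_all("a")
all_select = soup.select("a")

# A nested path as one selector versus chained find() calls.
# Note: ">" means "direct child", while find()/find_all() search all
# descendants, so the two are only equivalent for markup like this one.
nested_select = soup.select("div#foo > div.fee > span > a")
nested_find = soup.find("div", id="foo").find("div", class_="fee").find_all("a")

# Attribute prefix match: CSS ^= versus a compiled regular expression.
https_select = soup.select('a[href^="https"]')
https_find = soup.find_all("a", href=re.compile(r"^https"))

print(first_find == first_select)
print([a.text for a in nested_select], [a.text for a in nested_find])
print([a["href"] for a in https_select], [a["href"] for a in https_find])
```

Each pair should select the same elements here; which spelling reads better is the personal-preference part.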

For performance we can run some tests. I modified the code from an answer here, running it on 800+ html files taken from here. It is not exhaustive, but it should give a clue to the readability of some of the options and to the performance:

The modified functions are:

from bs4 import BeautifulSoup
from glob import iglob


def parse_find(soup):
    # Lookups written with find()/find_all(); the results are discarded,
    # since only the time taken matters here.
    author = soup.find("h4", class_="h12 talk-link__speaker").text
    title = soup.find("h4", class_="h9 m5").text
    date = soup.find("span", class_="meta__val").text.strip()
    soup.find("footer", class_="footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text.split(":")
    soup.find_all("span", class_="talk-transcript__fragment")



def parse_select(soup):
    # Lookups written with select_one()/select() and CSS selectors.
    author = soup.select_one("h4.h12.talk-link__speaker").text
    title = soup.select_one("h4.h9.m5").text
    date = soup.select_one("span.meta__val").text.strip()
    soup.select_one("footer.footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text
    soup.select("span.talk-transcript__fragment")


def test(patt, func):
    # Parse every file matching the glob pattern and run func on the soup.
    for html in iglob(patt):
        with open(html) as f:
            func(BeautifulSoup(f, "lxml"))

Now for the timings:

In [7]: from testing import test, parse_find, parse_select

In [8]: timeit test("./talks/*.html",parse_find)
1 loops, best of 3: 51.9 s per loop

In [9]: timeit test("./talks/*.html",parse_select)
1 loops, best of 3: 32.7 s per loop

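Like I said, this is not exhaustive, but in this test the CSS selectors were clearly faster: 32.7 s versus 51.9 s, about 37% less time. Note that both timings include parsing every file, which is the same work in each run.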
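Since the timings above depend on the bs4 version and on the documents used, it may be worth re-measuring on your own setup. The following is only a sketch under stated assumptions: the HTML is synthetic (it merely reuses two class names from the functions above), the helper names with_find and with_select are hypothetical, and the document is parsed once up front so that only the lookups are timed.

```python
import timeit

from bs4 import BeautifulSoup

# Synthetic document: an assumption standing in for the real talk pages.
ROWS = "".join(
    f'<span class="talk-transcript__fragment">line {i}</span>' for i in range(200)
)
HTML = f'<html><body><h4 class="h9 m5">Title</h4>{ROWS}</body></html>'

# Parse once, outside the timed code, so only the lookups are measured.
soup = BeautifulSoup(HTML, "html.parser")


def with_find():
    # Hypothetical helper: lookups via find()/find_all().
    soup.find("h4", class_="h9 m5")
    return soup.find_all("span", class_="talk-transcript__fragment")


def with_select():
    # Hypothetical helper: the same lookups via CSS selectors.
    soup.select_one("h4.h9.m5")
    return soup.select("span.talk-transcript__fragment")


find_seconds = timeit.timeit(with_find, number=50)
select_seconds = timeit.timeit(with_select, number=50)

print(f"find/find_all:     {find_seconds:.4f} s")
print(f"select/select_one: {select_seconds:.4f} s")
```

No expected numbers are given on purpose: which approach comes out ahead, and by how much, may differ between bs4 releases, parsers and documents.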

Community
Padraic Cunningham
  • 1
    I used select to retrieve instances of a dt class name. Instead of giving me all the results, it always ends up giving 10 instances. Is there a default value to be altered? – Lakshmi Narayanan Mar 12 '19 at 22:27
  • 1
    @LakshmiNarayanan, are you sure the html is not broken or not added dynamically somehow? – Padraic Cunningham Mar 28 '19 at 20:47
  • 1
    I had to issue a scroll down command to load the html completely. Though the page was loaded, the html did not. Thanks for the reply! – Lakshmi Narayanan Mar 29 '19 at 09:49
  • The `find…()` methods don't need the `class_=` or a dictionary with just a `"class"` key, because if there's just a string as the second argument, that's the class value. – BlackJack Apr 05 '20 at 08:06
  • css selectors are awesome even with limited scope of `BeautifulSoup` – imbr Jun 11 '20 at 14:20
  • 1
    The limited nature of CSS selectors is pretty outdated information as selectors are now implemented by soupsieve and are much more powerful. While it is true there is still some limitation (no pseudo-elements and some non-applicable pseudo-classes as they don't apply in non-browser environments), there is a lot that is now supported: https://facelessuser.github.io/soupsieve/selectors/. – facelessuser Dec 09 '20 at 20:09
  • @PadraicCunningham , was wondering if the performance is still valid now? with version 4.9.3? – shawnngtq Apr 04 '21 at 02:26
  • Today select is so powerful. – urek mazino Aug 30 '23 at 14:00