
How can I use BeautifulSoup to find all the links in a page pointing to a specific domain?

mechanical_meat
Juanjo Conti

1 Answer


Use SoupStrainer, which restricts parsing to just the tags you care about:

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

# doc is assumed to hold the HTML of the page you are searching

# Find all links: parse only the <a> tags
links = SoupStrainer('a')
all_links = [tag for tag in BeautifulSoup(doc, parseOnlyThese=links)]

# Find only the links whose href matches the target domain
linkstodomain = SoupStrainer('a', href=re.compile('example.com/'))
domain_links = [tag for tag in BeautifulSoup(doc, parseOnlyThese=linkstodomain)]

Edit: modified example from the official documentation.
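
If you are on the newer bs4 package rather than the old BeautifulSoup module, the same idea would look roughly like this (a sketch, not part of the original answer; doc is assumed to hold the page's HTML):

from bs4 import BeautifulSoup, SoupStrainer
import re

# Parse only <a> tags whose href matches the target domain
only_domain_links = SoupStrainer('a', href=re.compile('example.com/'))
soup = BeautifulSoup(doc, 'html.parser', parse_only=only_domain_links)

for tag in soup.find_all('a'):
    print(tag['href'])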

viksit
    I would be more selective with the regex; that one could result in false positives. – Ignacio Vazquez-Abrams Jan 28 '10 at 05:07
  • @Ignacio: right, this example has that caveat; the regex should be as specific as possible to avoid false positives. – viksit Jan 28 '10 at 07:57
  • No, you should typically not try to parse HTML with regex; here is an elaborate explanation: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – subiet Mar 29 '12 at 12:25
  • @subiet, this example is not using regex to parse the HTML. It is being used to limit the results to a known subset by matching the href attribute. – Scone Nov 04 '15 at 18:30
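
As the comments above note, a loose pattern such as example.com/ can also match URLs on unrelated hosts. One way to tighten the check, sketched here against the old BeautifulSoup 3 API (doc is assumed to hold the page's HTML, and points_to is a hypothetical helper), is to compare the parsed hostname instead of regex-matching the raw URL:

from BeautifulSoup import BeautifulSoup, SoupStrainer
from urlparse import urlparse

def points_to(href, domain='example.com'):
    # Match the exact domain or any of its subdomains
    host = urlparse(href).netloc
    return host == domain or host.endswith('.' + domain)

# Parse only <a> tags that actually carry an href attribute
links = SoupStrainer('a', href=True)
domain_links = [tag for tag in BeautifulSoup(doc, parseOnlyThese=links)
                if points_to(tag['href'])]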