Find specific link w/ beautifulsoup

Question

Hi, I can't for the life of me figure out how to find links that begin with certain text. findAll('a') works fine, but it returns far too much. I just want to make a list of all links that begin with http://www.nhl.com/ice/boxscore.htm?id=

Can anyone help me?

Thank you very much

score 16 · Accepted Answer · answered Oct 11 '11 at 21:35

First set up a test document and open up the parser with BeautifulSoup:

>>> from BeautifulSoup import BeautifulSoup
>>> doc = '<html><body><div><a href="something">yep</a></div><div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div><a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>'
>>> soup = BeautifulSoup(doc)
>>> print soup.prettify()
<html>
 <body>
  <div>
   <a href="something">
    yep
   </a>
  </div>
  <div>
   <a href="http://www.nhl.com/ice/boxscore.htm?id=3">
    somelink
   </a>
  </div>
  <a href="http://www.nhl.com/ice/boxscore.htm?id=7">
   another
  </a>
 </body>
</html>

Next, we can search for all <a> tags with an href attribute starting with http://www.nhl.com/ice/boxscore.htm?id=. You can use a regular expression for it:

>>> import re
>>> soup.findAll('a', href=re.compile(r'^http://www\.nhl\.com/ice/boxscore\.htm\?id='))
[<a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a>, <a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>]
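
The session above uses BeautifulSoup 3 on Python 2. As a rough sketch of the same search with the newer bs4 package on Python 3 (this assumes bs4 is installed and relies on the built-in html.parser; the names PREFIX, by_regex and by_css are only for illustration), either a compiled regex or a CSS attribute-prefix selector should work:

```python
import re
from bs4 import BeautifulSoup

# Illustrative constant: the prefix the question asks about.
PREFIX = "http://www.nhl.com/ice/boxscore.htm?id="

doc = (
    '<html><body><div><a href="something">yep</a></div>'
    '<div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div>'
    '<a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>'
)
soup = BeautifulSoup(doc, "html.parser")

# re.escape turns the literal prefix into a safe pattern (the dots and ? are escaped).
by_regex = soup.find_all("a", href=re.compile("^" + re.escape(PREFIX)))

# CSS attribute selector: matches <a> tags whose href begins with PREFIX.
by_css = soup.select('a[href^="%s"]' % PREFIX)

hrefs = [a["href"] for a in by_regex]
```

Using re.escape avoids hand-escaping the dots and the question mark, which is easy to get subtly wrong.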

Wow thanks. I guess the beautifulsoup documentation presupposes fluency in regex. Thank you for showing me that — Jen Scott, Oct 11 '11 at 21:41
Just use: ``kwargs={'class':foo}`` then ``soup.findAll('a', **kwargs)`` — jterrace, Dec 09 '13 at 23:39
just found that we can use class_ to refer to class http://stackoverflow.com/questions/13794532/python-regular-expression-for-beautiful-soup — Wajih, Dec 09 '13 at 23:58

score 2 · Answer 2 · pensebien · answered May 02 '16 at 16:05 (edited 16:26)

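You might not need BeautifulSoup at all, since your search is this specific. A regular expression over the raw HTML is enough, as long as the match stops at the end of the URL (a trailing .+ would run on to the end of the line):

>>> import re
>>> links = re.findall(r'http://www\.nhl\.com/ice/boxscore\.htm\?id=[^"\s<>]+', doc)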
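
A quick check of the regex-only approach against the test document from the accepted answer. This is a minimal sketch using only the standard library; it assumes the URLs sit inside double-quoted attributes, and the variable names are illustrative. A greedy .+ after id= is included for comparison, since it keeps matching past the closing quote:

```python
import re

doc = (
    '<html><body><div><a href="something">yep</a></div>'
    '<div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div>'
    '<a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>'
)

# Stop at the first double quote, whitespace, or angle bracket after "id=".
pattern = re.compile(r'http://www\.nhl\.com/ice/boxscore\.htm\?id=[^"\s<>]+')
links = pattern.findall(doc)

# For comparison: greedy .+ swallows everything up to the end of the line.
greedy = re.findall(r'http://www\.nhl\.com/ice/boxscore\.htm\?id=.+', doc)
```

Here links holds the two box score URLs, while greedy holds a single string that starts at the first URL and runs through the rest of the markup.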


score 1 · Answer 3 · answered Mar 30 '21 at 12:52

You can find all links and then filter that list to keep only the ones you need. For a single page the extra pass over the list costs very little, even though the filtering happens afterwards.

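prefix = "http://www.nhl.com/ice/boxscore.htm?id="
listOfAllLinks = soup.findAll('a')
listOfLinksINeed = []

for link in listOfAllLinks:
    # Test the href attribute itself; `in link` would search the tag's children.
    href = link.get('href', '')
    if href.startswith(prefix):
        listOfLinksINeed.append(href)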
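
The same prefix check can also be handed to the search itself rather than applied afterwards. A hedged sketch with bs4 on Python 3 (assuming bs4 is installed; PREFIX and is_boxscore are hypothetical names chosen for this example), where the attribute filter is a plain function:

```python
from bs4 import BeautifulSoup

# Illustrative constant: the prefix the question asks about.
PREFIX = "http://www.nhl.com/ice/boxscore.htm?id="

doc = (
    '<html><body><div><a href="something">yep</a></div>'
    '<div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div>'
    '<a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>'
)
soup = BeautifulSoup(doc, "html.parser")


def is_boxscore(href):
    # bs4 passes None when the tag has no href attribute.
    return href is not None and href.startswith(PREFIX)


# The function is called with each <a> tag's href value.
links = [a["href"] for a in soup.find_all("a", href=is_boxscore)]
```

This keeps the matching rule in ordinary Python, which may be easier to read than a regex when the rule is just a fixed prefix.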
