12

Hi I cannot figure out how to find links which begin with certain text for the life of me. findall('a') works fine, but it's way too much. I just want to make a list of all links that begin with http://www.nhl.com/ice/boxscore.htm?id=

Can anyone help me?

Thank you very much

Jen Scott
  • 679
  • 2
  • 7
  • 17

3 Answers3

16

First set up a test document and open up the parser with BeautifulSoup:

>>> from BeautifulSoup import BeautifulSoup
>>> doc = '<html><body><div><a href="something">yep</a></div><div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div><a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>'
>>> soup = BeautifulSoup(doc)
>>> print soup.prettify()
<html>
 <body>
  <div>
   <a href="something">
    yep
   </a>
  </div>
  <div>
   <a href="http://www.nhl.com/ice/boxscore.htm?id=3">
    somelink
   </a>
  </div>
  <a href="http://www.nhl.com/ice/boxscore.htm?id=7">
   another
  </a>
 </body>
</html>

Next, we can search for all <a> tags with an href attribute starting with http://www.nhl.com/ice/boxscore.htm?id=. You can use a regular expression for it:

>>> import re
>>> soup.findAll('a', href=re.compile('^http://www.nhl.com/ice/boxscore.htm\?id='))
[<a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a>, <a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>]
jterrace
  • 64,866
  • 22
  • 157
  • 202
2

You might not need BeautifulSoup since your search is specific

>>> import re
>>> links = re.findall("http:\/\/www\.nhl\.com\/ice\/boxscore\.htm\?id=.+", str(doc))
pensebien
  • 506
  • 4
  • 16
1

You can find all links and than filter that list to get only links that you need. This will be very fast solution regardless the fact that you filter it afterwards.

listOfAllLinks = soup.findAll('a')
listOfLinksINeed = []

for link in listOfAllLinks:
    if "www.nhl.com" in link:
        listOfLinksINeed.append(link['href'])
Hrvoje
  • 13,566
  • 7
  • 90
  • 104