24

I need to be able to modify every single link in an HTML document. I know that I need to use the SoupStrainer but I'm not 100% positive on how to implement it. If someone could direct me to a good resource or provide a code example, it'd be very much appreciated.

Thanks.

Evan Fosmark

3 Answers

48

Maybe something like this would work? (I don't have a Python interpreter in front of me, unfortunately)

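from bs4 import BeautifulSoup

# Name the parser explicitly so the result is the same on every machine
soup = BeautifulSoup('<p>Blah blah blah <a href="http://google.com">Google</a></p>', 'html.parser')
# href=True skips anchors that have no href attribute
for a in soup.find_all('a', href=True):
    a['href'] = a['href'].replace("google", "mysite")

result = str(soup)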
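If you need this in more than one place, the same loop can be wrapped in a function. This is only a sketch: `rewrite_links` and its `old`/`new` parameters are names made up for illustration, and it assumes bs4 with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup


def rewrite_links(html, old, new):
    """Return html with `old` replaced by `new` in every <a href>.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    # href=True matches only anchors that have an href attribute,
    # so a["href"] cannot raise KeyError on something like <a name="top">
    for a in soup.find_all("a", href=True):
        a["href"] = a["href"].replace(old, new)
    return str(soup)


html = '<p>Blah <a href="http://google.com">Google</a> <a name="top">anchor</a></p>'
result = rewrite_links(html, "google", "mysite")
```

The anchor without an `href` passes through untouched, and the link text is left alone because only the attribute value is rewritten.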
nude
Lusid
32
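# BeautifulSoup 3 import (Python 2); with bs4, use "from bs4 import BeautifulSoup"
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<p>Blah blah blah <a href="http://google.com">Google</a></p>')
for a in soup.findAll('a'):
    a['href'] = a['href'].replace("google", "mysite")
# The parenthesized form works in both Python 2 and Python 3
print(str(soup))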

This is Lusid's solution, but since he didn't have a Python interpreter in front of him, he wasn't able to test it and it had a few errors. I just wanted to post the working version. Thanks, Lusid!

Evan Fosmark
8

I tried this and it worked. Assigning to 'href' directly is easier than using a regexp to match each one:

from bs4 import BeautifulSoup as bs
# htmltext is a string holding the HTML document
soup = bs(htmltext, 'html.parser')
for a in soup.find_all('a'):
    a['href'] = "mysite"

See the bs4 docs for details.
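On the SoupStrainer mentioned in the question: as far as I can tell it is not what you want for editing links in place. A strainer passed as `parse_only` tells the parser to keep only the matching tags, so everything else in the document is discarded. The sketch below shows the difference on a made-up snippet (the HTML and variable names are just for illustration), assuming bs4 with `html.parser`.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '<p>Intro <a href="/a">A</a> and <a href="/b">B</a></p>'

# parse_only keeps only what the strainer matches:
# the <p> wrapper and the text around the links are dropped
only_links = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))
hrefs = [a["href"] for a in only_links.find_all("a", href=True)]

# A full parse keeps the whole document, so edited links
# can be written back out together with everything else
full = BeautifulSoup(html, "html.parser")
for a in full.find_all("a", href=True):
    a["href"] = "mysite"
modified = str(full)
```

So a strainer is handy when you only want to read the links out quickly, while a plain parse plus `find_all('a')` is the way to go when the modified document has to be saved.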

Aziz Alto