1

I'd like to replace only the value of the href attribute in an a element. Can this be done with a regex?

Example

<a href="tel:8196887620" value="+18196887620" target="_blank">8196887620</a>

I imagine you'd need one regex to match <a ... >, another for the href attribute, and a third to grab only the data between the quotes. Is that correct, or is there a better way to do this? Maybe a Python library?

David

3 Answers

2

Use BeautifulSoup to get the href of every anchor tag:

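        # Python 3 with the bs4 package (pip install beautifulsoup4)
        from urllib.request import urlopen
        from bs4 import BeautifulSoup

        url = input('Enter - ')
        html = urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')
        # Print the href of every <a> element (None if it has no href)
        for tag in soup('a'):
            print(tag.get('href'))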
Benjamin
2

Thanks all. BeautifulSoup seems the way to go.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Blah blah blah <a href="http://google.com">Google</a></p>', 'html.parser')
# Rewrite the href of every link in place
for a in soup.find_all('a'):
    a['href'] = a['href'].replace("google", "mysite")
result = str(soup)

Source: BeautifulSoup - modifying all links in a piece of HTML?

David
1

You can't do this properly with regexps: they can only describe (nearly) type-3, i.e. regular, languages in the Chomsky hierarchy, while HTML is type-2 (context-free).

Although regexps may work as a quick-and-dirty solution, you will quickly reach their limits. In your case, you are already at that point.

If you really want a regex, something like this may work:

/<a [^>]*href="([^"]*)"/
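For what it's worth, here is a minimal sketch of how a pattern like that could be used from Python to swap the href value. The names `HREF_RE` and `replace_href` are made up for illustration, and the pattern is slightly adapted (extra capture groups) so the surrounding text can be put back unchanged. It assumes double-quoted attributes only.

```python
import re

# Same idea as the pattern above, with the prefix and closing quote captured
# so they can be kept as-is around the new value.
HREF_RE = re.compile(r'(<a\s[^>]*?href=")([^"]*)(")')

def replace_href(html, new_value):
    """Replace the href value of every <a> tag that uses double quotes."""
    # Group 1 is everything up to the opening quote, group 3 the closing quote.
    # A function is used as the replacement so new_value is inserted literally.
    return HREF_RE.sub(lambda m: m.group(1) + new_value + m.group(3), html)

link = '<a href="tel:8196887620" value="+18196887620" target="_blank">8196887620</a>'
print(replace_href(link, 'tel:+18196887620'))

# One of the limits: single-quoted attributes are silently left untouched.
print(replace_href("<a href='tel:1'>x</a>", 'tel:2'))
```

The second call shows why this only counts as quick-and-dirty: valid HTML that deviates from the expected shape simply isn't matched.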

A better solution would be to look into XSLT processing. There are good XSLT tools, even for the command line, that can do this for you.

peterh