0

I am trying to write a regex in Python that matches the URL inside an <a href tag.

For example, if I take this:

<a href="http://www.simplyrecipes.com/recipes/broccoli_slaw_with_cranbery_orange_dressing/" n    title="Permalink to Broccoli Slaw with Cranberry Orange Dressing" rel="bookmark"><img    width="520" height="347" 

I need this part to be matched:

<a href="http://www.simplyrecipes.com/recipes/broccoli_slaw_with_cranbery_orange_dressing/" 

So this is what I have so far:

^<a href="http://www(???what to put in here????)"$

But I don't know how to express the part after www, which must be included in the match but needs no special treatment.

Thanks in advance for any enlightenment!

user2305415
  • 2
    Bear in mind that regular expressions are not sophisticated enough to parse arbitrary XML: http://stackoverflow.com/a/1732454/7918. – jb. Jan 19 '14 at 20:32
  • That's always worth pointing out, but all s/he needs is to match between double quotes. – alexis Jan 19 '14 at 20:40

3 Answers

2

Any character that is not a double quote: [^"]

So after the opening quote you can put: [^"]*"

and get: '<a href="[^"]*"'
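
As a quick check, here is a minimal sketch of that pattern applied to the tag from the question. The tag is shortened, and the variable names html, match and matched_text are only illustrative.

```python
import re

# Shortened version of the tag from the question (assumed sample input).
html = ('<a href="http://www.simplyrecipes.com/recipes/'
        'broccoli_slaw_with_cranbery_orange_dressing/" '
        'title="Permalink to Broccoli Slaw with Cranberry Orange Dressing" '
        'rel="bookmark">')

# [^"]* consumes every character up to, but not including, the closing quote.
match = re.search(r'<a href="[^"]*"', html)
matched_text = match.group(0) if match else None
print(matched_text)
```

Note that this assumes href is the first attribute, is written in lowercase, and uses double quotes.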

Elisha
1

You'll soon discover that not all URLs start with www, and many don't even start with http://. Here's how you would extract the URL from the href attribute of a link: match everything within the quotes that follow the <a href=. Whitespace is legal in various places inside an HTML tag, which complicates things a little:

import re
matchobj = re.search(r'<\s*a\s+href\s*=\s*"([^"]*)', text, re.IGNORECASE)
url = matchobj.group(1) if matchobj else None  # None when text contains no link

This will also get you relative URLs and other protocols besides http. If you're not interested in everything, it is easier to sort through the results after you've extracted them.
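
A rough sketch of that extract-then-filter idea, using re.findall to collect every link instead of only the first. The sample markup, the helper name extract_hrefs, and the choice to keep only http/https URLs are assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Same pattern as above, compiled once; group 1 is the quoted href value.
HREF_RE = re.compile(r'<\s*a\s+href\s*=\s*"([^"]*)', re.IGNORECASE)


def extract_hrefs(text):
    """Return every double-quoted href value that directly follows '<a'."""
    return HREF_RE.findall(text)


# Made-up sample markup for illustration.
sample = (
    '<a href="http://www.simplyrecipes.com/recipes/">Recipes</a> '
    '<A HREF = "/about">About</A> '
    '<a href="mailto:someone@example.com">Mail</a>'
)

all_links = extract_hrefs(sample)
# Filter after extraction: keep only absolute http(s) URLs.
web_links = [u for u in all_links if urlparse(u).scheme in ("http", "https")]
print(all_links)
print(web_links)
```

The pattern still only sees links where href is the first attribute and the value is double-quoted, so treat it as a starting point rather than a complete extractor.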

alexis
1

Import re and use findall on the text (here held in the variable url):

import re
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', url)
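
For illustration, a small sketch of what this pattern returns on a shortened version of the tag from the question. The names URL_RE and html are assumed, not part of the original snippet.

```python
import re

# The pattern from above, compiled as a raw string so the backslashes
# reach the regex engine unchanged.
URL_RE = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

# Shortened version of the tag from the question (assumed sample input).
html = ('<a href="http://www.simplyrecipes.com/recipes/'
        'broccoli_slaw_with_cranbery_orange_dressing/" rel="bookmark">')

# The pattern has no capturing groups, so findall returns the full matches.
urls = URL_RE.findall(html)
print(urls)
```

This pattern is not tied to the href attribute: it picks up any http or https URL in the text and skips relative links.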
Perefexexos