In Python, how to write a regex that matches the URL in an <a href> tag

Question

I am trying to write a regex in Python that matches the URL in an <a href> tag.

For example, given this:

<a href="http://www.simplyrecipes.com/recipes/broccoli_slaw_with_cranbery_orange_dressing/" n    title="Permalink to Broccoli Slaw with Cranberry Orange Dressing" rel="bookmark"><img    width="520" height="347"

I need this part to be matched:

<a href="http://www.simplyrecipes.com/recipes/broccoli_slaw_with_cranbery_orange_dressing/"

This is what I have so far:

^<a href="http://www(???what to put in here????)"$

But I don't know how to express the part after www, which must be included in the match but does not need any special treatment.

Thanks in advance for any enlightenment!

Bear in mind that regular expressions are not sophisticated enough to parse arbitrary XML: http://stackoverflow.com/a/1732454/7918. — jb., Jan 19 '14 at 20:32
That's always worth pointing out, but all s/he needs is to match between double quotes. — alexis, Jan 19 '14 at 20:40

Elisha · Accepted Answer · 2014-01-19T20:40:26.017

score 2

Any character that is not a double quote is matched by: [^"]

So after the opening quote you can put: [^"]*"

which gives: '<a href="[^"]*"'
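A minimal sketch of that pattern applied to the tag from the question. The variable names (html, pattern, whole, url) are illustrative, the sample tag is shortened, and the capturing group around [^"]* is an addition not in the answer, used here to pull out the URL on its own.

```python
import re

# Sample tag from the question, shortened (variable names are illustrative).
html = ('<a href="http://www.simplyrecipes.com/recipes/'
        'broccoli_slaw_with_cranbery_orange_dressing/" '
        'title="Permalink to Broccoli Slaw with Cranberry Orange Dressing" '
        'rel="bookmark">')

# [^"]* matches any run of characters that are not a double quote,
# so the match stops at the quote that closes the href value.
# The parentheses are an added capturing group for the URL itself.
pattern = re.compile(r'<a href="([^"]*)"')

match = pattern.search(html)
if match:
    whole = match.group(0)  # the tag start through the closing quote
    url = match.group(1)    # only the text between the quotes
    print(whole)
    print(url)
```

Note that re.search is used rather than anchoring with ^ and $, since the tag normally sits in the middle of a larger document.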

edited Jan 19 '14 at 20:40

answered Jan 19 '14 at 20:32


score 1 · Answer 2 · answered Jan 19 '14 at 20:36

You'll soon discover that not all URLs start with www, and many don't even start with http://. Here's how you would extract the URL from the href attribute of a link: match everything within the quotes that follow <a href=. Whitespace is legal in various places inside an HTML tag, which complicates things a little:

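import re
# text holds the HTML to search; \s* and \s+ allow for whitespace inside the tag
matchobj = re.search(r'<\s*a\s+href\s*=\s*"([^"]*)', text, re.IGNORECASE)
url = matchobj.group(1) if matchobj else None  # re.search returns None when nothing matches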

This will also get you relative URLs and other protocols besides http. If you're not interested in everything, it is easier to sort through the results after you've extracted them.
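To collect every link rather than just the first, the same pattern can be handed to re.findall and the results filtered afterwards. This is only a sketch: the sample text, the names HREF_RE, all_urls and http_urls, and the choice to keep only http(s) links are assumptions made for illustration.

```python
import re

# Hypothetical sample input; any HTML string would do.
text = '''
<a href="http://www.example.com/a">A</a>
<A HREF = "/relative/path">B</A>
<a href="mailto:someone@example.com">C</a>
'''

# Same pattern as above, compiled once. With a single capturing group,
# findall returns a list of the captured strings (the href values).
HREF_RE = re.compile(r'<\s*a\s+href\s*=\s*"([^"]*)', re.IGNORECASE)

all_urls = HREF_RE.findall(text)

# Sort through the results after extraction: here, keep absolute http(s) links.
http_urls = [u for u in all_urls
             if u.lower().startswith(("http://", "https://"))]

print(all_urls)
print(http_urls)
```

The pattern only matches when href is the first attribute after the tag name and the value is double-quoted, so links written differently are skipped.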

score 1 · Answer 3 · answered Jan 19 '14 at 20:40


Import re and use findall (here url is the variable holding the text to search):

import re
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', url)
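A short sketch of how this could be run on a piece of text. The sample string and the names URL_RE and text are made up for illustration; the pattern is the one above, written as a raw string so the backslashes reach the regex engine unchanged.

```python
import re

# Pattern from the answer, as a raw string. It looks for http:// or https://
# anywhere in the text, independent of any surrounding HTML tag.
URL_RE = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

# Hypothetical sample text.
text = 'See http://example.com/page and https://example.org/a?b=1 for details.'

urls = URL_RE.findall(text)
print(urls)
```

Unlike the href-based patterns, this one finds only absolute http and https URLs, and it also picks them up outside of link tags.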

answered Jan 19 '14 at 20:40

Perefexexos

