Regular expression to extract URL from an HTML link

Question

I’m a newbie in Python. I’m learning regexes, but I need help here.

Here is the HTML source:

<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>

I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?

Duplicate: http://stackoverflow.com/questions/430966/regex-for-links-in-html-text — S.Lott, Jan 31 '09 at 22:04
I've been away from SO for a while, it's good to see I've missed nothing, and people are STILL asking how to parse HTML with regex every damn day. — bobince, Feb 01 '09 at 02:30
@bobince Multiple times a day, it is so bad I created two questions that I can redirect people to and a form answer that points them there. — Chas. Owens, May 13 '09 at 14:30

score 86 · Answer 1 · edited Jun 20 '20 at 09:12

If you're only looking for one:

import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))

If you have a long string, and want every instance of the pattern in it:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))

Where s is the string that you're looking for matches in.

Quick explanation of the regexp bits:

r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)

"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.

Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."

"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
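
A minimal sketch of the pieces described above, run against the link from the question. The names `html`, `pattern`, `whole` and `url` are illustrative and not part of the original answer. It shows that group 0 is the whole match, while group 1 is only the URL.

```python
import re

# Sample input: the link from the question.
html = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'

# Same pattern as above: an optional quote, then capture everything up to
# the next quote, space, or '>'.
pattern = re.compile(r'href=[\'"]?([^\'" >]+)')

match = pattern.search(html)
if match:
    whole = match.group(0)  # entire matched text, including href= and the quote
    url = match.group(1)    # only the captured URL
    print(whole)
    print(url)
```

Here `whole` still carries the `href="` prefix, which is why `group(1)` is the one to print.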

The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.

It's pretty easy to do:

# BeautifulSoup 4 is imported from the bs4 package.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_to_parse, 'html.parser')
for tag in soup.find_all('a', href=True):
    print(tag['href'])

Once you've installed BeautifulSoup, anyway.

Part of learning regexes is learning when not to use them; this is a case where you shouldn't use them. — Chas. Owens, May 13 '09 at 14:29
Some pages are so badly formatted that even BeautifulSoup can't find the links in them. Then you have to resort to something. — Petter H, Aug 22 '13 at 21:07
Small improvement to the regexp: `re.findall(r'href\s?=\s?[\'"]?([^\'" >]+)', show_notes)`, which allows a space before and/or after the equals sign. — Leon Overweel, Jun 16 '18 at 22:25
Are you sure it is "match.group(0)" instead of "match.group(1)"? — pah8J, Jan 19 '19 at 04:25
Would it not make more sense, and is it not more correct, to write `if match:` as `if match is not None:` instead? — blizz, Oct 13 '22 at 23:09
@blizz doesn't really matter; re.search will return a match object or None, and there's no other possible falsey returns (i.e. we don't need to distinguish between None and False here). As such, `if match:` is more compact and understandable. — David, Jan 03 '23 at 16:17
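
The whitespace-tolerant variant suggested in the comments can be sketched as follows. This version uses `\s*` instead of `\s?`, so any amount of whitespace around the equals sign is accepted; that change, the sample markup and the variable names are assumptions made for illustration.

```python
import re

# Made-up markup: spaces around '=', single quotes, and an unquoted value.
html = (
    '<a href = "http://a.example/x">A</a> '
    "<a href='http://b.example'>B</a> "
    '<a href=http://c.example>C</a>'
)

# \s* on both sides of '=' tolerates optional whitespace.
pattern = re.compile(r'href\s*=\s*[\'"]?([^\'" >]+)')

urls = pattern.findall(html)
print(urls)
```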

score 13 · Answer 2 · answered Jan 31 '09 at 19:13

Don't use regexes; use BeautifulSoup. That, or be so crufty as to shell out to, say, w3m/lynx and pull back in what w3m/lynx renders. The first is probably more elegant; the second just worked a heck of a lot faster on some unoptimized code I wrote a while back.

score 13 · Answer 3 · answered Jan 31 '09 at 19:16

This should work, although there might be more elegant ways.

import re
url = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
# Lookbehind/lookahead keep href=" and the closing quote out of the match.
r = re.compile(r'(?<=href=").*?(?=")')
print(r.findall(url))

`(?<=href=["']).*?(?=["'])` also handles single-quoted href values. – Neil, Aug 13 '13 at 19:45
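
One caveat with accepting either quote on both sides: the match ends at the first quote of either kind, even if it differs from the opening one. A possible alternative, sketched here under the assumption that the value is always quoted, captures the opening quote and requires the same character to close it. The sample markup and names are made up for illustration.

```python
import re

# Second link contains a double quote inside a single-quoted value.
html = '''<a href="http://www.ptop.se">double</a> <a href='/it"s-odd'>single</a>'''

# Group 1 remembers the opening quote; \1 demands the same closing quote.
pattern = re.compile(r'href=(["\'])(.*?)\1')

# Group 2 holds the URL itself.
urls = [m.group(2) for m in pattern.finditer(html)]
print(urls)
```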

score 12 · Answer 4 · answered Nov 27 '09 at 23:37

John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:

http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.

score 3 · Answer 5 · answered Mar 08 '17 at 22:39

This regex can help: read the first capture group with \1 or whatever method your language provides.

href="([^"]*)

example:

<a href="http://www.amghezi.com">amgheziName</a>

result:

http://www.amghezi.com

Hamedz

score 3 · Answer 6 · edited May 23 '17 at 12:09

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
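
As a rough illustration of the parser route, here is a minimal sketch using only the standard library's `html.parser.HTMLParser`. The class name `LinkCollector` and the `links` attribute are hypothetical names chosen for this example; the input is the link from the question.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names
        # arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


html = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
collector = LinkCollector()
collector.feed(html)
collector.close()
print(collector.links)
```

Because the parser tokenizes the tag for you, quoting style and attribute order stop mattering.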

Chas. Owens, answered May 13 '09 at 14:38

score 2 · Answer 7 · answered Jan 31 '09 at 19:34

There are tonnes of them on regexlib.

Chris S

score 1 · Answer 8 · edited Jun 20 '20 at 09:12

This works pretty well: the non-capturing group (?:href=['"]) matches href= and the opening quote without returning it, so only the link is captured. Tested on http://pythex.org/

(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)

Output:

Match 1. /wiki/Main_Page

Match 2. /wiki/Portal:Contents

Match 3. /wiki/Portal:Featured_content

Match 4. /wiki/Portal:Current_events

Match 5. /wiki/Special:Random

Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en

Rohit Malgaonkar, answered May 20 '16 at 06:07

When this regular expression is used in a Python program (rather than on the site you mentioned), it raises an error because of the quotation marks `'` and `"`. To fix this, escape the quote that delimits the string with a backslash: `regex='(?:href=[\'"])([:/.A-z?<_&\s=>0-9;-]+)'`. – Mohammad ElNesr Dec 22 '16 at 09:46
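
Two side notes on this pattern, sketched below. First, a triple-quoted raw string avoids escaping either quote. Second, the range `A-z` covers more than letters, because six punctuation characters sit between `Z` and `a` in ASCII. The `strict` pattern is a hypothetical tightening for illustration, not the answer's own regex.

```python
import re

# [A-z] spans ASCII 65..122, so it also accepts the six characters that
# lie between 'Z' and 'a'.
extras = [ch for ch in '[\\]^_`' if re.fullmatch(r'[A-z]', ch)]
print(extras)

# Hypothetical stricter class: letters spelled out as A-Za-z. The raw
# triple-quoted string lets both quote characters appear unescaped.
strict = re.compile(r'''(?:href=['"])([:/.A-Za-z?<_&\s=>0-9;-]+)''')

html = '<a href="/wiki/Main_Page">Main page</a>'
links = strict.findall(html)
print(links)
```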

score 1 · Answer 9 · answered May 13 '09 at 14:22

Yes, there are tons of them on regexlib. That only proves that REs should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser, but don't use REs. The ones that seem to work are extremely complicated and still don't cover all cases.

score -1 · Answer 10 · answered Apr 24 '18 at 07:50

You can use this.

<a[^>]+href=["'](.*?)["']
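
A quick sketch of how this pattern behaves in Python, using made-up markup. Note that `<a[^>]+` only requires some characters after `<a`, so tags such as `<area>` match as well; whether that is acceptable depends on the input.

```python
import re

# Made-up markup: an ordinary link plus an <area> tag that also has href.
html = (
    '<a class="ext" href="http://www.ptop.se">ptop</a> '
    '<area shape="rect" href="/map">'
)

# Raw triple-quoted string so both quote characters can appear unescaped.
pattern = re.compile(r'''<a[^>]+href=["'](.*?)["']''')

urls = pattern.findall(html)
print(urls)
```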

arjan
