What am I doing wrong with this regular expression?

Question

links = re.findall('href="(http(s?)://[^"]+)"',page)

I have this regular expression to find all links in a website, but I am getting this result:

('http://asecuritysite.com', '')
('https://www.sans.org/webcasts/archive/2013', 's')

When what I want is only this:

http://asecuritysite.com
https://www.sans.org/webcasts/archive/2013

If I remove the "( after the href, I get loads of errors. Can someone explain why?

If you eliminate just one parenthesis, you end up with unbalanced parentheses (which are used to denote capture groups), which is a syntax error (of sorts). Why would you remove that parenthesis? — Reinstate Monica -- notmaynard, Nov 27 '13 at 16:57
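To make the comment above concrete, here is a minimal sketch of what probably happens, assuming only the opening parenthesis after `href="` was dropped (the exact edit the asker made is not shown, so `broken` is a guess at it). The remaining `)` has no partner, and `re` refuses to compile the pattern:

```python
import re

# Assumed reconstruction: the question's pattern with the "(" after href=" removed.
broken = r'href="http(s?)://[^"]+)"'

try:
    re.compile(broken)
    error_message = None
except re.error as exc:
    # The stray ")" near the end has no matching "(".
    error_message = str(exc)

print(error_message)
```

Removing both parentheses of the outer group would compile, but then the only remaining group is `(s?)`, so `re.findall` would return just `''` or `'s'` for each match.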

falsetru · Answer 1 · 2013-11-27T17:01:21.163

If you use more than one capturing group, re.findall returns a list of tuples instead of a list of strings. Try the following (using only a single group):

>>> import re
>>> page = '''
...     <a href="http://asecuritysite.com">here</a>
...     <a href="https://www.sans.org/webcasts/archive/2013">there</a>
...     '''
>>> re.findall(r'href="(https?:\/\/[^"]+)"',page)
['http://asecuritysite.com', 'https://www.sans.org/webcasts/archive/2013']

According to re.findall documentation:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
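If you would rather keep the `s?` in its own group, a non-capturing group `(?:...)` is another way to get plain strings back. The following is only a sketch that contrasts the two behaviours on the same sample `page` as above; the variable names `two_groups` and `one_group` are made up for illustration:

```python
import re

page = '''
<a href="http://asecuritysite.com">here</a>
<a href="https://www.sans.org/webcasts/archive/2013">there</a>
'''

# Two capturing groups: each match becomes a tuple (outer group, inner group).
two_groups = re.findall(r'href="(http(s?)://[^"]+)"', page)

# (?:...) groups without capturing, so only one capturing group remains
# and findall returns a list of strings.
one_group = re.findall(r'href="(http(?:s?)://[^"]+)"', page)

print(two_groups)
print(one_group)
```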

score 1 · Accepted Answer · answered Nov 27 '13 at 16:54

Try getting rid of the second group (the (s?) in your original pattern):

links = re.findall(r'href="(https?://[^"]+)"', page)

p.s.w.g

@user2988983 *falsetru* has provided an [excellent explanation](http://stackoverflow.com/a/20248449/1715579). Basically, using multiple capture groups causes the result to be formatted this way. – p.s.w.g Nov 27 '13 at 17:06

score 1 · Answer 3 · edited May 23 '17 at 12:12

What you are doing wrong is trying to parse HTML with regex. And that, sir, is a sin.

See here for the horrors of Regex parsing HTML

An alternative is to use something like lxml to parse the page and extract the links, something like this:

urls = html.xpath('//a/@href')
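The one-liner above assumes `html` is an already parsed tree (with lxml that would typically come from something like `lxml.html.fromstring(page)`). For a version that runs without installing anything, here is a rough sketch that swaps in the standard library's `html.parser` instead of lxml; `LinkCollector` is a made-up name, not part of any library:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already stripped.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.urls.append(value)


page = '''
<a href="http://asecuritysite.com">here</a>
<a href='https://www.sans.org/webcasts/archive/2013'>there</a>
'''

collector = LinkCollector()
collector.feed(page)
urls = collector.urls
print(urls)
```

Unlike the regex in the question, this also picks up relative links and single-quoted attributes, so filter on the `http` prefix afterwards if only absolute URLs are wanted.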

answered Nov 27 '13 at 16:54

Jakob Bowyer

Actually (as many comments in that link also point out) OP is _not_ parsing HTML, but rather simply searching for a string matching a pattern. – Reinstate Monica -- notmaynard Nov 27 '13 at 16:56
I don't see any evidence that he isn't trying to parse html – Jakob Bowyer Nov 27 '13 at 16:57
@iamnotmaynard it's especially scary as he is regex searching a `page` variable. – Jakob Bowyer Nov 27 '13 at 17:04

brandonscript · Answer 4 · 2013-11-27T17:15:27.007

0

You're going to run into problems too if it's a single quote before the https? instead of double.

(https?:\/\/[^\"\'\>]+) will capture the entire string; what you could then do is prepend (href=.?) to it, and you'd end up with two capture groups:

Full regex: (href=.?)(https?:\/\/[^\"\'\>]+)

MATCH 1

[Group 1] href='
[Group 2] http://asecuritysite.com

MATCH 2

[Group 1] href='
[Group 2] https://www.sans.org/webcasts/archive/2013

Here is a working example: http://regex101.com/r/gO8vV7
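In Python, one possible way to handle either quote style is sketched below. It is a variation on the pattern above rather than the same regex: the opening quote is captured and a backreference `\1` requires the same quote to close the value. Because that leaves two groups, it uses `finditer` and picks group 2 instead of `findall`. The sample `page` is made up for the demonstration:

```python
import re

page = '''
<a href='http://asecuritysite.com'>here</a>
<a href="https://www.sans.org/webcasts/archive/2013">there</a>
'''

# Group 1 captures the opening quote; \1 requires the same quote to close.
pattern = re.compile(r'''href=(["'])(https?://.*?)\1''')

# finditer + group(2) avoids the tuples findall would return for two groups.
links = [m.group(2) for m in pattern.finditer(page)]
print(links)
```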

edited Nov 27 '13 at 17:15

answered Nov 27 '13 at 17:04

Don't forget to forbid whitespace characters (which cannot appear inside an href attribute) – Casimir et Hippolyte Nov 27 '13 at 17:18
@CasimiretHippolyte It's best not to do that, just in case the original programmer is putting in illegal characters. So long as it's not followed by the quote or `>` delimiter, it'll get captured. – brandonscript Nov 27 '13 at 17:23
Why not, but if another attribute follows, this doesn't work. – Casimir et Hippolyte Nov 27 '13 at 17:35
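The trade-off discussed in these comments can be seen with a small sketch. The tag below is hypothetical markup with an unquoted href followed by another attribute; `loose` uses the character class from the answer, and `strict` adds `\s` to it as the first comment suggests:

```python
import re

# Hypothetical markup: an unquoted href followed by another attribute.
tag = '<a href=http://asecuritysite.com class=ext>here</a>'

# Character class from the answer: stops only at a quote or '>'.
loose = re.findall(r'href=.?(https?://[^"\'>]+)', tag)

# Same class with \s added: also stops at whitespace.
strict = re.findall(r'href=.?(https?://[^"\'>\s]+)', tag)

print(loose)
print(strict)
```

Without `\s` the capture runs on into the next attribute; with it, a URL that contains a stray space would be cut short. Which failure is preferable depends on the pages being scanned.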
