0
links = re.findall('href="(http(s?)://[^"]+)"',page)

I have this regular expression to find all links in a web page, but I am getting this result:

('http://asecuritysite.com', '')
('https://www.sans.org/webcasts/archive/2013', 's')

When what I want is only this:

http://asecuritysite.com
https://www.sans.org/webcasts/archive/2013

If I remove the "( after the href, it gives me loads of errors. Can someone explain why?

Alan Moore

4 Answers

2

If the pattern has more than one capturing group, re.findall returns a list of tuples instead of a list of strings. Try the following, which uses only a single group:

>>> import re
>>> page = '''
...     <a href="http://asecuritysite.com">here</a>
...     <a href="https://www.sans.org/webcasts/archive/2013">there</a>
...     '''
>>> re.findall(r'href="(https?:\/\/[^"]+)"',page)
['http://asecuritysite.com', 'https://www.sans.org/webcasts/archive/2013']

According to the re.findall documentation:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
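As a rough illustration of the behaviour described above, the sketch below runs the original two-group pattern next to a variant that uses a non-capturing group, (?:...). The non-capturing variant is just one alternative to dropping the inner parentheses, and the variable names two_groups and one_group are made up for this example.

```python
import re

page = '''
    <a href="http://asecuritysite.com">here</a>
    <a href="https://www.sans.org/webcasts/archive/2013">there</a>
'''

# Two capturing groups: findall returns one tuple per match,
# with one element for each group.
two_groups = re.findall(r'href="(http(s?)://[^"]+)"', page)

# (?:...) groups without capturing, so only one capturing group
# remains and findall returns plain strings.
one_group = re.findall(r'href="(http(?:s?)://[^"]+)"', page)

print(two_groups)
print(one_group)
```

The first call prints tuples like the ones in the question; the second prints only the URLs.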

falsetru
1

Try getting rid of the second group (the (s?) in your original pattern):

links = re.findall(r'href="(https?://[^"]+)"', page)
p.s.w.g
  • @user2988983 *falsetru* has provided an [excellent explanation](http://stackoverflow.com/a/20248449/1715579). Basically, using multiple capture groups causes the result to be formatted this way. – p.s.w.g Nov 27 '13 at 17:06
1

What you are doing wrong is trying to parse HTML with Regex. And that, sir, is a sin.

See here for the horrors of Regex parsing HTML

An alternative is to use something like lxml to parse the page and extract the links, something like this:

import lxml.html
urls = lxml.html.fromstring(page).xpath('//a/@href')
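If installing lxml is not an option, the same idea (let a parser find the attributes instead of a regex) can be sketched with the standard library's html.parser module. This is only a minimal sketch: the LinkCollector class is a hypothetical helper written for this example, not part of any library, and the sample page is made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        # for attributes written without a value.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


page = '''
    <a href="http://asecuritysite.com">here</a>
    <a href='https://www.sans.org/webcasts/archive/2013'>there</a>
'''

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

Because the parser handles attribute quoting itself, single-quoted and double-quoted href values are both picked up.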
Jakob Bowyer
0

You'll also run into problems if the URL is wrapped in single quotes instead of double quotes.

(https?:\/\/[^\"\'\>]+) will capture the entire URL; what you could then do is prepend (href=.?) to it, and you'd end up with two capture groups:

Full regex: (href=.?)(https?:\/\/[^\"\'\>]+)

MATCH 1

  • [Group 1] href='
  • [Group 2] http://asecuritysite.com

MATCH 2

  • [Group 1] href='
  • [Group 2] https://www.sans.org/webcasts/archive/2013

Here is a working example: http://regex101.com/r/gO8vV7
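In Python, a two-group pattern like this is easier to use with re.finditer than with re.findall, since you can then pick out just the URL group. The sketch below is a variation on the regex above, not the exact same pattern: it assumes the value is always quoted, captures the opening quote in group 1, and uses the backreference \1 to require the same quote at the end. The sample page is made up for illustration.

```python
import re

page = '''
    <a href='http://asecuritysite.com'>here</a>
    <a href="https://www.sans.org/webcasts/archive/2013">there</a>
'''

# Group 1 captures the opening quote; \1 requires the same quote to close.
# Group 2 is the URL itself, matched lazily up to that closing quote.
pattern = re.compile(r'''href=(["'])(https?://.*?)\1''')

# finditer yields match objects, so we can keep group 2 (the URL)
# and ignore the quote captured in group 1.
links = [m.group(2) for m in pattern.finditer(page)]
print(links)
```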

brandonscript