
I'm trying to take a long string and extract all the URLs it contains.

page.findall(r"http://.+")

is what I have, but that doesn't give me what I want. The URLs are all wrapped in double quotes, so how can I tell the regular expression to stop matching when it reaches a "?

user1624005

2 Answers


There are very complex url-parsing regexes out there, but if you want to stop at a ", just use [^\"]+ for the url part.

Or switch to a single-quoted string and remove the \.

Also, if you have https URLs mixed in, a pattern that requires http:// will miss them, so you might want to just go with

page.findall(r'"(http[^"]+)"')

But now we're getting into url-parsing regexes.
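
As a rough illustration of the character-class approach, here is a minimal sketch. It assumes `page` is a plain Python string, so it calls `re.findall` with the pattern first; the sample HTML and the example.com/example.org addresses are made up for the demonstration.

```python
import re

# Hypothetical sample input; the real `page` would be whatever text you loaded.
page = '<a href="http://example.com/a">A</a> <img src="https://example.org/b.png">'

# [^"]+ matches one or more characters that are NOT a double quote,
# so each match ends right before the closing quote.
# The parentheses make findall return only the part between the quotes.
urls = re.findall(r'"(http[^"]+)"', page)

print(urls)  # ['http://example.com/a', 'https://example.org/b.png']
```

Because the pattern only requires the text to start with `http`, it picks up both `http://` and `https://` URLs, but it will also accept any other quoted string that happens to begin with `http`.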

mayhewr

It is better to use a non-greedy expression here instead of [^\"]+. That way your regex would be r'"http://.+?"'. The question mark after the plus makes the match stop at the first double quote it encounters.
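
To see the difference the question mark makes, here is a small sketch comparing the greedy and non-greedy forms. It assumes `page` is a plain string (hence `re.findall(pattern, page)`), and the sample text and a.example/b.example addresses are invented for illustration.

```python
import re

# Hypothetical input with two quoted URLs on one line.
page = 'x "http://a.example/1" y "http://b.example/2" z'

# Greedy: .+ grabs as much as it can, then backs up to the LAST quote on the line.
greedy = re.findall(r'"http://.+"', page)

# Non-greedy: .+? grabs as little as it can, stopping at the FIRST closing quote.
lazy = re.findall(r'"http://.+?"', page)

print(greedy)  # ['"http://a.example/1" y "http://b.example/2"']
print(lazy)    # ['"http://a.example/1"', '"http://b.example/2"']
```

Note that both results still include the surrounding quotes, since the pattern has no capturing group.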

BrtH
  • That's what I came up with after what I posted in the question, but now I end up with a list of all the urls, but they all end with a ". How can I get rid of that using regex? – user1624005 Oct 24 '12 at 20:41
  • How is `[^\"]+` greedy? It will never match any quotes, let alone gather as many as possible. Am I missing something? – mayhewr Oct 24 '12 at 20:41
  • Could you please explain why non greedy is better than the character class? – halex Oct 24 '12 at 20:41
  • `[^\"]+` is not the greedy part I was talking about, I was referring to `.+`. IMHO `.+?` looks a lot cleaner than `[^\"]+`, and it also avoids the discussion about escaping the double quotes or not. – BrtH Oct 24 '12 at 20:43
  • @user1624005 To capture just the url, you should put the part inside the quotes in parentheses, like `r'"(http://.+?)"'` or `'"(http://[^"]+)"'` – mayhewr Oct 24 '12 at 20:44
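
For what it's worth, the effect of the parentheses can be checked with a short sketch like the one below. It assumes `page` is a plain string, and the sample attributes and URLs are made up. When a pattern contains exactly one capturing group, `re.findall` returns the text of that group rather than the whole match.

```python
import re

# Hypothetical input; the URLs are placeholders.
page = 'src="http://a.example/1" href="http://b.example/2"'

# No group: findall returns the whole match, quotes included.
with_quotes = re.findall(r'"http://.+?"', page)

# One group: findall returns only what the parentheses captured.
lazy_group = re.findall(r'"(http://.+?)"', page)
class_group = re.findall(r'"(http://[^"]+)"', page)

print(with_quotes)  # ['"http://a.example/1"', '"http://b.example/2"']
print(lazy_group)   # ['http://a.example/1', 'http://b.example/2']
print(class_group)  # ['http://a.example/1', 'http://b.example/2']
```

On input like this the non-greedy and character-class versions give the same list, so the choice between them is mostly a matter of taste.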