Python regular expressions ending at "

Question

I'm trying to take a long string and extract all the URLs it contains.

re.findall(r"http://.+", page)

is what I have, but that doesn't result in what I want. The urls are all wrapped in double quotes, so how can I tell regular expressions to stop matching when it reaches a "?

mayhewr · Accepted Answer · 2012-10-24T20:48:39.960

3

There are very complex url-parsing regexes out there, but if you want to stop at a ", just use [^\"]+ for the url part.

Or switch to a single-quoted string and remove the \.

Also, a pattern that starts with http:// will miss any https URLs mixed in, so you might want to just go with

re.findall(r'"(http[^"]+)"', page)

But now we're getting into url-parsing regexes.
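
As a minimal sketch of the character-class approach, assuming `page` is an ordinary string and using the standard `re` module (the sample text and URLs below are made up for illustration):

```python
import re

# Hypothetical sample input; the URLs are placeholders, not from the question.
page = '<a href="http://example.com/a">A</a> <a href="https://example.org/b?x=1">B</a>'

# [^"]+ matches one or more characters that are not a double quote,
# so each match stops just before the closing quote.
# With one group in the pattern, findall returns only the group's text,
# which leaves the surrounding quotes out of the result.
urls = re.findall(r'"(http[^"]+)"', page)

print(urls)  # ['http://example.com/a', 'https://example.org/b?x=1']
```

Because the pattern only requires the text to start with http, it picks up both http and https URLs, along with anything else in quotes that happens to start with those four letters.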

edited Oct 24 '12 at 20:48

answered Oct 24 '12 at 20:29


(I think) the raw string won't work with the `\"`, so if not following the single quote advice, drop the `r` string prefix. – Joachim Isaksson Oct 24 '12 at 20:35
@JoachimIsaksson I don't think `\"` breaks it... it's just unnecessary – mayhewr Oct 24 '12 at 20:38

score 0 · Answer 2 · answered Oct 24 '12 at 20:39

It is better to use a non-greedy expression here instead of using [^\"]+. That way your regex would be r'"http://.+?"'. The question mark after the plus makes .+ match as few characters as possible, so the match stops at the first double quote it encounters.
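
A small sketch of the difference between greedy and non-greedy matching, assuming the input is a single line containing two quoted URLs (the sample string is hypothetical):

```python
import re

# Hypothetical one-line input with two quoted URLs.
page = 'see "http://example.com/a" and "http://example.com/b" here'

# Greedy: .+ grabs as much as it can, then backs up to the LAST quote
# on the line, so both URLs end up inside a single match.
greedy = re.findall(r'"http://.+"', page)

# Non-greedy: .+? stops at the FIRST quote that lets the pattern succeed.
lazy = re.findall(r'"http://.+?"', page)

print(greedy)  # ['"http://example.com/a" and "http://example.com/b"']
print(lazy)    # ['"http://example.com/a"', '"http://example.com/b"']
```

Note that without a capturing group the quotes are still part of each match.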

answered Oct 24 '12 at 20:39

BrtH


That's what I came up with after what I posted in the question, but now I end up with a list of all the urls, but they all end with a ". How can I get rid of that using regex? – user1624005 Oct 24 '12 at 20:41
How is `[^\"]+` greedy? It will never match any quotes, let alone gather as many as possible. Am I missing something? – mayhewr Oct 24 '12 at 20:41
Could you please explain why non greedy is better than the character class? – halex Oct 24 '12 at 20:41
`[^\"]+` is not the greedy part I was talking about, I was referring to `.+`. IMHO `.+?` looks a lot cleaner than `[^\"]+`, and it also avoids the discussion about escaping the double quotes or not. – BrtH Oct 24 '12 at 20:43
@user1624005 To capture just the url, you should put the part inside the quotes in parentheses, like `r'"(http://.+?)"'` or `'"(http://[^"]+)"'` – mayhewr Oct 24 '12 at 20:44
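
To illustrate the capture-group suggestion, here is a hedged sketch. The first pattern is only a guess at what produced the trailing quote (it is not shown in the thread), and the sample string is made up:

```python
import re

# Hypothetical input; the URLs are placeholders.
page = 'x "http://example.com/a" y "http://example.com/b" z'

# Assumed pattern without a group: findall returns the whole match,
# so the closing quote is included in every result.
with_quote = re.findall(r'http://.+?"', page)

# With one capturing group, findall returns only the group's text,
# so the quotes around the URL are dropped.
lazy = re.findall(r'"(http://.+?)"', page)
char_class = re.findall(r'"(http://[^"]+)"', page)

print(with_quote)  # ['http://example.com/a"', 'http://example.com/b"']
print(lazy)        # ['http://example.com/a', 'http://example.com/b']
print(char_class)  # ['http://example.com/a', 'http://example.com/b']
```

On input like this the two grouped patterns give the same list, so the choice between them is mostly a matter of taste.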
