Regex quantifiers

Question

I'm new to regex and this is stumping me.

In the following example, I want to extract facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info. I've read up on lazy quantifiers and lookbehinds but I still can't piece together the right regex. I'd expect facebook.com\/.*?sk=info to work but it captures too much. Can you guys help?

<i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_96df30"></i></span><span class="fbProfileBylineLabel"><span itemprop="address" itemscope="itemscope" itemtype="http://schema.org/PostalAddress"><a href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?sk=page_map" target="_self">7508 15th Avenue, Brooklyn, New York 11228</a></span></span></span><span class="fbProfileBylineFragment"><span class="fbProfileBylineIconContainer"><i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_9f18df"></i></span><span class="fbProfileBylineLabel"><span itemprop="telephone">(718) 837-9004</span></span></span></div></div></div><a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&amp;sk=info" aria-label="About Dr. Morris Westfried - Dermatologist">

Search for HTML parsers in python. – hjpotter92 Mar 29 '14 at 22:57
http://stackoverflow.com/a/1732454/2823755 – wwii Mar 29 '14 at 23:40

score 4 · Answer 1 · answered Mar 29 '14 at 23:05

As much as I love regex, this is an html parsing task:

>>> from bs4 import BeautifulSoup
>>> html = .... # that whole text in the question
>>> # name the parser explicitly; bs4 warns when it has to guess one
>>> soup = BeautifulSoup(html, 'html.parser')
>>> # .get() avoids a KeyError on <a> tags that have no href attribute
>>> pred = lambda tag: tag.get('href', '').endswith('sk=info')
>>> [tag['href'] for tag in filter(pred, soup.find_all('a'))]
['https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info']
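
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser. This is only a sketch and not part of the answer as posted: HrefCollector and info_links are hypothetical names, and SAMPLE is an abbreviated stand-in for the HTML in the question.

```python
from html.parser import HTMLParser

# Abbreviated stand-in for the HTML in the question (only its two links).
SAMPLE = (
    '<a href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/'
    '176363502456825?sk=page_map" target="_self">7508 15th Avenue</a>'
    '<a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-'
    'Dermatologist/176363502456825?id=176363502456825&amp;sk=info">About</a>'
)


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; the parser has
            # already turned entities such as &amp; back into &
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def info_links(html_text):
    """Return the hrefs in html_text that end with sk=info."""
    collector = HrefCollector()
    collector.feed(html_text)
    collector.close()
    return [h for h in collector.hrefs if h.endswith("sk=info")]


print(info_links(SAMPLE))
```

As with the BeautifulSoup version, the returned URL contains a plain & rather than &amp;, because the parser decodes attribute values.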

behzad.nouri


It's probably a better way to proceed, however, this doesn't explain why the pattern doesn't work. – Casimir et Hippolyte Mar 30 '14 at 00:18
@CasimiretHippolyte the question says "I want to extract ..."; and that is what above does. also, [this](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) may be relevant. – behzad.nouri Mar 30 '14 at 00:32
@CasimiretHippolyte -- many thanks for the regex explanation. I did use BeautifulSoup for a similar task – Peter Mar 30 '14 at 03:04

score 3 · Answer 2 · answered Mar 29 '14 at 23:00

This works :)

facebook\.com\/[^>]*?sk=info

With a plain ., the engine finds the first facebook.com and then continues until the sk=info. Since there's another facebook.com in between, the match spans both links.

What separates the two links, and what you don't want inside the match, is a > (or a <, among other characters). Changing "any character" to "anything but >" makes the regex find the facebook.com closest to the sk=info, as you want.

And yes, regex should only be used on HTML for basic tasks. Otherwise, use a parser.
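
To see the difference in Python, here is a small sketch. SAMPLE is a made-up, shortened version of the question's HTML (the "Example/1" URLs are placeholders), so treat it as an illustration rather than a test against the real page.

```python
import re

# Shortened, hypothetical stand-in for the two links in the question.
SAMPLE = (
    '<a href="https://www.facebook.com/pages/Example/1?sk=page_map" '
    'target="_self">Address</a>'
    '<a class="title" href="https://www.facebook.com/pages/Example/1'
    '?id=1&amp;sk=info" aria-label="About">'
)

# The pattern from the question: starts at the first facebook.com.
lazy_dot = re.search(r"facebook\.com/.*?sk=info", SAMPLE)

# The pattern from this answer: cannot cross a '>'.
not_gt = re.search(r"facebook\.com/[^>]*?sk=info", SAMPLE)

print(">" in lazy_dot.group())  # True: the match runs across both tags
print(not_gt.group())           # facebook.com/pages/Example/1?id=1&amp;sk=info
```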

aliteralmind


Go to debuggex. It works. The `?` may not be necessary, but it works. It's part of the `[^>]*?` which means zero or more *not `>`* characters, possessively. It's not a regular `?`, it's the possessive modifier. – aliteralmind Mar 29 '14 at 23:03
Fair enough; it's a reluctant, not possessive, but kudos for using it. :) – Ray Toal Mar 29 '14 at 23:05
"Reluctant". Right. Not possessive. – aliteralmind Mar 29 '14 at 23:05
You were suggesting that the question mark in `?sk=info` is literally part of the url. It isn't. It's `;sk=info`. The question-mark is only to make the regex piece before it (`[^>]*`) reluctant. – aliteralmind Mar 29 '14 at 23:06

fejese · Answer 3 · 2014-04-04T07:11:18.440

2

The problem is that there is another facebook.com part earlier in the text. You can replace the .* with a class that does not match ", so the match has to stay within one attribute:

facebook\.com\/[^"]*;sk=info
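
A quick way to try this in Python, as a sketch only: SAMPLE is a shortened, made-up version of the question's HTML. Because the regex runs on the raw source, the match still contains &amp;. The html.unescape step is an addition here, not part of the pattern above, and turns it back into the & of the real URL.

```python
import html
import re

# Shortened, hypothetical stand-in for the two links in the question.
SAMPLE = (
    '<a href="https://www.facebook.com/pages/Example/1?sk=page_map">Address</a>'
    '<a class="title" href="https://www.facebook.com/pages/Example/1'
    '?id=1&amp;sk=info" aria-label="About">'
)

# [^"]* cannot leave the attribute value, so the first link is rejected.
match = re.search(r'facebook\.com/[^"]*;sk=info', SAMPLE)

raw = match.group()       # text exactly as it appears in the HTML source
url = html.unescape(raw)  # decode &amp; into &

print(raw)  # facebook.com/pages/Example/1?id=1&amp;sk=info
print(url)  # facebook.com/pages/Example/1?id=1&sk=info
```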

edited Apr 04 '14 at 07:11

answered Mar 29 '14 at 22:58

fejese


With the literal question mark, it does not work. Read the comments under my answer. – aliteralmind Mar 29 '14 at 23:04

Casimir et Hippolyte · Accepted Answer · 2014-03-30T01:30:43.880

Why your pattern doesn't work:

Your pattern doesn't work because the regex engine tries your pattern from left to right in the string.

When the regex engine meets the first facebook.com\/ in the string, the .*? that follows makes it add to the (possible) match every character (including ", > and spaces) until it finds sk=info, since . can match any character except a newline.

This is why fejese suggests replacing the dot with [^"] and aliteralmind suggests replacing it with [^>]: either change makes the pattern fail at this first position in the string, so the engine moves on and succeeds at the second facebook.com.
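
The same behaviour can be reproduced on a toy string. This is a minimal sketch with made-up input: "a" plays the role of facebook.com, "c" the role of sk=info, and the space the role of the " or > between the two links.

```python
import re

text = "a1 a2c"

# Lazy dot: the match starts at the FIRST "a" and grows until a "c" appears.
# Laziness only decides where the match ends, not where it begins.
lazy = re.search(r"a.*?c", text)
print(lazy.group(), lazy.span())              # a1 a2c (0, 6)

# Excluding the separator makes the attempt at position 0 fail,
# so the engine retries further right and matches at the second "a".
restricted = re.search(r"a[^ ]*?c", text)
print(restricted.group(), restricted.span())  # a2c (3, 6)
```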

Using an HTML parser is the easiest way if you want to deal with HTML. However, for a one-off match or search/replace, note that although an HTML parser provides safety and simplicity, it has a cost in terms of performance, since you need to load the whole tree of your document for a single task.
