
I want to find all substrings wrapped in double quotes that satisfy the following two constraints:

  1. It is the shortest substring starting with "http"
  2. It ends with ".bmp" or ".jpg"

My code is below:

import re
pat = r'"(http.+?\.(jpg|bmp))"'  # I don't know how to modify this pattern
reg = re.compile(pat)
aa = '"http:afd/aa.bmp" :tt: "kkkk"  ++, "http--test--http:kk/bb.jpg"'
print(reg.findall(aa))

My expected output is

['http:afd/aa.bmp', 'http:kk/bb.jpg']

But the actual result is

[('http:afd/aa.bmp', 'bmp'), ('http--test--http:kk/bb.jpg', 'jpg')]

I have already tried several kinds of patterns but I still can't get what I want.

How should I modify my code to get the results I expect? Thanks!

Wilson

1 Answer


Use a [^"]* negated character class after the first " to stay within the double-quoted substring and get to the last http, then add it at the end, too, to get to the trailing " (note: this will only work if there are no escape sequences in the string).

import re
pat = r'"[^"]*(http.*?\.(?:jpg|bmp))[^"]*"'
reg = re.compile(pat)
aa = '"http:afd/aa.bmp" :tt: "kkkk"  ++, "http--test--http:kk/bb.jpg"'
print(reg.findall(aa))
# => ['http:afd/aa.bmp', 'http:kk/bb.jpg']

See the Python demo online.

Pattern details:

  • " - a literal double quote
  • [^"]* - 0+ chars other than a double quote, as many as possible, since * is a greedy quantifier
  • (http.*?\.(?:jpg|bmp)) - Group 1 (extracted with re.findall) that matches:
    • http - a literal substring http
    • .*? - any 0+ chars, as few as possible (as *? is a lazy quantifier)
    • \. - a literal dot
    • (?:jpg|bmp) - a non-capturing group (so that the text it matches is not included in the re.findall output) matching either the jpg or the bmp substring
  • [^"]* - 0+ chars other than a double quote, as many as possible
  • " - a literal double quote
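To make the role of the non-capturing group concrete, here is a minimal sketch comparing the two variants on the string from the question. The variable names `two_groups` and `one_group` are only illustrative; the behaviour shown is the documented `re.findall` rule that a pattern with several capturing groups yields tuples, while a pattern with exactly one group yields plain strings.

```python
import re

aa = '"http:afd/aa.bmp" :tt: "kkkk"  ++, "http--test--http:kk/bb.jpg"'

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r'"[^"]*(http.*?\.(jpg|bmp))[^"]*"', aa)

# One capturing group plus a non-capturing (?:...) group:
# findall returns plain strings.
one_group = re.findall(r'"[^"]*(http.*?\.(?:jpg|bmp))[^"]*"', aa)

print(two_groups)  # [('http:afd/aa.bmp', 'bmp'), ('http:kk/bb.jpg', 'jpg')]
print(one_group)   # ['http:afd/aa.bmp', 'http:kk/bb.jpg']
```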
Wiktor Stribiżew
  • Hi Wiktor, thanks for the detailed explanation. I have one more question about the second extracted string, the shortest string starting with http. A simple test is aa='http--test--http:kk/bb.jpg', pat='http.*?\.jpg'. Why can't I get "http:kk/bb.jpg"? I still get 'http--test--http:kk/bb.jpg' even though I used .*?, which should match a string as short as possible – Wilson May 06 '17 at 06:26
  • You are confused by some SO answers stating that lazy patterns match the shortest substring - *it is not true*. The regex engine works from left to right, and when you use `http.*?\.jpg`, it will first find the left-most `http` and then will match *as many chars as necessary* to get to the first `.jpg`. See [how the regex engine works](https://regex101.com/r/Al8u6w/1/debugger). Use a [tempered greedy token](http://stackoverflow.com/a/37343088/3832970), see [the `http(?:(?!http).)*?\.jpg` regex demo](https://regex101.com/r/xjkYGi/1). – Wiktor Stribiżew May 06 '17 at 08:33
  • But for the second substring in the original question, why does the pattern in your answer output the substring starting at the right-most http, http:kk/bb.jpg? I tried to trace the matching procedure with regex101, and the results show the full match for the second string (https://regex101.com/r/xjkYGi/1/debugger). Thanks so much – Wilson May 07 '17 at 15:07
  • You ask why `http(?:(?!http).)*?\.jpg` extracts `http:kk/bb.jpg` from `http--test--http:kk/bb.jpg`? Because after matching `http`, the `(?:(?!http).)*?` will match any 0+ chars, as few as possible, that do not start a `http` sequence. That is, that pattern matches any text but `http` up to `.jpg`. – Wiktor Stribiżew May 07 '17 at 17:39
  • Sorry, I just found that the link I provided doesn't show the correct example (I need to fork it to show the correct one). The example link is "https://regex101.com/r/nOBrp5/1", the first example you replied with. I'm just curious why the pattern "[^"]*(http.*?\.(?:jpg|bmp))[^"]*" can extract the substring "http:kk/bb.jpg" without using a tempered greedy token. – Wilson May 08 '17 at 02:42
  • See my pattern explanation in the answer. Pay attention to the first **`[^"]*` - 0+ chars other than a double quote, *as many as possible*, since `*` is a *greedy* quantifier**. "As many as possible" means it moves the regex index to the end of the matching string (to the end or to the first `"`) and then backtracks, trying to accommodate some chars for the subsequent subpatterns. So, if your values are always at the end of the double quoted substring, use the one above; else, use a tempered greedy token. – Wiktor Stribiżew May 08 '17 at 06:27
  • Ok, I see. Appreciate your detailed explanations so much! – Wilson May 08 '17 at 11:31
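The three behaviours discussed in this thread can be compared side by side with a short sketch. The `two_values` input is a hypothetical example that does not appear in the question, and using `[^"]` in place of `.` inside the tempered greedy token is an assumption made here so that the match cannot cross a quote.

```python
import re

s = 'http--test--http:kk/bb.jpg'

# Lazy quantifier: the match still starts at the left-most "http".
lazy = re.findall(r'http.*?\.jpg', s)

# Tempered greedy token: no "http" may start inside the matched body.
tempered = re.findall(r'http(?:(?!http).)*?\.jpg', s)

# Greedy [^"]* prefix: relies on the surrounding quotes and backtracks
# from the right, so the right-most "http" is tried first.
prefixed = re.findall(r'"[^"]*(http.*?\.jpg)[^"]*"', '"' + s + '"')

# Hypothetical input with two values inside one quoted substring.
two_values = '"http:a/x.jpg and http:b/y.jpg"'
prefixed_two = re.findall(r'"[^"]*(http.*?\.jpg)[^"]*"', two_values)
tempered_two = re.findall(r'http(?:(?!http)[^"])*?\.jpg', two_values)

print(lazy)          # ['http--test--http:kk/bb.jpg']
print(tempered)      # ['http:kk/bb.jpg']
print(prefixed)      # ['http:kk/bb.jpg']
print(prefixed_two)  # ['http:b/y.jpg']
print(tempered_two)  # ['http:a/x.jpg', 'http:b/y.jpg']
```

In this sketch the greedy-prefix pattern captures only one value per quoted substring, the one beginning at the right-most http for which the rest of the pattern succeeds, whereas the tempered greedy token picks up each value separately.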