2

For example, I have a string like '(10 + 20) / (10 + 20)'.

I want to match (10 + 20), so I wrote a script like this:

import re

text = '(10 + 20) / (10 + 20)'
test1 = re.findall(r'(.*)', text)   # greedy
test2 = re.findall(r'(.+?)', text)  # lazy

# Print every result on one line, then end the line.
for i in test1:
    print(i, end='')
else:
    print()

for i in test2:
    print(i, end='')
else:
    print()

And the output is this:

(10 + 20) / (10 + 20)                                                                                                                       
(10 + 20) / (10 + 20)

I don't understand. Isn't .+? non-greedy?

Remi Guan

3 Answers

4

The round brackets in a regex pattern must be escaped with \ to match literal round brackets:

test2 = re.findall(r'\(.+?\)', text) 


A "raw" string literal does not remove the need to escape special regex characters. It only means that a single backslash is enough to escape them, instead of two.

See this excerpt from 6.2.5.8. Raw String Notation:

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:

>>> re.match(r"\W(.)\1\W", " ff ")
<_sre.SRE_Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
<_sre.SRE_Match object; span=(0, 4), match=' ff '>

The docs say "usually", which does not mean you have to use raw string literals every time.
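To see the raw-string point for the pattern from the question, here is a minimal sketch. The variable names are made up for illustration:

```python
import re

# The same two-character pattern (a backslash followed by a parenthesis),
# written as a raw literal and as an ordinary literal.
raw_open = r'\('     # raw literal: one backslash in the source
plain_open = '\\('   # ordinary literal: the backslash is doubled

print(raw_open == plain_open)  # True
print(len(raw_open))           # 2

# Both spellings give the regex engine the same pattern,
# which matches a literal opening parenthesis.
raw_match = re.match(raw_open, '(')
plain_match = re.match(plain_open, '(')
print(raw_match is not None and plain_match is not None)  # True
```

Either way the regex engine receives `\(`. The raw form just keeps the source easier to read.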

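It is true that .+? is a lazy pattern: it matches one or more characters other than a newline, but as few as possible. In the original pattern the unescaped parentheses form a capturing group, so (.+?) matched a single character at a time and findall returned every character of the string, which is why both outputs looked the same.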
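A small sketch of what the two original (unescaped) patterns actually return may help. It reuses the string from the question, and the variable names are only illustrative:

```python
import re

text = '(10 + 20) / (10 + 20)'

# Unescaped parentheses create a capturing group, and findall
# returns the contents of that group for every match.
greedy_group = re.findall(r'(.*)', text)
lazy_group = re.findall(r'(.+?)', text)

# Greedy: the whole string, then an empty match at the end.
print(greedy_group)

# Lazy: one character per match, so joining them rebuilds the string.
print(lazy_group[:4])
print(''.join(lazy_group) == text)  # True
```

Printing either list with end='' therefore produces the full string, even though the two patterns match very differently.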

Wiktor Stribiżew
  • Thanks, I understand now :) – Remi Guan Sep 16 '15 at 09:32
  • There is one thing of interest in Python: when an escape sequence cannot be parsed as an escape sequence, the backslash is treated as a literal. `'\('` and `'\\('` will be printed the same, and no error will be thrown. So, using a raw string literal here is optional, but it is still the best practice. – Wiktor Stribiżew Sep 16 '15 at 09:39
  • So whether or not I strictly need a raw string, using one is always a good choice, right? – Remi Guan Sep 16 '15 at 09:43
  • @KevinGuan: Unless you use Unicode or very simple patterns, yes. I added more description from the Python 3 re reference. – Wiktor Stribiżew Sep 16 '15 at 10:21
2
>>> import re
>>> re.findall(r'\([^()]+\)', '(10 + 20) / (10 + 20)')
['(10 + 20)', '(10 + 20)']

The dialect of regular expressions used in the re module cannot match arbitrarily nested parentheses, so [^()], which matches any character except a parenthesis, is always valid here.

Note: you don't need to escape () inside [] that defines a set of characters.
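As a rough illustration of both points, here is the same pattern run on a nested input. The `nested` string is a made-up example, not from the question:

```python
import re

nested = '((1 + 2) * 3) / (4 + 5)'

# [^()]+ cannot cross a parenthesis, so only the innermost
# (non-nested) pairs are matched.
innermost = re.findall(r'\([^()]+\)', nested)
print(innermost)  # ['(1 + 2)', '(4 + 5)']

# Inside [] the parentheses need no escaping; escaping them is
# allowed and gives the same result.
escaped_set = re.findall(r'\([^\(\)]+\)', nested)
print(escaped_set == innermost)  # True
```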

jfs
1

You need to escape ( and ) like this:

`\(.*\)`

and this

`\(.+?\)`

The first one is greedy and will match up to the last possible ); the other one is non-greedy and will stop at the first ).
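Running both escaped patterns on the string from the question shows the difference. This is a minimal sketch with illustrative variable names:

```python
import re

text = '(10 + 20) / (10 + 20)'

# Greedy: .* runs to the end and backtracks to the last ')'.
greedy = re.findall(r'\(.*\)', text)
print(greedy)  # ['(10 + 20) / (10 + 20)']

# Lazy: .+? stops at the first ')' it can reach.
lazy = re.findall(r'\(.+?\)', text)
print(lazy)    # ['(10 + 20)', '(10 + 20)']
```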

Lawrence Benson