
I have this piece of code:

for n in (range(1,10)):
    new = re.sub(r'(regex(group)regex)?regex', r'something'+str(n)+r'\1', old, count=1)

It throws an "unmatched group" error. When the group is unmatched, I want an empty string inserted there instead of an error being thrown. How can I achieve this?

Note: My full code is much more complicated than this example. If you know a better way to iterate over the matches and insert a number into each one, please share it. My full code:

for n in (range(1,(text.count('soutez')+1))):
    text = re.sub(r'(?i)(\s*\{{2}infobox medaile reprezentant(ka)?\s*\|\s*([^\}]*)\s*\}{2}\s*)?\{{2}infobox medaile soutez\s*\|\s*([^\}]*)\s*\}{2}\s*', r"\n | reprezentace"+str(n)+r" = \3\n | soutez"+str(n)+r" = \4\n | medaile"+str(n)+r" = \n", text, count=1)
aleskva
  • If it's unmatched, you want to add an empty string? Where? Replace the whole thing with one? – Michal Frystacky Feb 19 '16 at 22:34
  • Replace `(group)?` with `(group|)` – Wiktor Stribiżew Feb 19 '16 at 22:35
  • @MichalFrystacky empty string instead of group (that means instead of `\1`) – aleskva Feb 19 '16 at 22:35
  • @WiktorStribiżew wait, I'll update example. It actually is `(string(group)string)?` – aleskva Feb 19 '16 at 22:37
  • Ok, then use `(string(group)string|)`. I can't compile your code, otherwise I would have posted an answer already. – Wiktor Stribiżew Feb 19 '16 at 22:38
  • @aleskva Can you make a test case for the simple example that would throw that error? – Michal Frystacky Feb 19 '16 at 22:38
  • I tried to replace them with no success. I'll make a testcase and paste it here. – aleskva Feb 19 '16 at 22:45
  • @MichalFrystacky Here is a testcase for my case (my full code): http://pastebin.com/jDSijyXe I replaced `)?` by `|)` but unfortunately with no success – aleskva Feb 19 '16 at 23:01
  • Instead of writing these ugly `\{{2}` and `\}{2}`, write `{{` and `}}` (no backslashes needed). Without your original string and the output you want, it isn't possible to help you to rewrite your pattern. (that in my opinion is probably complicated for nothing). – Casimir et Hippolyte Feb 19 '16 at 23:14
  • @CasimiretHippolyte I am just used to it from another programming language's regex library, but it still has no effect, since the error is still there. Please see the pastebin link above your comment for the specific piece of code. – aleskva Feb 19 '16 at 23:19
  • "No effect" is the expected result of this change. It proves that the backslashes and a quantifier for only 2 occurrences are useless. I will take a look at your pastebin. – Casimir et Hippolyte Feb 19 '16 at 23:24
  • @CasimiretHippolyte I understand. Thank you for your little correction – aleskva Feb 20 '16 at 11:50

3 Answers

8

Root cause

Before Python 3.5, re.sub did not substitute an empty string for a backreference to a capture group that failed to match; see the description of Bug 1519638 at bugs.python.org. As a result, using a backreference to a group that did not participate in the match raised an error.
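
A quick way to see which behavior a given interpreter has is to run the failing substitution inside a try/except. This is only a minimal sketch with illustrative variable names; on Python 3.5 or later the substitution should succeed, while older versions should take the except branch.

```python
import re

pattern = r'regex(group)?regex'

# The optional group does not participate in the match, so it is None.
match = re.search(pattern, 'regexregex')
print(match.group(1))

try:
    # Python 3.5+: the unmatched \1 is replaced with an empty string.
    result = re.sub(pattern, r'something\1something', 'regexregex')
except re.error as exc:
    # Before Python 3.5: the "unmatched group" error ends up here.
    result = 'error: %s' % exc

print(result)
```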

There are two ways to fix that issue.

Solution 1: Adding empty alternatives to make optional groups obligatory

You can replace all optional capturing groups (those constructs like (\d+)?) with obligatory ones with an empty alternative (i.e. (\d+|)).

Here is an example of the failure:

import re
old = 'regexregex'
new = re.sub(r'regex(group)?regex', r'something\1something', old)
print(new)

Replacing the re.sub line with

new = re.sub(r'regex(group|)regex', r'something\1something', old)

makes it work: group 1 now matches an empty string, so \1 is always defined.
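
The limits of the empty alternative show up as soon as groups are nested. The sketch below uses made-up patterns in the spirit of the `(string(group)string|)` form from the comments: the outer group becomes an empty string, but the inner one is never reached and stays None.

```python
import re

# Simple case: the empty alternative makes the group match ''.
simple = re.match(r'regex(group|)regex', 'regexregex')
print(simple.groups())

# Nested case: the outer group takes its empty branch, so the inner
# group is never reached and remains None.
nested = re.match(r'(string(group)string|)regex', 'regex')
print(nested.groups())
```

A backreference to the inner group would therefore still be unmatched, which is why a check on the replacement side can be needed.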

Solution 2: Using lambda expression in the replacement and checking if the group is not None

This approach is necessary if you have optional groups inside another optional group.

You can pass a lambda as the replacement and check whether the group was initialized (is not None) with lambda m: m.group(n) or ''. Use this solution in your case: the replacement pattern has two backreferences, \3 and \4, but in some matches (see Match 1 and 3 in the regex demo linked in the comments) Capture group 3 is not initialized. This happens because the whole first part, (\s*\{{2}funcA(ka|)\s*\|\s*([^}]*)\s*\}{2}\s*|), takes its empty alternative, so the inner Capture group 3 (i.e. ([^}]*)) is never populated, even after the empty alternative has been added.

re.sub(r'(?i)(\s*\{{2}funcA(ka|)\s*\|\s*([^\}]*)\s*\}{2}\s*|)\{{2}funcB\s*\|\s*([^\}]*)\s*\}{2}\s*', 
r"\n | funcA"+str(n)+r" = \3\n | funcB"+str(n)+r" = \4\n | string"+str(n)+r" = \n", 
text, 
count=1)

should be rewritten as shown here. The string returned by a replacement function is inserted literally, so the newlines are written as ordinary "\n" escapes rather than raw r"\n":

re.sub(r'(?i)(\s*{{funcA(ka|)\s*\|\s*([^}]*)\s*}}\s*|){{funcB\s*\|\s*([^}]*)\s*}}\s*',
       lambda m: "\n | funcA"+str(n)+" = "+(m.group(3) or '')+"\n | funcB"+str(n)+" = "+(m.group(4) or '')+"\n | string"+str(n)+" = \n",
       text,
       count=1)
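
Stripped down to the short example from Solution 1, the same idea can be sketched like this (the variable names are illustrative). The `or ''` turns a None group into an empty string before concatenation.

```python
import re

def replace(old):
    # m.group(1) is None when the optional group did not match.
    return re.sub(r'regex(group)?regex',
                  lambda m: 'something' + (m.group(1) or '') + 'something',
                  old)

without_group = replace('regexregex')
with_group = replace('regexgroupregex')
print(without_group)
print(with_group)
```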

See IDEONE demo

import re
 
text = r'''
 
{{funcB|param1}}
*some string*
{{funcA|param2}}
{{funcB|param3}}
*some string2*
 
{{funcB|param4}}
*some string3*
{{funcAka|param5}}
{{funcB|param6}}
*some string4*
'''
 
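for n in range(1, text.count('funcB') + 1):
    text = re.sub(r'(?i)(\s*\{{2}funcA(ka|)\s*\|\s*([^\}]*)\s*\}{2}\s*|)\{{2}funcB\s*\|\s*([^\}]*)\s*\}{2}\s*',
    # Plain (non-raw) strings: a function's return value is inserted literally,
    # so "\n" must already be a real newline here.
    lambda m: "\n | funcA"+str(n)+" = "+(m.group(3) or '')+"\n | funcB"+str(n)+" = "+(m.group(4) or '')+"\n | string"+str(n)+" = \n",
    text,
    count=1)

expected = r'''
| funcA1 =
| funcB1 = param1
| string1 =
*some string*
| funcA2 = param2
| funcB2 = param3
| string2 =
*some string2*
| funcA3 =
| funcB3 = param4
| string3 =
*some string3*
| funcA4 = param5
| funcB4 = param6
| string4 =
*some string4*
'''

def squash(s):
    # Compare line by line, ignoring blank lines and surrounding whitespace.
    return [line.strip() for line in s.splitlines() if line.strip()]

assert squash(text) == squash(expected)
print('ok')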
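
The question also asks for a better way to number the matches. One possible alternative, sketched here under the assumption that the matches do not depend on earlier replacements, is a single re.sub pass with a counter instead of a loop with count=1. The names PATTERN and number_entries are hypothetical, and the braces are escaped individually only for readability.

```python
import itertools
import re

# Same structure as the pattern above; the name PATTERN is only for this sketch.
PATTERN = re.compile(
    r'(?i)(\s*\{\{funcA(ka|)\s*\|\s*([^}]*)\s*\}\}\s*|)'
    r'\{\{funcB\s*\|\s*([^}]*)\s*\}\}\s*'
)

def number_entries(text):
    """Replace every match in one pass, numbering them 1, 2, 3, ..."""
    counter = itertools.count(1)

    def repl(m):
        n = next(counter)  # advances once per match
        return ("\n | funcA%d = %s\n | funcB%d = %s\n | string%d = \n"
                % (n, m.group(3) or '', n, m.group(4) or '', n))

    return PATTERN.sub(repl, text)

sample = '{{funcB|p1}}\nx\n{{funcA|p2}}\n{{funcB|p3}}\ny\n'
result = number_entries(sample)
print(result)
```

Because the text is scanned only once, there is no need to count occurrences of 'funcB' up front.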
Wiktor Stribiżew
  • Although it is a really clever solution for simpler cases, in my case I tried it with no luck. Please see the testcase in my comment under the question. – aleskva Feb 19 '16 at 23:05
  • My answer is still valid for your case. See [this regex demo](https://regex101.com/r/vD8zM5/2). There are matches where Backreference 3 is not initialized (see Match 1 and 3). It happens because the whole first part - `(\s*\{{2}funcA(ka|)\s*\|\s*([^}]*)\s*\}{2}\s*|)` - is not participating in the match. Basically, this is described [here](https://bugs.python.org/issue1519638). It seems you need to rewrite the regex, just adding empty alternatives is not enough. – Wiktor Stribiżew Feb 19 '16 at 23:27
  • Sure it is, but why is it still throwing an error in Python? – aleskva Feb 19 '16 at 23:36
  • Check [this code](http://ideone.com/KhGHMc). The error appears because the backreference in the replacement pattern (#3) was not matched at all. – Wiktor Stribiżew Feb 19 '16 at 23:37
  • Solution 2 finally works! I had thought it could be solved by combination of matchobject.group() and lambda before I asked, but I had been still confused by unmatched group error. Now I can see clearly it is as simple as possible, just needed to add ` or ''`. Thank you – aleskva Feb 20 '16 at 11:46
  • This may fix the problem in its current code form, but it is not a general solution for making a group EMPTY instead of NULL. –  Feb 20 '16 at 18:48
  • @sln: It does. My Approach 1 is what you suggest in your answer. Approach 2 is the universal workaround. – Wiktor Stribiżew Feb 20 '16 at 18:49
  • Your approaches are not a general solution. For approach 1 `(group(group|)|)` results in group 2 being NULL. For approach 2, there is no need for logic on the replacement side at all. Following the 2 _rules_ I laid out is the _universal_ solution. It's design time, not runtime. –  Feb 20 '16 at 19:25
  • If you read what I wrote in the beginning of my answer, you will see that your rules are just a more complicated way of describing the root cause. – Wiktor Stribiżew Feb 20 '16 at 21:19
  • I hate to sound like a broken record but using those _rules_ tells you right away that the op's capture group 3 will never be EMPTY and that he/she would have to test the capture groups. It's _not_ form that dictates this. Rule 2 example: for `((a?))?` and `((a)?)?`, group 1 will _never_ be NULL because their contents will always match. In this `(a)?` the group will _always_ be NULL when the contents don't match. Rule 1 overrides rule 2, ie. in `(?:y?|((a?))?)` group 1 will _always_ be NULL. It's intuitive, but can be complex unless you understand the _rules_. –  Feb 21 '16 at 00:42
0

I looked at this again.
It is unfortunate that you have to deal with NULLs,
but here are the rules you must follow.

All of the patterns below successfully match the empty string.
You have to run tests like these to find out the rules.

It's not as simple as you may think. Take a close look at the results.
There is no obvious, reliable way to tell from the form of a pattern alone whether you will get NULL or EMPTY.

However, on closer inspection, the rules emerge and are fairly simple.
These rules must be followed if you care about NULL.

There are only two rules:

Rule # 1 - Any code GROUP that can't be reached, will result in NULL

   (?<Alt_1>                     # (1 start)
        (?<a> a )?                    # (2)
        (?<b> b? )                    # (3)
   )?                            # (1 end)
|  
   (?<Alt_2>                     # (4 start)
        (?<c> c? )                    # (5)
        (?<d> d? )                    # (6)
   )                             # (4 end)
 **  Grp 0         -  ( pos 0 , len 0 )  EMPTY 
 **  Grp 1 [Alt_1] -  ( pos 0 , len 0 )  EMPTY 
 **  Grp 2 [a]     -  NULL 
 **  Grp 3 [b]     -  ( pos 0 , len 0 )  EMPTY 
 **  Grp 4 [Alt_2] -  NULL 
 **  Grp 5 [c]     -  NULL 
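
To check Rule 1 against Python's re module, the pattern can be transcribed with Python's `(?P<name>...)` named-group syntax, since the `(?<name>...)` form shown above is not accepted by re. This is a sketch of such a transcription, not the tool used to produce the listing; in Python, NULL corresponds to None and EMPTY to ''.

```python
import re

# Rule 1 pattern in Python syntax; whitespace is ignored under re.VERBOSE.
rule1 = re.compile(r'''
      (?P<Alt_1> (?P<a> a )? (?P<b> b? ) )?
    | (?P<Alt_2> (?P<c> c? ) (?P<d> d? ) )
''', re.VERBOSE)

named = rule1.match('').groupdict()
for name in ('Alt_1', 'a', 'b', 'Alt_2', 'c', 'd'):
    # None means the group never participated; '' means it matched nothing.
    print(name, repr(named[name]))
```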

Rule # 2 - Any code GROUP that can't be matched on the INSIDE, will result in NULL

 (?<A_1>                       # (1 start)
      (?<a1> a? )                   # (2)
 )?                            # (1 end)
 (?<A_2>                       # (3 start)
      (?<a2> a )?                   # (4)
 )?                            # (3 end)
 (?<A_3>                       # (5 start)
      (?<a3> a )                    # (6)
 )?                            # (5 end)
 (?<A_4>                       # (7 start)
      (?<a4> a )?                   # (8)
 )                             # (7 end)
**  Grp 0       -  ( pos 0 , len 0 )  EMPTY 
**  Grp 1 [A_1] -  ( pos 0 , len 0 )  EMPTY 
**  Grp 2 [a1]  -  ( pos 0 , len 0 )  EMPTY 
**  Grp 3 [A_2] -  ( pos 0 , len 0 )  EMPTY 
**  Grp 4 [a2]  -  NULL 
**  Grp 5 [A_3] -  NULL 
**  Grp 6 [a3]  -  NULL 
**  Grp 7 [A_4] -  ( pos 0 , len 0 )  EMPTY 
**  Grp 8 [a4]  -  NULL 
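
Rule 2 can be checked in Python the same way, here with plain numbered groups. The sketch also shows `Match.groups('')`, a documented way to get '' instead of None for every group that did not participate, which is useful inside a replacement function.

```python
import re

# Rule 2 pattern with numbered groups 1-8, matched against the empty string.
rule2 = re.compile(r'((a?))?((a)?)?((a))?((a)?)')
m = rule2.match('')

raw = m.groups()        # None marks groups that did not participate
filled = m.groups('')   # the default argument replaces None with ''
print(raw)
print(filled)
```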
  • Your solution has the same effect as replacing `(group)?` by `(group|)`. It doesn't work for complicated cases, it works just for simple ones. Overall thanks – aleskva Feb 20 '16 at 11:39
  • @aleskva - Ok, added some rules to determine where, when and how a group will be NULL or EMPTY. From the two rules, you can see the solution is _not_ form based at all. –  Feb 20 '16 at 18:44
0

To simplify:

Problem

  1. You are getting the error "sre_constants.error: unmatched group" from a Python 2.7 regex.
  2. You have any regex pattern with optional groups (with or without nested expressions) and are trying to use those groups in your sub replacement argument (re.sub(pattern, *repl*, string) or compiled.sub(*repl*, string))

Solution:

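For results, return match.group(1) instead of \1 (or 2, 3, etc.). That's it; no `or ''` is needed as long as the group is the entire return value (concatenating a None group with other strings, as in the question, still requires it). The group result(s) can be returned with a function or a lambda.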

Example

You are using a common regex to strip C-style comments. Its design uses an optional group 1 to pass through pseudo-comments which should not be deleted (if they exist).

import re

pattern = r'//.*|/\*[\s\S]*?\*/|("(\\.|[^"])*")'
regex = re.compile(pattern)

Using \1 fails with the error: "sre_constants.error: unmatched group":

return regex.sub(r'\1', string)

Using .group(1) succeeds:

return regex.sub(lambda m: m.group(1), string)

For those not familiar with lambda, this solution is equivalent to:

def optgroup(match):
    return match.group(1)
return regex.sub(optgroup, string)

See the accepted answer for an excellent discussion of why \1 fails due to Bug 1519638. While the accepted answer is authoritative, it has two shortcomings: 1) the example from the original question is so convoluted that it makes the example solution difficult to read, and 2) it suggests returning a group or an empty string, which is not required here: you may merely call .group() on each match.
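
For reference, here is a self-contained sketch of the comment-stripping example. The function name strip_comments and the sample input are made up for illustration. This version keeps `or ''` so that the callback always returns a string, which is what the re.sub documentation describes; whether to drop it is the judgment call discussed above.

```python
import re

COMMENT_OR_STRING = re.compile(r'//.*|/\*[\s\S]*?\*/|("(\\.|[^"])*")')

def strip_comments(source):
    # Group 1 is set only when a string literal matched; for comments it is
    # None, and `or ''` turns that into an explicit empty replacement.
    return COMMENT_OR_STRING.sub(lambda m: m.group(1) or '', source)

code = 'int x = 1; // set x\nchar *s = "// not a comment"; /* block */'
stripped = strip_comments(code)
print(stripped)
```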

JeremyDouglass