Is there a reason for python regex not to compile r'(\s*)+'?

Question

I don't understand why '(\s*)+' gives an error 'nothing to repeat'. At the same time '(\s?)+' goes just fine.

I've discovered that this problem has been known about quite for some time (for example regex error - nothing to repeat ) but I still see it in Python 3.3.1.

So I am wondering if there is a rational explanation for this behavior.

In reality I want to match a line of repeated words or numbers, for example:

'foo foo foo foo'

I've come up with this:

'(\w+)\s+(\1\s*)+'

It failed because of the second group: (\1\s*)+ In most cases I would probably not have more than 1 space between words so (\1\s?)+ would work. For practical purposes this option also should work (\1\s{0,1000})+

Update: I think I should add that I've seen the problem in python only. In perl it works:

`('foo foo foo foo' =~ /(\w+)\s+(\1\s*)+/) `

Not sure it's equivalent but vim also works:

`\(\<\w\+\>\)\_s\+\(\1\_s*\)\+`

Update2: I found another implementation of regex for python which is said to replace current re someday. I checked and the error doesn't occur for the above problematic cases. This module has to be installed separately. It can be downloaded here or via pypi

As far as getting your problem fixed you should try this: http://stackoverflow.com/questions/17202233/remove-all-replicas-of-a-string-more-than-x-characters-long-regex — Slater Victoroff, Jul 10 '13 at 20:29
I don't know what Python's problem is, this works just fine in perl and PowerShell. Note, however, that what you have would match things like `foo foofoo` even if it did work. I'm assuming that's not what you want, since you're not matching `foofoo` or `foofoofoo` (in other words, the first instance has to be followed by whitespace but after that the words can be joined together). Try this regex: `(\w+)\s+(\1(\s+|$))+`. I suspect that's what you really want, and Python probably won't have a problem with it. — Adi Inbar, Jul 10 '13 at 21:02
@Adi thank you, that's a good point and yes, that is what I want. But unfortunately this still doesn't work in python — Phoenix, Jul 10 '13 at 21:09
Seriously? What error does it give you on that one? Based on Slater Tyranus's experimentation, it sounds like Python barfs on quantifying quantifiers that can match nothing, but doesn't have a problem with quantifying quantifiers that have to match at least one character. Since `(\1(\s+|$))` does have to match something, there should be no problem. Unless Python doesn't allow you to quantify $ because there has to be exactly one per string. Maybe `(\w+)\s+(\1(\s+|\b))+` will work? — Adi Inbar, Jul 10 '13 at 22:14
For both of them? Using all `+`'s and no `*`'s? It seems that Python's "nothing to repeat" error comes from trying to quantify a null match, but there's nothing in these regexes that can match a null. If these both fail as well, then Python's regex engine is seriously broken. I don't mean to start a religious war, but at this point it's *very* tempting to post an answer consisting of two words: "Use perl". — Adi Inbar, Jul 10 '13 at 23:57
What do you mean by both? `\b` and `$`? I've checked all the options that you've suggested. And I get the same error `nothing to repeat`. The problem happens only in cases when I have a group that can match to null and that group can be repeated `*` times. For example `'\s*'` and `'(\s)*'` work fine but `'(\s*)*'` gives an error. According to the traceback `*` resolves to `{0,MAXREPEAT}` and on that condition the error is raised. It means that 'dirty hack' can be to use `{0,MAXREPEAT-1}` instead of `*`. But that doesn't cancel the problem with regex engine. — Phoenix, Jul 11 '13 at 12:17
Interestingly, in Python 2.7.5, that regex compiles just fine. — Tim Pietzcker, Jul 15 '13 at 09:08

Slater Victoroff · Answer 1 · 2013-07-10T20:41:45.210

The problem that python has with this is primarily the null issue brought up in the linked post. If you're going to have at least one character I suggest instead using:

(\s+)+

That said, it also doesn't really make sense if you ask for (\s*)+ with the idea that + requires something to exist, and * does not. It doesn't quite make sense to match ? either, but you can resolve it mentally by saying it's an optional match meaning that if it doesn't find one it moves on, rather than * which interprets nothing as a matched pattern.

However, if you really want to check what Python's issue with something is I suggest playing around with ranges. For instance, I came to my conclusion by using these two examples:

re.compile("(\s{1,})+")

which is fine

re.compile("(\s{0,})+")

which fails in the same manner.

At the very least this means it is not a "bug" in Python. It is a conscious design decision that acts on every regex pattern that conceptually falls into this same pit. My guess (checked in a few different environments) is that (\s{0,})+ will reliably fail because it explicitly repeats a potentially null element.

However, it seems that a number of environments use * to indicate that a match is optional, and python does not follow this choice. It makes sense for many cases, but occasionally leads to weird behaviour. I think Guido made the right choice here, as having an inconsistent space presence means you've violated the pumping lemma and your pattern is no longer context free.

In this case it probably wouldn't matter much, but it means there would inevitably be an ambiguity in that regex that couldn't be resolved.

So you had a problem, then you chose to use regex to solve that problem. Now you have 2 problems, C'est la vie.

Thank you for the advises! I've tried similar things to make it work. I understand that `(\s*)+` doesn't make much sense. The actual thing I need is `(\1\s*)+`. And The reason I asked this question is because vim manages this regexp: `$\<\w\+\>$\_s\+$\1\_s*$\+` which I think equivalent (though I am not good in vim regex) — Phoenix, Jul 10 '13 at 21:00
@Phoenix Not quite equivalent, though quite close. Vim handles the \s* a little differently. If you want to do the same thing you should replace all your multiple spaces with single spaces and then use the `?` operator instead. — Slater Victoroff, Jul 10 '13 at 21:02
But it does make sense if something precedes the \s in the match group, as his example shows, but he says that also fails in Python. `(\1\s*)+` matches the value of the first group with or without any amount of trailing whitespace one or more times, whereas `(\1\s+)+` will only match it if it's followed by whitespace. So it is useful to be able to do that. I do suspect though, that this is closer to what he wants, but the problem is that it won't match a repetition of the word at the end of the string, which is why I suggested `(\1(\s*|$))` above. — Adi Inbar, Jul 10 '13 at 21:07

score 0 · Answer 2 · answered Jul 10 '13 at 20:42

0

Slater has given a good overview of the problem, but I just wanted to add that if you think about it, that theoretically matches an infinite number of empty spaces on the first empty space that it encounters. If you could compile that expression, applying it might very well just result in an infinite loop before the first character is even seen. So not only is it not a bug, it's a good thing.

answered Jul 10 '13 at 20:42

a p

3,098
2
24
46

Uhh... No... that would never happen, but it is a good thing. – Slater Victoroff Jul 10 '13 at 20:59
And what is the difference with `(\s?)+` which is ok but can result in the same infinite loop. As I've replied to Slater, what bother me is that in vim equivalent regexp works fine: `$\<\w\+\>$\_s\+$\1\_s*$\+` – Phoenix Jul 10 '13 at 21:04
I don't see how that would result in an infinite loop. First the `\s*` matches zero or more whitespace characters (as many as there are consecutively), then the + matches that one or more times. On its own, with nothing else inside the parentheses, the + is redundant (as far as I can tell `(\s*)+` does the exact same thing as `(\s*)`), but I don't see an infinite loop. Other languages that are capable of parsing this regex seem to agree with me. – Adi Inbar Jul 10 '13 at 21:17
Slater: Of course it wouldn't be implemented like that (look at the fact that `()+` doesn't fail in the same fashion). I was just pointing out that there's an intuitive reason, to me at least, for this not to be legit. Phoenix: As above, explicitly saying `()+` also works, so it probably has something to do with the variability of the repeated pattern's (non)existence. – a p Jul 10 '13 at 21:20
BTW...how would the regex match an infinite number of spaces on the first space it encounters, when the string doesn't have an infinite number of empty spaces in it? A regex can only match things that are actually there. ;) – Adi Inbar Jul 10 '13 at 21:22
Maybe I'm missing something, but I think if you have e.g. a DFA with a transition to a state on `ε`, and that state can again transition to itself on `ε`, you have a potential infinite loop. That was my reasoning at least, and it's just my intuition, not backed by any in-depth knowledge of `re`'s implementation. – a p Jul 10 '13 at 23:54

Is there a reason for python regex not to compile r'(\s*)+'?

2 Answers2