I tried to match .* with a C# regular expression, and it turns out that it matches any string twice: first the full string, then a second time an empty string. I expected .* to match everything in a single match. I'm completely puzzled as to why that happens and how to prevent it.

Long story: I need to replace parts of filenames, with the option of replacing the whole name unconditionally with a given replacement string. Using an empty string as the pattern matches at every position and inserts the replacement at each of them, as described in the Regex.Replace documentation. Therefore I substitute .* for the empty pattern before replacing. But this turns out to perform the replacement twice.
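The two behaviours described above can be reproduced with a small sketch. The class name `ReplaceDemo`, the helper `ReplaceAll`, and the replacement string "X" are made up for illustration and are not part of the original code:

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // Hypothetical helper: replaces every match of pattern in input.
    public static string ReplaceAll(string input, string pattern, string replacement)
    {
        return Regex.Replace(input, pattern, replacement);
    }

    public static void Main()
    {
        // Empty pattern: an empty match exists at every position,
        // so the replacement is inserted at each of them.
        Console.WriteLine(ReplaceAll("abc", "", "X"));        // XaXbXcX

        // ".*": one match for the whole string, then one empty
        // match at the end, so the replacement appears twice.
        Console.WriteLine(ReplaceAll("sometext", ".*", "X")); // XX
    }
}
```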

To demonstrate what is going on I used:

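// Requires: using System;
//           using System.Text.RegularExpressions;
string input = "sometext";
string pattern = ".*";

MatchCollection matches = Regex.Matches(input, pattern);

// Print every match in brackets so that empty matches are visible.
foreach (Match match in matches)
{
    Console.WriteLine("[{0}]", match.Groups[0].Value);
}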

which yields:

[sometext]
[]
  • Why does it match the empty string a second time when it has already matched the whole string?
  • What regex or flags do I have to use to get only a single match/replacement?
trapicki
  • Welcome to StackOverflow! Please see ["Should questions include “tags” in their titles?"](http://meta.stackexchange.com/questions/19190/should-questions-include-tags-in-their-titles), where the consensus is "no, they should not", please try to find a more meaningful title for your question! –  Dec 09 '14 at 14:40
  • @Alex Also [this question](http://stackoverflow.com/questions/148518/how-to-regex-search-replace-only-first-occurrence-in-a-string-in-net) for the "how do I stop it" half of the question. – Rawling Dec 09 '14 at 14:56
  • As [http://stackoverflow.com/questions/8604286/string-replaceall-anomaly-with-greedy-quantifiers-in-regex] explains, regex engines differ in this behaviour, and I come from the sed, awk, grep regex world where this does not occur. It would be nice if that question could be changed to reflect which engines behave like that and which do not. – trapicki Dec 09 '14 at 15:18
  • Please do not close the question *that* fast. I had no time to post my own solution to the problem. – trapicki Dec 09 '14 at 15:23
  • "String.replaceAll() anomaly with greedy quantifiers in regex" does not explain what is going on. Not helpful here. – trapicki Dec 09 '14 at 15:25
  • @trapicki It explains *exactly* what is going on. It even explains your questions about why it only matches twice, no more. – Rawling Dec 09 '14 at 15:35

2 Answers

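  • Why does it match the empty string a second time when it has already matched the whole string?

    Because the regex is .*, which matches zero or more occurrences of any character. The first match consumes the whole input. The engine then tries again at the end of the input, where zero occurrences still count as a match, so it returns an empty string.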
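To see where each match sits, it helps to print the position and length of every match. The following is only a sketch; the class `MatchPositions` and the helper `Describe` are hypothetical names. As far as I know, after an empty match the .NET engine moves the search position forward before trying again, which is why there is exactly one empty match at the end rather than infinitely many.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchPositions
{
    // Hypothetical helper: returns "index:length" for every match.
    public static List<string> Describe(string input, string pattern)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            result.Add(m.Index + ":" + m.Length);
        }
        return result;
    }

    public static void Main()
    {
        // ".*" matches 8 characters at index 0, then 0 characters at index 8.
        Console.WriteLine(string.Join(", ", Describe("sometext", ".*"))); // 0:8, 8:0

        // ".+" needs at least one character, so nothing matches at index 8.
        Console.WriteLine(string.Join(", ", Describe("sometext", ".+"))); // 0:8
    }
}
```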

Solution

Use

.+

which matches one or more characters, so it cannot produce an empty match at the end of the input:

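// Requires: using System;
//           using System.Text.RegularExpressions;
string text = "sometext";
string expression = ".+";

MatchCollection matches = Regex.Matches(text, expression);

// Print every match in brackets so that empty matches would be visible.
foreach (Match match in matches)
{
    Console.WriteLine("[{0}]", match.Groups[0].Value);
}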

This gives the output:

[sometext]
nu11p01n73R
  • I still want to know why the 0-length match happens. – Stilgar Dec 09 '14 at 14:48
  • He told you. "the zero occurence give a null string, from the end of string input" – John Saunders Dec 09 '14 at 14:49
  • @Stilgar That is what I wrote at the start of the answer. It can match zero characters at the end of the input string – nu11p01n73R Dec 09 '14 at 14:50
  • Still. It matches "zero or more" occurrences. Then why doesn't it return a match for one, two, three, ... characters? Likewise, '+' matches "one or more occurrences". Why doesn't it return a match for a single character? – Kevin Gosse Dec 09 '14 at 14:52
  • @KooKiz Good question. Because `.*` is greedy and will consume every character. There is nothing left to match the second time, but it can still match zero characters, which counts as a success – nu11p01n73R Dec 09 '14 at 14:54
  • OK why can it match 0 characters. Why doesn't the first match swallow the 0 characters? – Stilgar Dec 09 '14 at 14:54
  • OK even then why aren't there infinitely many 0-length matches. I mean if the first match stops and then a next attempt is made that matches nothing why isn't there a third attempt that again matches nothing. – Stilgar Dec 09 '14 at 14:56
  • @Stilgar That is down to greediness. If it can consume, it will consume as much as possible. That is, it will match the maximum number of characters it can – nu11p01n73R Dec 09 '14 at 14:56
  • No, it is not related to greediness. If greediness were enough, we would have infinitely many matches. See the answer @Alex K. suggested as a duplicate. – Stilgar Dec 09 '14 at 14:59
  • @Stilgar Greediness as in *why doesn't it return a match for one, two, three, .* But I think I wrote something similar to what is provided in that answer: `.*` attempts to match zero length at the end of the string – nu11p01n73R Dec 09 '14 at 15:04
  • @Stilgar *OK why can it match 0 characters. Why doesn't the first match swallow the 0 characters?* This is where greediness comes into play, and not at the end. Sorry, I provided some confusing comments – nu11p01n73R Dec 09 '14 at 15:05

As "String.replaceAll() anomaly with greedy quantifiers in regex" explains in detail, once the greedy .* has consumed the whole input, the C#/.NET engine tries once more at the end of the string, and .* matches the empty string there too.

My solution is to anchor the pattern: ^.*$. This does the job and seems the most readable: it says "match everything from the beginning to the end, once."

Another possibility is to use .+, which consumes the whole input string and cannot match a second time. It has the drawback of not matching an empty string, though.
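The two options above can be compared side by side, including the empty-input case. This is a sketch under assumptions: the class `SingleReplace` and its method names are hypothetical, and the third variant (limiting the replacement count with the instance overload `Regex.Replace(input, replacement, count)`) is an alternative not used in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SingleReplace
{
    // Anchored pattern: after the first match no further match can start at position 0.
    public static string Anchored(string input, string replacement)
    {
        return Regex.Replace(input, "^.*$", replacement);
    }

    // ".+" needs at least one character, so it never matches the empty tail.
    public static string OneOrMore(string input, string replacement)
    {
        return Regex.Replace(input, ".+", replacement);
    }

    // Alternative (not from the answer): keep ".*" but replace only the first match.
    public static string FirstOnly(string input, string replacement)
    {
        return new Regex(".*").Replace(input, replacement, 1);
    }

    public static void Main()
    {
        Console.WriteLine("[{0}]", Anchored("sometext", "X"));  // [X]
        Console.WriteLine("[{0}]", OneOrMore("sometext", "X")); // [X]
        Console.WriteLine("[{0}]", FirstOnly("sometext", "X")); // [X]

        // Empty input: only ".+" leaves the string untouched.
        Console.WriteLine("[{0}]", Anchored("", "X"));          // [X]
        Console.WriteLine("[{0}]", OneOrMore("", "X"));         // []
        Console.WriteLine("[{0}]", FirstOnly("", "X"));         // [X]
    }
}
```

Note that all three assume the input contains no line breaks, since . does not match a newline by default.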

trapicki