
Why do I get two matches when using the regular expression .* on the string abcd 1234 abcd? See https://regex101.com/r/rV8jfz/1.

From the explanation given by regex101, I can see that the second match happened at position 14-14 and the value matched is null. But why is a second match done? Is there a way that I can avoid the second match?

I understand .* means zero or more of any character, so it's also trying to find zero occurrences. But I don't understand why this null match is required.
The problem is that in a programming language (e.g. Java), when I do while(matcher.find()) { ... }, the loop runs twice, whereas I want it to run only once.

I know this may not be a real-world matching situation, but I see it as a good case to study in order to understand and explore regex.

Edit, following @terdon's response: I would like to keep the /g option in regex101; I am aware of what it does. I would like to know the total number of possible matches.
For example, https://regex101.com/r/EvOoAr/1 -> the pattern abcd against the string abcd 1234 abcd gives two matches, and I want to know this information.

The problem I find is when dealing with this in a language like Java:
Ref - https://onecompiler.com/java/3xnax494k

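  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class Main {
    public static void main(String[] args) {
      String str = "abcd 1234 abcd";
      Pattern p = Pattern.compile(".*");
      Matcher matcher = p.matcher(str);
      int matchCount = 0;
      // find() reports the whole string first, then an empty match at the end
      while (matcher.find()) {
        matchCount++;
        System.out.println("match number: " + matchCount);
        System.out.println("matcher.groupCount(): " + matcher.groupCount());
        System.out.println("matcher.group(): " + matcher.group());
      }
    }
  }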

The output is -

match number: 1
matcher.groupCount(): 0  //you can ignore this
matcher.group(): abcd 1234 abcd
match number: 2
matcher.groupCount(): 0
matcher.group():  //this is my concern. The program has to deal with this empty match somehow.

It would be nice for me as a programmer if find() did not match against "nothing". As it stands, I have to add additional code in the loop to catch this "nothing" case.

This empty-match problem (in code) gets even worse with this regex case - https://regex101.com/r/5HuJ0R/1 -> [0-9]* against abcd 1234 abcd gives 12 matches.
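To make the "additional code in the loop" concrete, here is a minimal sketch of one way to separate empty matches from real ones. The class name EmptyMatchCount and the helpers countAll and countNonEmpty are hypothetical names made up for this illustration; only Pattern, Matcher, find(), start() and end() come from java.util.regex. It is one possible check, not the only way to handle the case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class EmptyMatchCount {
    // Counts every match that find() reports, including zero-length ones.
    static int countAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Counts only the matches that consumed at least one character.
    static int countNonEmpty(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            if (m.end() > m.start()) { // zero-length matches have start == end
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String str = "abcd 1234 abcd";
        System.out.println("all matches:       " + countAll("[0-9]*", str));
        System.out.println("non-empty matches: " + countNonEmpty("[0-9]*", str));
    }
}
```

The comparison of start() and end() is the whole trick: an empty match begins and ends at the same index, so skipping those leaves only the matches that actually contain text.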

pppery
samshers
  • Can you explain what you want to match? What would the loop be doing? If you're trying to loop over all characters, you would use `.` not `.*`. Also, in the regex101 link you have given, you are using the `g` (global) modifier. Why do you want that? This is what is causing the multiple matches. Is this actually something you want? – terdon Dec 24 '21 at 15:44
  • @terdon - you are right about the `/g` option. But i want it. I will add some more edits in a while. – samshers Dec 24 '21 at 15:57
  • @terdon, edited further, describing the issues i see. – samshers Dec 24 '21 at 16:16
  • I don't speak Java, but there has to be a way to turn off global matching. You would never want to use global matching with a regex like `.*` which, by definition, consumes the entire string. If not, this should be considered a bug in Java. – terdon Dec 24 '21 at 22:30
  • i see java doc/tutorial has mentioned the same [Zero-Length Matches](https://docs.oracle.com/javase/tutorial/essential/regex/quant.html). @terdon - this is how it is. – samshers Dec 26 '21 at 08:39
  • I'm pretty sure there's a duplicate somewhere. Let me go hunting … → https://stackoverflow.com/questions/61263151/why-is-asdf-replace-g-x-xx/61270591, https://stackoverflow.com/questions/31701862/js-regex-quantifiers-and-global-flag-outputs-empty-string-as-the-last-element-in – knittl Feb 15 '22 at 19:17

1 Answer


The reason you get two matches is that you are using the g (global) flag. If you remove it from your regex101 example, you will get only one match.

This happens because the global flag makes the regex engine find as many matches in the string as possible. Since .* matches zero or more characters, it also matches nothing, i.e. the empty string. Therefore, the first match is the entire string, and the second match is the empty string left at the very end, after the first match has consumed everything. Removing the g makes the engine stop at the first match, the entire string, and not look for others:

[Screenshot: the regex101 page with the relevant options indicated]
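For the Java side of the question, the same distinction can be sketched roughly as follows: a single find() call behaves like a match without g, while a while(matcher.find()) loop behaves like a match with g. The class FirstMatchOnly and the helpers firstMatch and allMatches are hypothetical names used only for this illustration, and the mapping to regex101's g flag is an analogy rather than an exact equivalence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class FirstMatchOnly {
    // One find() call: stop after the first match, like matching without g.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Repeated find() calls: collect every match, like matching with g.
    static List<String> allMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "abcd 1234 abcd";
        System.out.println("first match only: [" + firstMatch(".*", str) + "]");
        System.out.println("matches for .*  : " + allMatches(".*", str).size());
        System.out.println("matches for .+  : " + allMatches(".+", str).size());
    }
}
```

If the loop has to stay, changing the quantifier from * to + is another option worth considering: .+ requires at least one character, so it cannot match the empty string at the end of the input.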

terdon
  • I would like to keep the `g` option in regex101. But for your other info, +1. I see this nothing match will need additional handling when dealing with it from a programming language. – samshers Dec 24 '21 at 16:23
  • @samshers Also note that the website is for testing regular expressions in specific languages, none of which uses standard POSIX regular expressions. Even Perl does not count `/.*/g` as matching twice. The command `perl -ne 'print scalar (/.*/gm)'` will read from standard input and print the number of times the expression matches. You will not be able to get it to print anything other than `1`. Likewise you will not be able to get `sed 's/.*/*/g'` to print anything other than a single `*` per input line. The web site is broken. Don't use it. At least not as an authoritative reference. –  Dec 24 '21 at 18:08
  • @samshers I don't understand. You would never use `g` in a real program with `.*`, that makes no sense. So why would you insist on keeping it on regex101 now that you know that this is only causing you problems and doesn't add anything useful? And no, you don't need additional handling when using this in a program since you won't be doing `while (/.*/)`. What would be the point of that? The `.*` will always consume the entire string in one go, and will also match on empty strings, so using it in a `while` won't do anything useful. – terdon Dec 24 '21 at 22:28