2

I got this result from a regexp in Ruby.

First of all, this is not the same question: the question itself is different, the answer is different, and the discussion in the comments makes a difference as well.

In the first result, it looks like .* matches nothing after matching the whole 'hello'.

But why does it happen?

[53] pry(main)> "hello".gsub(/.*/, "abc")
=> "abcabc"
[54] pry(main)> "hello".gsub(/^.*$/, "abc")
=> "abc"
Hidehiro NAGAOKA
  • Because you are matching 0 or more of anything and replacing it with abc for the second line. The first line looks to be replacing 'hel' with abc and the 'lo' with abc. – JayRizzo Jul 08 '19 at 02:30
  • By adding the ^ you are signifying that you want to replace the entirety of the line to abc. – JayRizzo Jul 08 '19 at 02:31
  • 1
    Why does it split into 'hel' and 'lo'? – Hidehiro NAGAOKA Jul 08 '19 at 02:48
  • 3
    @JayRizzo `"hello".gsub(/.*/) { puts $&.inspect; "abc" }` says otherwise. – mu is too short Jul 08 '19 at 03:00
  • 1
    Interestingly enough, you only need to anchor it at the beginning to get it to make "sense": `"hello".gsub(/^.*/, "abc")` or `"hello".gsub(/\A.*/, "abc")`. Anchoring at the end with `$`, `\z`, or `\Z` does nothing. Using `gsub(/.+/)` produces the expected result of course. Presumably `.*` is matching the "nothing" at the end of the string since `*` means "zero or more" – mu is too short Jul 08 '19 at 03:06
  • 1
    Not an answer, but here's some data. At rubular.com there is one match, the entire string. At regex101.com, with no options, there is one match. Adding the "global" `/g` option ("Don't return after first match") there are two matches, one an "empty match". `/g` is not supported in Ruby, however. – Cary Swoveland Jul 08 '19 at 03:11
  • 1
    @JayRizzo, let's have your reasoning for the second sentence of your first comment. Guesses and opinions are of no value. – Cary Swoveland Jul 08 '19 at 03:19
  • 1
    Another data point: `"hello".scan /.*/ #=> ["hello", ""]`. – Cary Swoveland Jul 08 '19 at 03:41
  • @CarySwoveland thx! I got the same result on the same website. Does 'g' in regex101 mean 'greedy'? A regexp is greedy by default, so that would make sense. But even if it's greedy, does 'matching nothing after matching the whole string' make sense? – Hidehiro NAGAOKA Jul 08 '19 at 03:48
  • 2
    Hidehiro, see @matiska's answer to [this SO question](https://stackoverflow.com/questions/12993629/what-is-the-meaning-of-the-g-flag-in-regular-expressions). – Cary Swoveland Jul 08 '19 at 03:54
  • @WiktorStribiżew, please have a look at this. – Cary Swoveland Jul 08 '19 at 04:01
  • 1
    @CarySwoveland: Rubular does show you the second match if you toggle "Show invisibles" checkmark. (It looks like a space, but it's not - use Inspect on it to see it's two different `` elements, the second of which is empty.) – Amadan Jul 08 '19 at 04:12
  • @CarySwoveland you are correct, I misspoke and had a misunderstanding of it. The reason for not duplicating is because of the overlap of the pattern, is my understanding. Cheers! – JayRizzo Sep 11 '19 at 05:54

2 Answers

4

The important bit is that a regexp can never match twice at the same position, and matches cannot overlap. Furthermore, note that there are six possible positions involved in "hello": one at the start of each character, and one at the very end (see fenceposting).

When you start searching for /.*/, there's a match at position 0, and it takes up five characters. This disqualifies positions 0, 1, 2, 3 and 4 from further matches (as they are part of the first match).

The second match starts matching at position 5, and finds a match for "0 or more characters" - namely, 0 characters. The position 5 is not contained in the first match, and so not disqualified by the "no overlap" rule.
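The two matches and their positions can be made visible with a minimal sketch like the one below. It assumes the block form of String#scan, which exposes the match data of each match inside the block; the variable name offsets is only illustrative.

```ruby
# Record the [start, end] offsets of every match of /.*/ in "hello".
offsets = []
"hello".scan(/.*/) { offsets << Regexp.last_match.offset(0) }

p offsets # prints [[0, 5], [5, 5]]
```

The first pair covers the five characters; the second pair is the empty match sitting at position 5.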


When you anchor the start with /^.*/, the position 5 becomes ineligible, as it is not the start.

When you anchor the end with /.*$/, both position 0 and position 5 will detect that after their 5-character or 0-character match respectively they are at the end of the search string, and thus you still get both matches.

When you change the regexp to "1 or more characters" with /.+/, position 5 is again ineligible, because there are no more characters to match but at least 1 is required.
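As a rough side-by-side check of the cases above, the sketch below uses String#scan to list the matches each pattern produces. The variable names are hypothetical and only there for readability.

```ruby
text = "hello"

unanchored   = text.scan(/.*/)  # matches at position 0 and at position 5
start_anchor = text.scan(/^.*/) # position 5 is not the start, so one match
end_anchor   = text.scan(/.*$/) # both matches end at the end of the string
one_or_more  = text.scan(/.+/)  # an empty match is not allowed, so one match

p unanchored   # prints ["hello", ""]
p start_anchor # prints ["hello"]
p end_anchor   # prints ["hello", ""]
p one_or_more  # prints ["hello"]
```

Each element of a result corresponds to one replacement that gsub would perform with the same pattern.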


Note also that it is not just Ruby; the same behaviour is found in all the engines I tested. Python's sub is a bit inconsistent (before Python 3.7 it did not replace an empty match adjacent to a previous match), but findall reports the same two matches:

re.findall('.*', 'hello') # => ['hello', '']

JavaScript works just like Ruby:

"hello".replace(/.*/g, "abc") // => "abcabc"

As does Java:

"hello".replaceAll(".*", "abc") // => "abcabc"

And even PHP (using PREG):

preg_replace('/.*/', 'abc', 'hello'); # => "abcabc"
Amadan
1

This is because the regex engine does not go back: once it has matched some text, it never goes back inside the matched text, i.e. matches won't overlap.

You used the * quantifier, which is greedy, so it matches as much as possible. If you used *? instead, you'd get a match at every position in the string, because ? makes the quantifier non-greedy, so it matches as little as possible. Since * means zero or more characters, those would all be 0-length matches.
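A small sketch of the non-greedy case in Ruby, with illustrative variable names: the lazy quantifier settles for an empty match at each of the six positions in "hello".

```ruby
# /.*?/ prefers the shortest match, which is always the empty string here.
lazy_matches = "hello".scan(/.*?/)
p lazy_matches # prints ["", "", "", "", "", ""]

# gsub therefore inserts the replacement at every position.
replaced = "hello".gsub(/.*?/, "-")
p replaced # prints "-h-e-l-l-o-"
```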

Michał Turczyn