Why is Ruby's string.sub() not greedy?

Question

How is the following behavior explained? (running with Ruby 2.4.2)

> "hello\r\n".sub(/e*/, "")
 => "hello\r\n" 

> "hello\r\n".sub(/h*/, "")
 => "ello\r\n" 

> "hello\r\n".sub(/e+/, "")
 => "hllo\r\n" 

> "hello\r\n".sub(/(\r|\n)*/, "")
 => "hello\r\n" 

> "hello\r\n".sub(/(\r|\n)+/, "")
 => "hello"

For (1), why is the e not matched and replaced by "", whereas in (2) the h is? And why is it matched when the pattern is e+? (So is e* "non-greedy"? Isn't it greedy by default?)

It is similar for the 4th and 5th cases. I know I can use gsub, but how is the behavior of sub explained?

Have you read the docs? For me, it seems to explain it pretty well: https://ruby-doc.org/core-2.4.0/String.html#method-i-sub — Sweeper, Jan 19 '20 at 09:07

score 1 · Accepted Answer · answered Jan 19 '20 at 09:12


According to the docs:

Returns a copy of str with the first occurrence of pattern replaced by the second argument.

The keyword here is "first". If I match hello\r\n against e*, what's my first match gonna be? A 0-width match at position 0, isn't it? Yes, e* will greedily match the e in hello, but that's not the first match. It needs to match all the 0-width matches before that first.

On the other hand, e+ can't match any 0-width matches, so the first match is the match you expect.

For h*, the first match is the letter h because it is the first letter in the word! There are no 0-width matches before it.

The same logic applies to the other cases as well.
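
One way to check this reasoning is to ask Ruby where the first match starts and what text it covers, using `String#match` and `MatchData#begin`. This is only a small sketch; the helper name `first_match` is made up for illustration and is not part of Ruby.

```ruby
s = "hello\r\n"

# Hypothetical helper: returns [start_offset, matched_text] for the first
# match of pattern in str, or nil if there is no match at all.
def first_match(str, pattern)
  m = str.match(pattern)
  m && [m.begin(0), m[0]]
end

first_e_star    = first_match(s, /e*/)        # [0, ""]      empty match at offset 0
first_h_star    = first_match(s, /h*/)        # [0, "h"]     greedy match at offset 0
first_e_plus    = first_match(s, /e+/)        # [1, "e"]     cannot match empty, so it moves on
first_crlf_star = first_match(s, /(\r|\n)*/)  # [0, ""]      empty match at offset 0
first_crlf_plus = first_match(s, /(\r|\n)+/)  # [5, "\r\n"]  first non-empty match

p first_e_star, first_h_star, first_e_plus, first_crlf_star, first_crlf_plus
```

In each `*` case the engine succeeds at offset 0, so `sub` replaces whatever was matched there: nothing for `e*` and `(\r|\n)*`, and the `h` for `h*`.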

answered Jan 19 '20 at 09:12

Sweeper


No, sorry, please disregard the crap I posted above. Everything is perfectly correct. – Aleksei Matiushkin Jan 19 '20 at 09:15
"For `h*`, the first match is the letter h because it is the first letter in the word! There are no 0-width matches before it." - You state this as if it was obvious, but actually, for *regular expressions* (the mathematical concept on which `Regexp` is based), there are *infinitely many* empty strings at the beginning, at the end, and in between every character, so it is understandable that one might get confused by the fact that this is not true for `Regexp`s. The main takeaway is, I guess: even though `Regexp` are based on regular expressions and some people even call them "regular – Jörg W Mittag Jan 19 '20 at 09:22
… expressions", that does by no means mean that they *behave like* regular expressions. – Jörg W Mittag Jan 19 '20 at 09:24
@JörgWMittag Oh I see. I didn't really study regex the math concept, so that's something new for me. Thanks. – Sweeper Jan 19 '20 at 09:28
the "nothing" would be considered a match... nothing has slipped my mind... – nonopolarity Jan 19 '20 at 09:38
@JörgWMittag that's true... if nothing can be considered to be a match... there are infinitely many matches really, before it can get to `e`... but I guess once it matched nothing, then it advances 1 position anyway: `"hello".scan(/e*/) => ["", "e", "", "", "", ""]` – nonopolarity Jan 19 '20 at 09:58
@JörgWMittag While you're right to correct his assertion about there not being any zero-width matches before the `h`, this is just a mistake he's made, it has nothing to do with a difference between Ruby and classical regular expressions (there _are_ many such differences, but I don't think this is one of them). – philomory Jan 20 '20 at 23:34
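
The `scan` observation in the comments can be reproduced directly, and it also shows why `gsub` behaves the way the question expects: it replaces every match, including the later non-empty one, while `sub` stops after the first (empty) match. A minimal sketch with illustrative variable names:

```ruby
word = "hello"

# Every match of e* in order: an empty match before "h", then "e",
# then an empty match at each remaining position, including the end.
all_matches = word.scan(/e*/)

# gsub replaces all of those matches; sub replaces only the first one,
# which is the empty match at offset 0.
with_gsub = "hello\r\n".gsub(/e*/, "")
with_sub  = "hello\r\n".sub(/e*/, "")

p all_matches  # ["", "e", "", "", "", ""]
p with_gsub    # "hllo\r\n"
p with_sub     # "hello\r\n"
```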
