153

I stumbled across a surprising (to me) fact.

console.log("asdf".replace(/.*/g, "x"));

Why two replacements? It seems any non-empty string without newlines will produce exactly two replacements for this pattern. Using a replacement function, I can see that the first replacement is for the entire string, and the second is for an empty string.

j08691
  • 204,283
  • 31
  • 260
  • 272
recursive
  • 83,943
  • 34
  • 151
  • 241
  • Why not three matches? There's also an empty string at the beginning. My question isn't about `/.+/`, since that one makes sense to me. I would think that `/.*/` would match as much as it could the first time, including the empty substring at the end. – recursive Apr 17 '20 at 02:36
  • 10
    more simple example: `"asdf".match(/.*/g)` return [ "asdf", ""] – Narro Apr 17 '20 at 02:51
  • The asterisk matches 0 or more of the preceding token, meaning it can match 0 characters. So it matches "asdf" plus an additional empty string after it. – Celsiuss Apr 17 '20 at 02:57
  • 35
    Because of the global (g) flag. The global flag allows for another search to start at the end of the previous match, thus finding an empty string. – Celsiuss Apr 17 '20 at 03:13
  • 6
    and lets be honest: probably noone wanted exactly that behavior. it was probably an implementation detail of wanting `"aa".replace(/b*/, "b")` to result in `babab`. And at some point we standardized all implementation details of webbrowsers. – Lux Apr 17 '20 at 03:17
  • @Lux this behavior has its origin in perl, where javascript copied this particular flavor of text substitution from. –  Apr 17 '20 at 12:01
  • Impressively, sed disagrees. – Joshua Apr 17 '20 at 22:34
  • 4
    @Joshua older versions of _GNU_ sed (not other implementations!) were also exhibiting this bug, which was fixed somewhere between the 2.05 and 3.01 releases (20+ years ago). I suspect it's there where this behaviour originated, before making its way into perl (where it became a feature) and from there into javascript. –  Apr 18 '20 at 06:32
  • @T.J.Crowder: That fact is less surprising to me. My *personal* surprise has to do with multiple matches being found. – recursive Apr 18 '20 at 14:16
  • 1
    @recursive - Fair enough. I find them both surprising for a second, then realize "zero-width match" and am no longer surprised. :-) – T.J. Crowder Apr 18 '20 at 14:29
  • My comment (without the "b" word) was: @Narro - Even simpler, `"".replace(/.*/g, "x")` => `"x"`. Zero-width matches are a pain. :) – T.J. Crowder Apr 18 '20 at 14:31
  • 1
    OH NOES! My beloved Ruby & Python do the same: `"asdf".gsub(/.*/, 'x')` and `re.sub('.*', 'x', 'asdf')` both return `'xx'`. I cannot complain about this Javascript WTF. :-( – Eric Duminil Apr 18 '20 at 14:47
  • @AleksiTorhamo: Good question, I didn't test it on older Python version. The behavior for empty matches has been changed for Python 3.7. `re.sub('.*', 'x', 'asdf')` outputs `'xx'` on my Python 3.7.6 – Eric Duminil Apr 19 '20 at 19:33

4 Answers4

110

As per the ECMA-262 standard, String.prototype.replace calls RegExp.prototype[@@replace], which says:

11. Repeat, while done is false
  a. Let result be ? RegExpExec(rx, S).
  b. If result is null, set done to true.
  c. Else result is not null,
    i. Append result to the end of results.
    ii. If global is false, set done to true.
    iii. Else,
      1. Let matchStr be ? ToString(? Get(result, "0")).
      2. If matchStr is the empty String, then
        a. Let thisIndex be ? ToLength(? Get(rx, "lastIndex")).
        b. Let nextIndex be AdvanceStringIndex(S, thisIndex, fullUnicode).
        c. Perform ? Set(rx, "lastIndex", nextIndex, true).

where rx is /.*/g and S is 'asdf'.

See 11.c.iii.2.b:

b. Let nextIndex be AdvanceStringIndex(S, thisIndex, fullUnicode).

Therefore in 'asdf'.replace(/.*/g, 'x') it is actually:

  1. result (undefined), results = [], lastIndex = 0
  2. result = 'asdf', results = [ 'asdf' ], lastIndex = 4
  3. result = '', results = [ 'asdf', '' ], lastIndex = 4, AdvanceStringIndex, set lastIndex to 5
  4. result = null, results = [ 'asdf', '' ], return

Therefore there are 2 matches.

John Kugelman
  • 349,597
  • 67
  • 533
  • 578
Alan Liang
  • 1,087
  • 1
  • 8
  • 21
39

Together in an offline chat with yawkat, we found an intuitive way of seeing why "abcd".replace(/.*/g, "x") exactly produces two matches. Note that we haven't checked whether it completely equals the semantics imposed by the ECMAScript standard, hence just take it as a rule of thumb.

Rules of Thumb

  • Consider the matches as a list of tuples (matchStr, matchIndex) in chronological order that indicate which string parts and indices of the input string have already been eaten up.
  • This list is continuously built up starting from the left of the input string for the regex.
  • Parts already eaten up cannot be matched anymore
  • Replacement is done at indices given by matchIndex overwriting the substring matchStr at that position. If matchStr = "", then the "replacement" is effectively insertion.

Formally, the act of matching and replacement is described as a loop as seen in the other answer.

Easy Examples

  1. "abcd".replace(/.*/g, "x") outputs "xx":

    • The match list is [("abcd", 0), ("", 4)]

      Notably, it does not include the following matches one could have thought of for the following reasons:

      • ("a", 0), ("ab", 0): the quantifier * is greedy
      • ("b", 1), ("bc", 1): due to the previous match ("abcd", 0), the strings "b" and "bc" are already eaten up
      • ("", 4), ("", 4) (i.e. twice): the index position 4 is already eaten up by the first apparent match
    • Hence, the replacement string "x" replaces the found match strings exactly at those positions: at position 0 it replaces the string "abcd" and at position 4 it replaces "".

      Here you can see that replacement can act as true replacement of a previous string or just as insertion of a new string.

  2. "abcd".replace(/.*?/g, "x") with a lazy quantifier *? outputs "xaxbxcxdx"

    • The match list is [("", 0), ("", 1), ("", 2), ("", 3), ("", 4)]

      In contrast to the previous example, here ("a", 0), ("ab", 0), ("abc", 0), or even ("abcd", 0) are not included due to the quantifier's laziness that strictly limits it to find the shortest possible match.

    • Since all match strings are empty, no actual replacement occurs, but instead insertions of x at positions 0, 1, 2, 3, and 4.

  3. "abcd".replace(/.+?/g, "x") with a lazy quantifier +? outputs "xxxx"

    • The match list is [("a", 0), ("b", 1), ("c", 2), ("d", 3)]
  4. "abcd".replace(/.{2,}?/g, "x") with a lazy quantifier [2,}? outputs "xx"

    • The match list is [("ab", 0), ("cd", 2)]
  5. "abcd".replace(/.{0}/g, "x") outputs "xaxbxcxdx" by the same logic as in example 2.

Harder Examples

We can consistently exploit the idea of insertion instead of replacement if we just always match an empty string and control the position where such matches happen to our advantage. For example, we can create regular expressions matching the empty string at every even position to insert a character there:

  1. "abcdefgh".replace(/(?<=^(..)*)/g, "_")) with a positive lookbehind (?<=...) outputs "_ab_cd_ef_gh_" (only supported in Chrome so far)

    • The match list is [("", 0), ("", 2), ("", 4), ("", 6), ("", 8)]
  2. "abcdefgh".replace(/(?=(..)*$)/g, "_")) with a positive lookahead (?=...) outputs "_ab_cd_ef_gh_"

    • The match list is [("", 0), ("", 2), ("", 4), ("", 6), ("", 8)]
ComFreek
  • 29,044
  • 18
  • 104
  • 156
  • 4
    I think it's a bit of a stretch to call it intuitive (and in bold, at that). To me it looks more like Stockholm syndrome and post-hoc rationalization. Your answer is good, BTW, I only complain about JS design, or lack of design for that matter. – Eric Duminil Apr 18 '20 at 12:16
  • 8
    @EricDuminil I thought so too at first, but after having written the answer, the sketched global-regex-replace algorithm seems to be exactly the way one would come up with it if one started from scratch. It's like `while (!input not eaten up) { matchAndEat(); }`. Also, [comments above](https://stackoverflow.com/questions/61263151/why-is-asdf-replace-g-x-xx/61270591?noredirect=1#comment108417048_61263151) indicate that the behavior originated from long ago before JavaScript's existence. – ComFreek Apr 18 '20 at 12:58
  • 2
    The part that still doesn’t make sense (for any other reason than “that’s what the standard says”) is that the four-character match `("abcd", 0)` does not eat the position 4 where the following character would go, yet the zero-character match `("", 4)` does eat the position 4 where the following character would go. If I were designing this from scratch, I think the rule I’d use is that `(str2, ix2)` may follow `(str1, ix1)` iff `ix2 >= ix1 + str1.length() && ix2 + str2.length() > ix1 + str1.length()`, which does not cause this misfeature. – Anders Kaseorg Apr 20 '20 at 09:35
  • 2
    @AndersKaseorg `("abcd", 0)` does not eat position 4 becaues `"abcd"` is only 4 characters long and hence just eats indices 0, 1, 2, 3. I can see where your reasoning might come from: why cannot we have `("abcd" ⋅ ε, 0)` as a 5-character-long match where ⋅ is concatenation and `ε` the zero-width match? Formally because `"abcd" ⋅ ε = "abcd"`. I thought about an intuitive reason for the last minutes, but failed to find one. **I guess one must always treat `ε` as just occuring on its own as `""`.** I'd love to play with an alternative implementation without that bug or feat., feel free to share! – ComFreek Apr 20 '20 at 10:43
  • 1
    If the four character string should eat four indices, then the zero character string should eat no indices. Any reasoning you might make about one should equally apply to the other (e.g. `"" ⋅ ε = ""`, although I’m not sure what distinction you intend to draw between `""` and `ε`, which mean the same thing). So the difference cannot be explained as intuitive—it simply is. – Anders Kaseorg Apr 20 '20 at 11:20
31

The first match is obviously "asdf" (Position [0,4]). Because the global flag (g) is set, it continues searching. At this point (Position 4), it finds a second match, an empty string (Position [4,4]).

Remember that * matches zero or more elements.

Toothbrush
  • 2,080
  • 24
  • 33
David SK
  • 814
  • 7
  • 21
  • 4
    So why not three matches? There could be another empty match at the end. There are precisely two. This explanation explains why there *could* be two, but not why there should be instead of one or three. – recursive Apr 17 '20 at 03:32
  • 7
    No, there are not other empty string. Because that empty string has been found. an empty string on the position 4,4, It is detected as a unique result. A match labeled "4,4" cannot be repeated. probably you can think that there are an empty string in the position [0,0] but the * operator returns the maximum possible of elements. this is the reason that only 4,4 is possible – David SK Apr 17 '20 at 03:46
  • 16
    We have to remember that regexes are not regular expressions. In regular expressions, there are infinitely many empty strings in between every two characters, as well as at the beginning and at the end. In regexes, there are exactly as many empty strings as the specification for the particular flavor of regex engine says there are. – Jörg W Mittag Apr 17 '20 at 16:15
  • 7
    This is just post-hoc rationalization. –  Apr 17 '20 at 18:01
  • 9
    @mosvy except that it's the exact logic that's actually used. – hobbs Apr 18 '20 at 01:13
1

simply, the first x is for the replacement of matching asdf.

second x for the empty string after asdf. Search terminates when empty.

Nilanka Manoj
  • 3,527
  • 4
  • 17
  • 48