Use of \K and lookahead not working as expected

Question

I didn't find any mention of \K in https://ruby-doc.org/core-2.5.0/Regexp.html, but it seems to be implemented (see "Support of \K in regex"). The version I have is 2.5.0p0.

However, it is not working as I expected (based on the behavior of perl) for this example:

$ # expected behavior, replace empty fields with NA where comma is separator
$ echo ',a,,,b,' | ruby -pe 'gsub(/(?<=^|,)(?=,|$)/, "NA")'
NA,a,NA,NA,b,NA
$ # why a,,,b is not changing to a,NA,NA,b here?
$ echo ',a,,,b,' | ruby -pe 'gsub(/(^|,)\K(?=,|$)/, "NA")'
NA,a,NA,,b,NA

$ # reference from perl, where ^|, is considered as variable length
$ echo ',a,,,b,' | perl -pe 's/(^|,)\K(?=,|$)/NA/g'
NA,a,NA,NA,b,NA
$ echo ',a,,,b,' | perl -pe 's/(?<=^|,)(?=,|$)/NA/g'
Variable length lookbehind not implemented in regex m/(?<=^|,)(?=,|$)/ at -e line 1

Note: I am specifically looking to understand \K and lookarounds in Ruby, not looking for other ways to solve this problem, for example:

$ echo ',a,,,b,' | ruby -lne 'print $_.split(",",-1).map { |s| s=="" ? "NA" : s }.join","'
NA,a,NA,NA,b,NA

score 1 · Accepted Answer · answered Jan 22 '18 at 08:56


The (?<=^|,)(?=,|$) pattern matches like this: the first match is at the start of the string, since that position is followed by a ,; the second match is between the second and the third comma; after the position following the second comma is checked, the position following the third comma is checked, and the third match is found there; the last match is at the end of the string, as expected, since that position is preceded by a , and followed by $ (end of string).
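
The positions listed above can be checked directly. This is a minimal sketch, assuming the same input string as in the question (without the trailing newline that echo adds); the variable names are only illustrative.

```ruby
str = ',a,,,b,'

# Record the offset of every (empty) match of the lookbehind/lookahead pattern.
# Inside the block, $~ holds the MatchData of the current match.
offsets = []
str.scan(/(?<=^|,)(?=,|$)/) { offsets << $~.begin(0) }

p offsets  # => [0, 3, 4, 7]
```

Offset 0 is the start of the string, 3 and 4 are the two empty fields in the middle, and 7 is the end of the string.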

The behavior of the (^|,)\K(?=,|$) pattern differs between Ruby (Onigmo regex engine) and PCRE; you can easily check this at regex101.com. In PCRE, the pattern matches the empty string at the location right after the third comma. The Onigmo regex engine cannot match there, because the regex index is moved/set "manually" to skip the currently tested char whenever the match is an empty string. Here, \K makes the second match empty: the second , is matched and consumed, the matched text is then omitted, and the reported match is the empty string after it. The regex engine is therefore forced to jump to the location after the third comma, so the third comma is never tried as the start of a match. That means there is no way for the (^|,)\K(?=,|$) pattern to match between the third and fourth commas (the empty field just before ,b).
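
The difference can be emulated with a small loop around Regexp#match(str, pos). This is only a sketch of the behavior described above, not code from Onigmo or PCRE. The helper match_offsets and the two policy names are hypothetical. :skip_one_char resumes one character past an empty match, as described for Ruby. :retry_same_position is an assumed alternative that resumes at the end of the match and only rejects a repeated empty match at that same spot, which reproduces the result reported for PCRE/Perl.

```ruby
# Collect match start offsets under two illustrative "resume" policies.
# Both policies are assumptions made for this sketch, not engine source code.
def match_offsets(re, str, policy)
  offsets = []
  pos = 0
  last_end = nil
  while pos <= str.length
    m = re.match(str, pos)
    break unless m
    empty = m.begin(0) == m.end(0)
    if policy == :retry_same_position && empty && m.end(0) == last_end
      # Same empty match as before: move the start one character and retry.
      pos += 1
      next
    end
    offsets << m.begin(0)
    last_end = m.end(0)
    # :skip_one_char jumps past the next character after an empty match.
    pos = empty && policy == :skip_one_char ? m.end(0) + 1 : m.end(0)
  end
  offsets
end

str = ',a,,,b,'
with_k          = /(^|,)\K(?=,|$)/
with_lookbehind = /(?<=^|,)(?=,|$)/

p match_offsets(with_k, str, :skip_one_char)           # => [0, 3, 7]
p match_offsets(with_k, str, :retry_same_position)     # => [0, 3, 4, 7]
p match_offsets(with_lookbehind, str, :skip_one_char)  # => [0, 3, 4, 7]
```

With :skip_one_char, the search after the empty match at offset 3 restarts at offset 4, so the comma at offset 3 can no longer be consumed before \K. The lookbehind version does not need to consume that comma, so it still finds offset 4.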

Wiktor Stribiżew

thanks, if I am understanding this correctly, Onigmo regex engine will give different results for `/,\K(?=,)/` and `/(?<=,)(?=,)/` ... what is your opinion - should `/,\K(?=,)/` and `/(?<=,)(?=,)/` be the same or different? or is that left to the particular regex engine implementation? – Sundeep Jan 22 '18 at 09:24

@Sundeep If you ask me, PCRE implementation is the correct one, but both have their advantages and disadvantages in different scenarios, and I also think PCRE implementation is a bit more costly (although more "correct"). – Wiktor Stribiżew Jan 22 '18 at 11:12
thanks again.. do you have any official-sort of link explaining that this is the expected behavior? if not I'll probably open an issue on the ruby site and ask... I feel it is a bug.. I experimented some more and even simple cases would have to be written as `gsub(/xyz/, "\\0NA")` instead of `gsub(/xyz\K/, "NA")`.. perl and vim(\zs) behave the same as each other.. just checked, python regex module \K is the same as ruby – Sundeep Jan 22 '18 at 11:26

@Sundeep I would not call it a bug, it is more like an "undefined behavior". Only use `\K` when you have non-overlapping matches with a non-empty string as a return value, else, whenever you need to get empty matches, use the lookaround. – Wiktor Stribiżew Jan 22 '18 at 12:25
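
The rule of thumb in the last comment can be tried with the xyz example mentioned earlier in the comments. This sketch assumes an input in which two matches are adjacent, since that is where an empty \K match and a lookbehind can diverge. The \K line is printed without an expected result because its output depends on the engine behavior discussed in the answer.

```ruby
str = 'xyzxyz'

# Empty matches expressed with a lookbehind: every position preceded by "xyz".
p str.gsub(/(?<=xyz)/, 'NA')  # => "xyzNAxyzNA"

# Non-empty match, re-inserted through the whole-match backreference \0.
p str.gsub(/xyz/, '\0NA')     # => "xyzNAxyzNA"

# \K turns the match into an empty one; compare its output with the two above.
p str.gsub(/xyz\K/, 'NA')
```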
