In the "Advanced Regular Expressions" chapter in Mastering Perl, I have a broken example for which I can't figure out a nice fix. The example is perhaps trying to be too clever for its own good, but maybe someone can fix it for me. There could be a free copy of the book in it for working fixes. :)

In the section talking about lookarounds, I wanted to use a negative lookbehind to implement a commifying routine for numbers with fractional portions. The point was to use a negative lookbehind because that was the topic.

I stupidly did this:

$_ = '$1234.5678';
s/(?<!\.\d)(?<=\d)(?=(?:\d\d\d)+\b)/,/g;  # $1,234.5678

The (?<!\.\d) asserts that the bit before the (?=(?:\d\d\d)+\b) is not a decimal point and a digit.

The stupid thing is not trying hard enough to break it. By adding another digit to the end, there is now a group of three digits not preceded by a decimal point and a digit:

$_ = '$1234.56789';
s/(?<!\.\d)(?<=\d)(?=(?:\d\d\d)+\b)/,/g;  # $1,234.56,789
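
To make the failure easy to reproduce, here is a minimal sketch that runs the same substitution on both strings. The helper name `commify_fixed_lookbehind` is made up for this illustration; the pattern is the one shown above, unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper name; the pattern is copied from the question.
sub commify_fixed_lookbehind {
    my ($text) = @_;
    # (?<!\.\d) only inspects the two characters before the position,
    # so it cannot see a decimal point that is further back.
    $text =~ s/(?<!\.\d)(?<=\d)(?=(?:\d\d\d)+\b)/,/g;
    return $text;
}

for my $input ('$1234.5678', '$1234.56789') {
    printf "%-12s => %s\n", $input, commify_fixed_lookbehind($input);
}
# $1234.5678   => $1,234.5678
# $1234.56789  => $1,234.56,789
```

In the second string the position before `789` is preceded by `56`, not by a decimal point and a digit, so the fixed-width lookbehind lets the comma through.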

If lookbehinds could be variable width in Perl, this would have been really easy. But they can't.

Note that it's easy to do this without a negative lookbehind, but that's not the point of the example. Is there a way to salvage this example?

Alan Moore
brian d foy
  • FYI, You left out the `\b` to `(?:\d\d\d)+\b)` when you posted this, so I added it. (I checked, and it *is* there in the book.) That's just a distraction, though; it has nothing to do with the lookbehind problem. – Alan Moore Feb 25 '10 at 00:25
  • Ah, yes, thanks. When I copied and pasted that from my email to test it, something converted the \b to a ^B and messed everything up. I forgot to re-add it. – brian d foy Feb 25 '10 at 00:43
  • You might want to consider putting a bounty on this question for extra motivation (although the book is very thoughtful!), since you have the rep to spare. :) It may also get you more eyes, as it's possible to search for questions with active bounties from the front page. – Ether Feb 25 '10 at 01:39
  • @FM: the point of the question is to use `(?<!)`. I'm not looking for ways around it. – brian d foy Feb 25 '10 at 21:08
  • Surprised no one edits the title... Was LOLing at the wordings – SwiftMango May 17 '15 at 17:44

3 Answers

I don't think it's possible without some form of variable-width look-behind. The addition of the \K assertion in 5.10 provides a way of faking variable-width positive look-behind. What we really need is variable-width negative look-behind but with a little creativity and a lot of ugliness we can make it work:

use 5.010;
$_ = '$1234567890.123456789';
s/(?<!\.)(?:\b|\G)\d+?\K(?=(?:\d\d\d)+\b)/,/g;
say;  # $1,234,567,890.123456789

If there was ever a pattern that begged for the /x notation it's this one:

s/
  (?<!\.)        # Negative look-behind assertion; we don't want to match
                 # digits that come after the decimal point.

  (?:            # Begin a non-capturing group; the contents anchor the \d
                 # which follows so that the assertion above is applied at
                 # the correct position.

    \b           # Either a word boundary (the beginning of the number)...

    |            # or (because \b won't match at subsequent positions where
                 # a comma should go)...

    \G           # the position where the previous match left off.

  )              # End anchor grouping

  \d+?           # One or more digits, non-greedily so the match proceeds
                 # from left to right. A greedy match would proceed from
                 # right to left, the \G above wouldn't work, and only the
                 # rightmost comma would get placed.

  \K             # Keep the preceding stuff; used to fake variable-width
                 # look-behind

                 # <- This is what we match! (i.e. a position, no text)

  (?=            # Begin a positive look-ahead assertion

    (?:\d\d\d)+  # A multiple of three digits (3, 6, 9, etc.)

    \b           # A word (digit) boundary to anchor the triples at the
                 # end of the number.

  )              # End positive look-ahead assertion.
/,/xg;
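
As a quick check of the `\K` approach, the sketch below wraps the one-line form of the substitution in a helper and runs it on the test string above plus the two strings from the question. The name `commify_with_k` is hypothetical; the pattern itself is unchanged, and `\K` requires Perl 5.10 or later.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper name; the pattern is the one-line form shown above.
sub commify_with_k {
    my ($text) = @_;
    # \b starts the scan at the front of the integer part, \G resumes it
    # where the previous comma was placed, and (?<!\.) refuses to start
    # right after the decimal point.
    $text =~ s/(?<!\.)(?:\b|\G)\d+?\K(?=(?:\d\d\d)+\b)/,/g;
    return $text;
}

say commify_with_k('$1234567890.123456789');  # $1,234,567,890.123456789
say commify_with_k('$1234.5678');             # $1,234.5678
say commify_with_k('$1234.56789');            # $1,234.56789
```

The last line is the input that broke the original fixed-width lookbehind; here the fractional digits are left alone because no match can begin after the decimal point.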
Michael Carman
  • Brilliant. I never considered using \K since I wrote _Mastering Perl_ pre-5.10. I think I can make this work, if only for a completely too-clever example where I can point out the extreme difficulty of variable width lookbehinds. The `\G` is quite the bonus too. That's a free book for you. If you already have _Mastering Perl_, tell me which other book I can get you. :) – brian d foy Feb 25 '10 at 23:19
  • As pleased as I am about finding a solution within the constraints of the problem I'm somewhat appalled at my creation, particularly the use of an alternation between zero-width assertions. I needed `use re 'debug'` to figure out that the `\G` was necessary. I did benchmark it just for fun and it's about 10% faster than the FAQ answers. That's probably because it doesn't use captures. I don't have a copy of *Mastering Perl* so that would be great. Hmm... there's no PM system here, but you should be able to reach me via my CPAN author ID (MJCARMAN). – Michael Carman Feb 26 '10 at 02:54
  • For what it's worth, I re-used this example in Mastering Perl, 2nd Edition, but not to show off lookaheads. I used it to illustrate \K :) – brian d foy Apr 27 '14 at 21:02
  • @briandfoy: Cool! I'm glad to have helped. – Michael Carman Apr 28 '14 at 14:58

If you have to post on Stack Overflow asking if somebody can figure out how to do this with negative lookbehind, then it's obviously not a good example of negative lookbehind. You'd be better off thinking up a new example rather than trying to salvage this one.

In that spirit, how about an automatic spelling corrector?

s/(?<![Cc])ei/ie/g; # Put I before E except after C

(Obviously, that's not a hard and fast rule in English, but I think it's a more realistic application of negative lookbehind.)
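
A small sketch of how that substitution behaves on a few words, including one where the rule gives the wrong answer. The helper name `i_before_e` and the word list are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper name; the substitution is the one shown above.
sub i_before_e {
    my ($word) = @_;
    # Swap "ei" to "ie" unless the character just before it is C or c.
    $word =~ s/(?<![Cc])ei/ie/g;
    return $word;
}

for my $word (qw(freind receive Ceiling weird)) {
    printf "%-8s => %s\n", $word, i_before_e($word);
}
# freind   => friend
# receive  => receive
# Ceiling  => Ceiling
# weird    => wierd
```

The last case shows why this is only a rule of thumb: "weird" is spelled correctly, and the substitution breaks it.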

cjm
  • Yes, I think I'll have to abandon the example, which is too bad. I already have simpler examples like the one that you describe, though. However, I should also say that just because I can't figure it out doesn't mean it's not a good example. The best way to learn anything is to write a book on it. I have learned quite a bit from my technical reviewers. :) – brian d foy Feb 25 '10 at 00:41
  • I just noticed the italicized "you". I think there are many people much smarter and better at Perl than me on Stackoverflow. I'm just here a lot. :) – brian d foy Feb 25 '10 at 00:47
  • @cjm Can you please add examples of the spelling rule you are referring to? I did not know English even *has* spelling rules. .-) Also I think you mean "swap i and e" rather than "put i before e". – Alois Mahdal Feb 26 '13 at 02:30
  • @AloisMahdal, there are [plenty of examples on Wikipedia](https://en.wikipedia.org/wiki/I_before_E_except_after_C). – cjm Feb 26 '13 at 02:44

I don't think this is what you are after (especially because the negative look-behind assertion has been dropped), but I guess your only option is to slurp up the decimal places like in this example:

s/
  (?:
    (?<=\d)
    (?=(?:\d\d\d)+\b)
   |
    ( \d{0,3} \. \d+ )
  )
 / $1 ? $1 : ',' /exg;
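
To see the slurping alternative at work, here is a minimal sketch that wraps the substitution in a helper and runs it on two inputs. The name `commify_slurp` and the second test string are assumptions made for this illustration; the pattern is unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper name; the substitution is the one shown above.
sub commify_slurp {
    my ($text) = @_;
    $text =~ s/
      (?:
        (?<=\d)                # after a digit...
        (?=(?:\d\d\d)+\b)      # ...and before a whole number of triples
       |
        ( \d{0,3} \. \d+ )     # or swallow the decimal point and fraction
      )
     / $1 ? $1 : ',' /exg;     # put the fraction back, else insert a comma
    return $text;
}

print commify_slurp('$1234.56789'), "\n";   # $1,234.56789
print commify_slurp('$1234567.891'), "\n";  # $1,234,567.891
```

Because the second alternative consumes the fractional digits, the scan never gets a chance to place a comma inside them.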

P.S. I think it is a good example when not used as the first one in the book, as it demonstrates some of the pitfalls and limitations of look-around assertions.

willert
  • It is in fact the last example in the book for these things. The problem with this answer, however, is that the `(?<!\.)` doesn't do anything. If you remove it you get the same answer. :) – brian d foy Feb 25 '10 at 10:16