3

I'm trying to understand regex as much as I can, so I came up with this regex-based solution to codingbat.com repeatEnd:

Given a string and an int N, return a string made of N repetitions of the last N characters of the string. You may assume that N is between 0 and the length of the string, inclusive.

public String repeatEnd(String str, int N) {
  return str.replaceAll(
    ".(?!.{N})(?=.*(?<=(.{N})))|."
      .replace("N", Integer.toString(N)),
    "$1"
  );
}

Explanation on its parts:

  • .(?!.{N}): asserts that the matched character is one of the last N characters, by making sure that there aren't N characters following it.
  • (?=.*(?<=(.{N}))): in which case, use lookforward to first go all the way to the end of the string, then a nested lookbehind to capture the last N characters into \1. Note that this assertion will always be true.

  • |.: if the first assertion failed (i.e. there are at least N characters ahead) then match the character anyway; \1 would be empty.

  • In either case, a character is always matched; replace it with \1.

My questions are:

  • Is this technique of nested assertions valid? (i.e. looking behind during a lookahead?)
  • Is there a simpler regex-based solution?

Bonus question

Do repeatBegin (as analogously defined).

I'm honestly having troubles with this one!

davidism
  • 121,510
  • 29
  • 395
  • 339
polygenelubricants
  • 376,812
  • 128
  • 561
  • 623

3 Answers3

3

Nice one! I don't see a way to significantly improve on that regex, although I would refactor it to avoid the needless use of negative logic:

".(?=.{N})|.(?=.*(?<=(.{N})))"

This way the second alternative is never entered until you reach the final N characters, which I think makes the intent a little clearer.

I've never seen a reference that says it's okay to nest lookarounds, but like Bart, I don't see why it wouldn't be. I sometimes use lookaheads inside lookbehinds to get around limitations on variable-length lookbehind expressions.


EDIT: I just realized I can simplify the regex quite a bit by putting the alternation inside the lookahead:

".(?=.{N}|.*(?<=(.{N})))"

By the way, have you considered using format() to build the regex instead of replace()?

return str.replaceAll(
  String.format(".(?=.{%1$d}|.*(?<=(.{%1$d})))", N),
  "$1"
);
Alan Moore
  • 73,866
  • 12
  • 100
  • 156
  • +1! Good job! Do check out my answer, though: it uses only 2 assertions, `?=` and `?<=`. Also check out my other regex/codingbat question (`wordEnds`). It just got me a Tumbleweed. – polygenelubricants Apr 09 '10 at 12:23
  • @Alan: I have used `format`, but I think I prefer `replace` because it makes the regex part more readable, and almost an independent entity; with `format`, you can't just pass around the regex without specifying the arguments to `format` since their order matters. You can also see multiple occurrences more easily (e.g. if there are multiple `N`, `M`). Using `replace` also puts the before->after i.e. var->value mapping side by side, which I prefer. – polygenelubricants Apr 14 '10 at 04:47
1

Whoa, that's some scary regex voodoo there! : )

  • Is this technique of nested assertions valid? (i.e. looking behind during a lookahead?)

Yes, that is perfectly valid in most PCRE implementations I know of.

  • Is there a simpler regex-based solution?

I didn't spend too much time on it, but I don't quickly see how that could be simplified or shortened with a single regex replacement.

Bart Kiers
  • 166,582
  • 36
  • 299
  • 288
0

Is there a simpler regex-based solution?

It took me a while, but eventually I managed to simplify the regex to:

"(?=.{0,N}$(?<=(.{N}))).|." // repeatEnd
          -or-
".(?<=^(?=(.{N})).{0,N})|." // repeatBegin

Like Alan Moore's answer, this removes the negative assertion, but doesn't even replace it with a positive one, so it now only has 2 assertions instead of 3.

I also like the fact that the "else" case is just a simple .. I prefer to put the bulk of my regex into the "working" side of the alternation, and keep the "non-working" side as simple as possible (usually a simple . or .*).

polygenelubricants
  • 376,812
  • 128
  • 561
  • 623
  • 1
    When I mentioned needless use of negative logic, I wasn't referring to the negative lookahead *per se*, but to the contortions you went through to get that lone `.` all by itself in the second alternative. That may seem clearer to you, but to my eye it has the opposite effect. – Alan Moore Apr 10 '10 at 01:37
  • It's definitely a matter of personal preference. I'd always prefer `if(!a && b) X else Y` to `if (a) Y else if (b) X` when they're interchangeable (which they aren't always). In any case, I'd appreciate your input on `repeatBegin` I mentioned in the update to the question; I'm honestly having a hard time with it. I have a solution that works, but it's quite different from `repeatEnd`. – polygenelubricants Apr 10 '10 at 03:44
  • `"(?<=.{N}|^(?=(.{N})).{0,N})."` seems to work. It's essentially the mirror image of my `repeatEnd` regex, except I had to use `{0,N}` instead of `*` to satisfy Java's "odious maximum length" requirement, plus `^` to force the lookbehind to go all the way back. – Alan Moore Apr 10 '10 at 11:46