I'm taking a deep dive into how regexes work and am struggling to understand how to invert a regex that contains backreferences.

For example, let's say I don't want to match words that contain a pair of characters followed by the same pair reversed:

Words that it must exclude:
abba // (ab/ba pair)
smelled // (el/le pair)
trillion // (il/li pair)

I have this regex, which matches words like these:

(((.)(.)).*\4\3)

But how do I go about inverting it? I tried wrapping it in a negative lookahead, but that does not seem to work:

(?!(((.)(.)).*\4\3))
Banana

1 Answer

You could use a negative lookahead with 2 capturing groups. The negative lookahead asserts that a pattern like illi in trillion does not occur anywhere in the word.

This will also rule out trillllon, because group 1 and group 2 can both capture an l, so llll satisfies (\w)(\w)\2\1.

\b(?!\w*(\w)(\w)\2\1)\w+\b
  • \b Word boundary
  • (?! Negative lookahead, assert that what is on the right does not match
    • \w* Match 0+ word chars, so the pair may start anywhere in the word
    • (\w)(\w) 2 consecutive capturing groups, each capturing a single word char
    • \2\1 2 consecutive backreferences in reverse order
  • ) Close the lookahead
  • \w+ Match 1+ word chars
  • \b Word boundary
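
To see the breakdown above in action, here is a minimal sketch using Python's `re` module. The helper name `words_without_mirrored_pair` and the sample sentence are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Pattern from the answer: a whole word that contains no "xyyx" sequence.
NO_MIRRORED_PAIR = re.compile(r"\b(?!\w*(\w)(\w)\2\1)\w+\b")


def words_without_mirrored_pair(text):
    """Return every word in `text` that the pattern accepts (hypothetical helper)."""
    return [m.group(0) for m in NO_MIRRORED_PAIR.finditer(text)]


# abba, smelled and trillion contain bb/a, ll/e and ll/i mirrored pairs,
# so only the remaining words are returned.
print(words_without_mirrored_pair("abba smelled trillion banana regex"))
# → ['banana', 'regex']
```

The lookahead runs once at the start of each word; if it fails there, no other position inside the word satisfies the leading `\b`, so the word is skipped as a whole.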

If you do want to match words like trillllion, you could add another negative lookahead before each backreference, so that the two captured characters must differ.

\b(?!\w*(\w)(\w)(?!\1)\2(?!\2)\1)\w+\b
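
A small comparison of the two patterns, sketched in Python. The names `STRICT`, `RELAXED` and `accepts` are assumptions chosen for this example, not part of the answer.

```python
import re

# First pattern: rejects any "xyyx", including four identical characters.
STRICT = re.compile(r"\b(?!\w*(\w)(\w)\2\1)\w+\b")
# Second pattern: the extra lookaheads only reject "xyyx" when x and y differ.
RELAXED = re.compile(r"\b(?!\w*(\w)(\w)(?!\1)\2(?!\2)\1)\w+\b")


def accepts(pattern, word):
    """True if the whole word matches the pattern (hypothetical helper)."""
    return pattern.fullmatch(word) is not None


for word in ("trillion", "trillllion", "abba", "banana"):
    print(word, accepts(STRICT, word), accepts(RELAXED, word))
```

Here trillllion is the only word on which the two patterns disagree: its llll run trips the first pattern, but it contains no mirrored pair of two different characters, so the second pattern accepts it.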

The fourth bird
  • Could you provide some references or a quick edit with an explanation of how this works? I'm not trying to solve anything in particular, just learning how stuff works. While regex documentation is a good place to read the theory, it does not have many practical examples of complex usage. – Banana Jun 09 '20 at 10:10
  • @Banana You have probably missed a site like rexegg.com – Wiktor Stribiżew Jun 09 '20 at 10:15
  • Thanks for the quick edit and explanation. I understand the syntax, I just cannot properly understand why some things are required the way they are. For instance, how does the word boundary help in this situation? The negative lookahead is relatively clear, but the first `\w*` is a bit confusing: why isn't the backreference enough for inverting? The negative lookahead in your example is similar to mine, it just has the `\w*` as an addition. Adding the extra characters at the end kind of makes sense, but it raises the question of why the same (`\w+`) isn't needed in front of the negative lookahead. – Banana Jun 09 '20 at 10:20
  • @WiktorStribiżew I saw that you also posted a pattern in the comments. Perhaps that will also help clarify the pattern for the OP. – The fourth bird Jun 09 '20 at 10:21
  • @Banana You mention words in your question, using a dot could possibly also match a space, that is why I used a word character. – The fourth bird Jun 09 '20 at 10:22
  • No, I understand using the different syntax, that's not an issue. I just have to admit that I'm conceptually weak. Maybe I'm taking too much at a time from your answer, and if I break it down into points it will be easier to understand what I mean. For instance, let's take the negative lookahead. `(\w)(\w)\2\1` finds the correct pairs. Why doesn't just applying the negative lookahead in front of it invert it (I'm kind of thinking of it like flipping a boolean here)? Why does adding `\w*` fix the issue? – Banana Jun 09 '20 at 10:28
  • If you want to match whole words and you don't use `\w*`, the lookahead only checks that what is **directly** to the right of the word boundary does not match the 2 groups and their backreferences. – The fourth bird Jun 09 '20 at 10:33
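
The point about `\w*` can be checked with a short Python sketch. The names `BARE`, `AT_START_ONLY` and `ANYWHERE` are made up for this illustration; `BARE` is the question's pattern wrapped in a negative lookahead, and `AT_START_ONLY` is the answer's pattern with the `\w*` removed.

```python
import re

# A lookahead on its own consumes nothing, so it "matches" the empty string
# at the first position where the inner pattern fails.
BARE = re.compile(r"(?!(((.)(.)).*\4\3))")
# Without \w*, only the first four characters of the word are inspected.
AT_START_ONLY = re.compile(r"\b(?!(\w)(\w)\2\1)\w+\b")
# With \w*, the mirrored pair may start anywhere in the word.
ANYWHERE = re.compile(r"\b(?!\w*(\w)(\w)\2\1)\w+\b")

# At index 0 "abba" matches the inner pattern, so the lookahead fails there;
# at index 1 the remaining text "bba" does not, so an empty match is reported.
print(BARE.search("abba").span())                # → (1, 1)

print(bool(AT_START_ONLY.fullmatch("abba")))     # → False (pair starts the word)
print(bool(AT_START_ONLY.fullmatch("smelled")))  # → True  (elle is never checked)
print(bool(ANYWHERE.fullmatch("smelled")))       # → False
```

So a negative lookahead is not a boolean switch over the whole string: it is a test made at one position, which is why the pattern needs `\b` to fix that position at the start of a word and `\w*` to let the test scan the rest of the word.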