1

I'm using regex with python and trying to figure out the best way to match a pattern where the order of two words I'm searching for doesn't matter, but they must be adjacent. So for example, I'm searching for either the phrase "fat cat lasagna co" or "cat fat lasagna co", and I have to imagine there's a better way than just r"\b(fat cat|cat fat) lasagna co\b"

I read this question which addressed a similar problem but the words didn't have to be adjacent and couldn't figure out how to apply it to my problem.

jesseWUT
  • 581
  • 4
  • 14

1 Answers1

1

There is no strictly better solution, but there's an alternative.

Now, if you have two normal words like "fat" and "cat", then (fat cat|cat fat) is undoubtedly the best solution. But what if you have 5 words? Or if you have more complex patterns than just fat and cat that you don't want to type twice?

Say instead of fat and cat you have 3 regex patterns A, B and C, and instead of the space between fat and cat you have the regex pattern S. In that case, you could use this recipe:

(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B)|(?!\4)()(?:C))){3}

If you don't have an S, this can be simplified to

(?:(?!\1)()(?:A)|(?!\2)()(?:B)|(?!\3)()(?:C)){3}

(Note: (?:X) can be simplified to X if X doesn't contain an alternation |.)

Example

If we set A = fat, B = cat and S = space, we get:

(?:(?:(?!\1)()|\1 )(?:(?!\2)()fat|(?!\3)()cat)){2}

Try it online.


Explanation

In essence, we're using capture groups to "remember" which patterns have already matched. To do so, we use this little pattern here:

(?!\1)()some_pattern

What does this do? It's a regex that matches exactly once. Once it has matched, it won't ever match again. If you try to add a quantifier around that pattern like (?:(?!\1)()some_pattern)* it'll match either once or won't match at all.

The trick there is the usage of a backreference to a capture group before that group has even been defined. Because capture groups are initialized with a "failed to match" state, the negative lookahead (?!\1) will match successfully - but only the first time. Because right afterwards, the capture group () matches and captures the empty string. From this point forward, the negative lookahead (?!\1) will never match again.

With this as a building block, we can create a regex that matches fatcat and catfat while only containing the words fat and cat once:

(?:(?!\1)()fat|(?!\2)()cat){2}

Because of the negative lookaheads, each word can only match at most once. Adding a {2} quantifier at the end guarantees that each of the two words matches exactly once, or the entire match fails.

Now we just need to find a way to match a space between fat and cat. Well, that's just a slight variation of the same pattern:

(?:(?!\1)()|\1 )

This pattern will match the empty string on its first match, and on each subsequent match it'll match a space.

Put it all together, and voilà:

(?:(?:(?!\1)()|\1 )(?:(?!\2)()fat|(?!\3)()cat)){2}

Templates (for the lazy)

2 patterns A and B, with separator S:

(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B))){2}

3 patterns A, B and C, with separator S:

(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B)|(?!\4)()(?:C))){3}

4 patterns A, B, C and D, with separator S:

(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B)|(?!\4)()(?:C)|(?!\5)()(?:D))){4}

2 patterns A and B, without S:

(?:(?!\1)()(?:A)|(?!\2)()(?:B)){2}

3 patterns A, B and C, without S:

(?:(?!\1)()(?:A)|(?!\2)()(?:B)|(?!\3)()(?:C)){3}

4 patterns A, B, C and D, without S:

(?:(?!\1)()(?:A)|(?!\2)()(?:B)|(?!\3)()(?:C)|(?!\4)()(?:D)){4}
Aran-Fey
  • 39,665
  • 11
  • 104
  • 149