1

Is there a way to match unique groups of characters (words in the case below) in order of occurrence, purely in regex? If so, how does that expression compare in efficiency to a non-regex solution? I'm working with Python's flavor, but I would be interested in a solution for any other flavor, as well.

Here's a sample case:

string = 'the floodwaters are rising along the coast'
unique = ['the', 'floadwaters', 'are', 'rising', 'along', 'coast']

In a Python-regex hybrid solution I could match the groups I want, and use a list comprehension to remove the duplicates while maintaining the order.

groups = re.findall('[a-zA-Z]+', string)
unique = [g for i, g in enumerate(groups) if g not in groups[:i]]

There are similar questions across the site, such as one that addresses matching unique words. The expression from the accepted answer, however, matches the furthest right occurence of a given group, while I want to match the first occurence. Here's that expression:

(\w+\b)(?!.*\1\b)
Zach Gates
  • 4,045
  • 1
  • 27
  • 51
  • Regex libraries differ. In Python, you may use PyPi `regex` library, and use [`\b(\w+)\b(?<!(?:.*\b\1\b){2})`](http://regexstorm.net/tester?p=%5cb%28%5cw%2b%29%5cb%28%3f%3c!%28%3f%3a.*%5cb%5c1%5cb%29%7b2%7d%29&i=the+floodwaters+are+rising+along+the+coast). And in .NET, you may use it, too. – Wiktor Stribiżew Aug 31 '17 at 23:16
  • Usually, _unique_ in C++ for example, the order is not preserved. That's because the list has to be sorted first. –  Aug 31 '17 at 23:17
  • And `(\w+\b)(?!.*\1\b)` won't match the words as they first appear. It will match the dups at the end, not the beginning. Your best bet is to split to get all the words, then do your _unique_ from that set doing your own preservation. Regex for this will be unbelievably slowww........ –  Aug 31 '17 at 23:19
  • @WiktorStribiżew: That does, indeed, work with the `regex` library (kudos for the quick suggestion), but I'm looking for something widely (and preferably natively) compatible. I also can't seem to get it working with any online regex tester aside from the one you've linked. – Zach Gates Aug 31 '17 at 23:32
  • There is no "natively compatible" regex. As I said, all regex flavors are different. You can't do what you need with regex alone in the vast majority of cases. – Wiktor Stribiżew Aug 31 '17 at 23:34
  • @WiktorStribiżew: I'm aware that all flavors are different (I only mentioned "natively" in regard to the `regex` library), but it seems that particular expression is very narrowly supported. If you can authoritatively assert that what you've provided is the only way to achieve the intended result, post it as an answer and I'll be happy to accept it. – Zach Gates Aug 31 '17 at 23:44
  • Lookbehind can probably do all kinds of magic, but I'd seriously wonder if it were faster than just matching and filtering out duplicates later. – Bergi Aug 31 '17 at 23:50
  • @sln: That makes sense; Python also has unique, non-ordered `set`s. I'm more interested in learning how an expression like this (if possible at all) would work, rather than using it in production, because, as you said, it would have very poor efficiency. – Zach Gates Aug 31 '17 at 23:51
  • @Bergi: I can't imagine it being *more*, or even equally, efficient but I'd sure enjoy learning exactly how it would compare (if possible). – Zach Gates Aug 31 '17 at 23:54
  • @ZachGates Well you got Wiktor's solution and yours (even if it's just a boring list comprehension not optimised lookup), benchmark them! :-) – Bergi Aug 31 '17 at 23:58
  • It is quicker to split the words into a list, or use a regex `\b\w+\b` to create the list. Traverse the list starting from the head. With each current head item, traverse the list from the end to the current head position, removing all head items. Move on to the next head item, repeat. Try this, it's not exactly a linear operation since the search width decreases as the next head item is checked. –  Sep 05 '17 at 15:03

1 Answers1

2

A regex-only solution for this kind of task is only possible with an infinite-width lookbehind.

However, a regex solution like this should only be considered for use when the input is relatively short: more than 100 words in an input string will make it very slow due to backtracking that is inevitable in this case. Thus, for a mere learning purpose, I will share the regex that is only supported in .NET and Python PyPi regex library (it is also possible to do in Vim as its lookbehind is also infinite-width, but I guess there are even simpler ways with that powerful tool).

(?s)\b(\w+)\b(?<!^.*\b\1\b.*\b\1\b)

See the regex demo

The (?s) part is an inline modifier that makes . match all line breaks. You may use regex.DOTALL in Python regex.

Details

  • \b - initial word boundary
  • (\w+) - Group 1: one or more word chars
  • \b - trailing word boundary
  • (?<!^.*\b\1\b.*\b\1\b) - an infinite width negative lookbehind that fails the match if the word matched into Group 1 happens to appear at least once before itself, i.e. if, immediately to the left of the current location (that is right after the word captured), there is a sequence of patterns:
    • ^ - start of string
    • .*\b\1\b - any zero or more chars, as many as possible and then the same value as in Group 1 as a whole word
    • .*\b\1\b - same as above (needed to match the captured word, since the lookbehind is used after the consumed word)

The .* in the lookbehind causes lots of backtracking, and the pattern will work rather slow in general, and very slow with large inputs and eventually might cause time outs.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563