12

Is it possible to use a regular expression to match all words, but match each unique word only once? I am aware there are other ways of doing this; however, I'm interested in knowing whether it is possible with a regular expression.

For example I currently have the following expression:

(\w+\b)(?!.*\1)

and the following string:

glass shoes door window door glasses. window glasses

For the most part the expression works and matches the following words:

shoes
door 
window
glasses

There are two issues with this:

  1. "glass" is treated as a substring of "glasses", so it is dropped as if it were a duplicate; this is incorrect.

  2. "glasses" and "glasses." should match but currently do not.

The final match should be:

shoes 
door 
window 
glasses 
glass 
Isomorph

3 Answers

11

Pretty close; just re-add the \b in the negative lookahead:

/(\w+\b)(?!.*\1\b)/

See it on Rubular
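
For anyone trying this outside Rubular, here is a minimal sketch using Python's `re` module (an assumption on my part; the pattern above is written with Ruby-style delimiters). It runs the question's pattern and the corrected one against the question's string. The variable names are illustrative only.

```python
import re

text = "glass shoes door window door glasses. window glasses"

# Question's pattern: the lookahead rejects "glass" because the
# characters "glass" occur later inside "glasses".
original = re.compile(r"(\w+\b)(?!.*\1)")

# This answer's pattern: \b after the back-reference means a later
# occurrence only counts if it also ends at a word boundary.
fixed = re.compile(r"(\w+\b)(?!.*\1\b)")

print(original.findall(text))
# ['shoes', 'door', 'window', 'glasses']
print(fixed.findall(text))
# ['glass', 'shoes', 'door', 'window', 'glasses']
```

Note that each word is reported at its last occurrence, since the lookahead rejects any occurrence that is followed by another one.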

maček
  • I'm wondering why `\1` apparently doesn't match the `\b` from the first group. Shouldn't `\1` match everything inside the group and not only `\w+`? – pemistahl Dec 27 '12 at 21:39
  • @maček Wow! Thank you so much, I have pretty much spent two days straight trying to figure this out! – Isomorph Dec 27 '12 at 21:42
  • @PeterStahl The reason why you have to add `\b` is to ensure that the negative lookahead matches a whole word and not just a substring, by default it matches substrings. – Isomorph Dec 27 '12 at 23:01
  • Just a note: `\b` is word boundary according to the word characters defined in `\w`, so there will be no word boundary around `_` in `apple_apple`. – nhahtdh Dec 28 '12 at 03:53
  • @pemistahl: `\1` matches whatever _characters_ are found in the 1st expression, but not the assertions. – Titus Aug 13 '16 at 16:03
  • @macek its not working in python. even i can not question. – GolamMazid Sajib Apr 12 '20 at 16:11
  • @WiktorStribiżew closes my question 2 times. i need help. is it possible in python? – GolamMazid Sajib Apr 12 '20 at 16:11
  • I want to learn to build such regular expressions. I know the basics but this is from a different planet. Please help me where to start. – ashish zarekar Aug 11 '23 at 11:25
3

To find distinct words in multiline text, use [\s\S] instead of ., since . does not match newlines by default:

(\b\w+\b)(?![\s\S]*\b\1\b)
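
A small sketch of the difference, assuming Python's `re` module and a made-up three-line input. With `.`, the lookahead cannot see past the end of the current line, so repeats on later lines go unnoticed. In Python specifically, the `re.DOTALL` flag should give the same result as `[\s\S]`.

```python
import re

# Hypothetical multiline input: "glass" and "door" repeat on later lines.
text = "glass door\nwindow door\nglass"

# "." stops at the newline, so the lookahead only checks the current line.
dot = re.compile(r"(\b\w+\b)(?!.*\b\1\b)")

# [\s\S] matches any character, newlines included.
any_char = re.compile(r"(\b\w+\b)(?![\s\S]*\b\1\b)")

# Same pattern as "dot", with re.DOTALL making "." match newlines too.
dotall = re.compile(r"(\b\w+\b)(?!.*\b\1\b)", re.DOTALL)

print(dot.findall(text))
# ['glass', 'door', 'window', 'door', 'glass']
print(any_char.findall(text))
# ['window', 'door', 'glass']
print(dotall.findall(text))
# ['window', 'door', 'glass']
```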
2

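Same as maček's answer, but with an extra \b before the back-reference. Otherwise, if you had

glass shoes door window door glasses. window glasses sunglasses

you'd miss the match for glasses, because the lookahead finds it inside the word sunglasses. Once the back-reference is anchored on both sides, the capture group needs a leading \b as well; without it, the scan restarts inside a rejected word and matches fragments such as oor.

/(\b\w+\b)(?!.*\b\1\b)/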
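
To see the effect on the sunglasses string, here is a hedged sketch in Python's `re` module (my choice of engine, not something this answer specifies). It compares three variants: \b only after the back-reference, \b on both sides of the back-reference with a leading \b on the capture group, and \b on both sides of the back-reference with no leading \b on the group. The last variant returns word fragments, because after a whole word is rejected the scan can restart one character further in. All names are illustrative.

```python
import re

text = "glass shoes door window door glasses. window glasses sunglasses"

# \b only after the back-reference: "glasses" is found inside "sunglasses".
right_only = re.compile(r"(\w+\b)(?!.*\1\b)")

# \b on both sides of the back-reference, and before the capture group.
both_sides = re.compile(r"(\b\w+\b)(?!.*\b\1\b)")

# \b on both sides of the back-reference, but the group can start mid-word.
unanchored_group = re.compile(r"(\w+\b)(?!.*\b\1\b)")

print(right_only.findall(text))
# ['glass', 'shoes', 'door', 'window', 'sunglasses']
print(both_sides.findall(text))
# ['glass', 'shoes', 'door', 'window', 'glasses', 'sunglasses']
print(unanchored_group.findall(text))
# ['glass', 'shoes', 'oor', 'indow', 'door', 'lasses', 'window', 'glasses', 'sunglasses']
```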

kevatron400