
I checked several posts about removing duplicated words from a string in JavaScript (in my case, a word means a sub-string delimited by spaces). The regex /(\b\S+\b)(?=.*\b\1\b)/g is among the ones I found on the internet; it handles almost all cases, but it produces some mismatches and I cannot work out why. For example, it removes characters such as , / - when they are part of a string (before a blank has been reached). I guess it has to do with the word boundary metacharacter \b, but I am not able to find a solution.

For example, I have the following string samples:

123-1 123-2 test-1 test-1 w/e 10/04/20
Company w/e 09/06/20 083020-090620
a/b 01/01
test_1 test_2
a/b a/b
Inv 50049 50049 Inv 50195 PrjPAN02
Inv 51360-1, 51366-7; 51372 Inv 51360-1, 51366-7; 372 PrjPAN02
Inv 51360-1, 51366-7; 51372 51372 Inv 513601, 51366-7; 372 PrjPAN02
55009, 55017, 55022 55001, 55022, 55025
55254, 61 55246,66,69
55733, 41, 44 55727, 45,48
57269, 71,74,75, 57354 57266, 73
57437, 38, 41, 43 57434, 40
w/e 09/20/20 091320-092020

and it generates the following output. You can test it here: Regex101

1232  test-1 we 1004/20
Company we 0906/20 083020-090620
ab /01
test_1 test_2
 a/b
  50049 Inv 50195 PrjPAN02
 , ; 51372 Inv 513601, 51366-7; 372 PrjPAN02
 513601, ;  51372 Inv 513601, 51366-7; 372 PrjPAN02
55009, 55017,  55001, 55022, 55025
55254, 61 5524666,69
55733, 41, 44 55727, 45,48
57269, 7174,75, 57354 57266, 73
57437, 38, 41, 43 57434, 40
we 09/20 091320-092020

I would expect the following output:

123-1 123-2 test-1 w/e 10/04/20
Company w/e 09/06/20 083020-090620
a/b 01/01
test_1 test_2
a/b
50049 Inv 50195 PrjPAN02
51372 Inv 51360-1, 51366-7; 372 PrjPAN02
51360-1, 51372 Inv 513601, 51366-7; 372 PrjPAN02
55009, 55017, 55022 55001, 55022, 55025
55254, 61 55246,66,69
55733, 41, 44 55727, 45,48
57269, 71,74,75, 57354 57266, 73
57437, 38, 41, 43 57434, 40
w/e 09/20/20 091320-092020

I would expect every repeated string delimited by spaces to be removed, but the regex removes the slash (/), hyphen (-) and comma (,) in some cases inside strings that are delimited by spaces.

I checked the following similar questions to try to find a regular expression that would match all the cases:

  1. Javascript RegExp + Word boundaries + unicode characters
  2. Remove duplicate words in a string using Regex JS [duplicate]
  3. Regular expression to find and remove duplicate words
David Leal
  • Please add your desired outcome, what about `123-1 123-2` gets substituted to `1232` ... ? – bobble bubble Dec 07 '21 at 00:04
  • \b is zero-length, so it matches just before the - or /, and then the - or / itself matches \S (non-whitespace). So the / on its own is a duplicate and matches the expression. Perhaps use \s instead of \b – Garr Godfrey Dec 07 '21 at 00:09
  • You could probably kludge a pretty reliable fix using: `/(\b[/,-]*\S+\b)(?=.*\b\1\b)/g` – Garr Godfrey Dec 07 '21 at 00:13
  • `(\b\S+\b)` matches `'-'` in `'a-b'`, for example. I suggest you use `(\b([a-zA-Z]+)\b)(?=.*\b\1\b)` if it is in fact duplicated words you wish to remove. – Cary Swoveland Dec 07 '21 at 01:16
  • Your regex suggests that when you say 'word' you do not mean it in the linguistic sense. Please edit to define 'word'. – Cary Swoveland Dec 07 '21 at 01:37
  • Corrected the initial sample, the result and added the expected result, as well as what I mean by word (not linguistic word). In the above example I want to remove every repeated string delimited by space – David Leal Dec 07 '21 at 03:41
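
The behaviour described in the comments above can be reproduced with one of the sample lines. This is only a minimal sketch: the variable names are made up for illustration, and it simply lists what the pattern from the question matches in `a/b 01/01`.

```javascript
// Pattern from the question. \b can sit between a word character and
// punctuation, so \S+ may match only part of a space-delimited token.
const original = /(\b\S+\b)(?=.*\b\1\b)/g;

const sample = 'a/b 01/01';

// List what the pattern actually matches: a lone "/" (repeated later in
// "01/01") and the first "01" (repeated at the end of the same token).
const matches = sample.match(original);
console.log(matches); // [ '/', '01' ]

// Removing those matches reproduces the broken output from the question.
const broken = sample.replace(original, '');
console.log(broken); // ab /01
```

Neither match is a whole space-delimited token, which is why the slash and part of the date disappear.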

1 Answer


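Word boundaries do not work here, because \b also matches between a word character and punctuation such as - or /. Use whitespace lookarounds instead: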

/(?<!\S)(\S+)(?!\S)(?=.*(?<!\S)\1(?!\S))/g

EXPLANATION

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
      \S                       non-whitespace (all but \n, \r, \t,
                               \f, and " ")
--------------------------------------------------------------------------------
    )                        end of look-behind
--------------------------------------------------------------------------------
    \1                       what was matched by capture \1
--------------------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
      \S                       non-whitespace (all but \n, \r, \t,
                               \f, and " ")
--------------------------------------------------------------------------------
    )                        end of look-ahead
--------------------------------------------------------------------------------
  )                        end of look-ahead
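
As a rough sketch of how the pattern could be applied in JavaScript (lookbehind needs an ES2018-capable engine): the helper name `removeDuplicateTokens` and the whitespace clean-up step are assumptions added for illustration, not part of the pattern itself. The regex only removes the token, so the surrounding spaces are tidied separately.

```javascript
// Pattern from the answer: a token is a run of non-whitespace with no
// non-whitespace directly before or after it, and the same token must
// appear again later on the same line.
const duplicateToken = /(?<!\S)(\S+)(?!\S)(?=.*(?<!\S)\1(?!\S))/g;

// Hypothetical helper for a single line of text: drop the earlier copies,
// then collapse the spaces left behind (the clean-up is an added assumption).
function removeDuplicateTokens(line) {
  return line
    .replace(duplicateToken, '')
    .replace(/ +/g, ' ')
    .trim();
}

const samples = [
  '123-1 123-2 test-1 test-1 w/e 10/04/20',
  'a/b 01/01',
  'a/b a/b',
  'Inv 50049 50049 Inv 50195 PrjPAN02',
];

for (const line of samples) {
  console.log(removeDuplicateTokens(line));
}
// 123-1 123-2 test-1 w/e 10/04/20
// a/b 01/01
// a/b
// 50049 Inv 50195 PrjPAN02
```

Because `.` does not match a newline, the duplicate search stays within one line, so the helper is meant to be called once per line.
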
Ryszard Czech
  • Your regex matches the first comma in `', b ,'`, though I'm not sure if that's a mistake. David, can you clarify the question? – Cary Swoveland Dec 07 '21 at 01:28
  • Thanks @RyszardCzech, it works; I tested it on my real example and will do additional testing before accepting your answer. I guess it could be simplified (though it is a matter of preference) by using `\s` to look behind and ahead for a space instead of two negations, for example: `/(?<=\s)(\S+)(?=\s)(?=.*(?<=\s)\1(?=\s))/g`. I seem to get the same result; is that correct? – David Leal Dec 07 '21 at 05:43
  • @DavidLeal Just to point out that this pattern with `\S`, in a test string like `Inv 51360-1, 57434, 51360-1;` with a duplicate `51360-1` where one ends with a comma and the other with a semicolon, would not match `51360-1,` because it looks for this string including the comma ([demo](https://regex101.com/r/K4EQEN/2)). However, this might be the desired behavior; if not, you need to clarify this in the question. – bobble bubble Dec 07 '21 at 11:47
  • @bobble bubble Yes, that is correct: `51360-1,` and `51360-1;` are different. As I mentioned in the question, words are sub-strings delimited by spaces, not by commas or semicolons. If I wanted such a case to count as a match, I would prefer to do a clean-up first to standardize the punctuation, so that the matching condition stays the same. – David Leal Dec 07 '21 at 13:30
  • @DavidLeal `(?<=\s)` and `(?=\s)` do not account for beginning and end of the string. `(?<!\S)` and `(?!\S)` do. – Ryszard Czech Dec 07 '21 at 22:14
  • The solution provided by @RyszardCzech can be tested here: https://regex101.com/r/5j5d3e/2/ – David Leal Dec 07 '21 at 22:42
  • Thanks, @RyszardCzech. For the sample I posted I verified that it works. Off the top of your head, is there any situation where your solution works and `/(?<=\s)(\S+)(?=\s)(?=.*(?<=\s)\1(?=\s))/g` doesn't? Thanks – David Leal Dec 07 '21 at 22:50
  • @DavidLeal Here is one: https://regex101.com/r/5j5d3e/3 – Ryszard Czech Dec 07 '21 at 22:55
  • @RyszardCzech I am curious, I know I didn't ask about it. Is there a way to remove the second occurrence of the repeated sub-string instead of the first one (from left to right)? Is that possible? Your solution removes the first occurrence, for example: `Hello my friend Hello` returns: `my friend Hello` instead of: `Hello my friend`. Thanks – David Leal Dec 07 '21 at 22:56
  • @DavidLeal Yes. `(?<!\S)(\S+)(?!\S)(?<=(?:.*(?<!\S)\1(?!\S)){2})`. [Proof](https://regex101.com/r/5j5d3e/4). – Ryszard Czech Dec 07 '21 at 22:59
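
The two points from the comments above (string edges, and removing the later occurrence instead of the earlier one) can be compared side by side. This is a small sketch using the `Hello my friend Hello` example from the thread; the constant names and the `tidy` helper are hypothetical, added only to make the output easy to read.

```javascript
const sentence = 'Hello my friend Hello';

// Variant proposed in the comments: requires an actual whitespace character
// on both sides, so tokens at the start or end of the string never qualify.
const needsSpaces = /(?<=\s)(\S+)(?=\s)(?=.*(?<=\s)\1(?=\s))/g;

// Pattern from the answer: "no non-whitespace next to it" also holds at
// the string edges. It removes the earlier copy of a repeated token.
const removeEarlier = /(?<!\S)(\S+)(?!\S)(?=.*(?<!\S)\1(?!\S))/g;

// Pattern from the last comment: matches a token only when the text up to
// and including it already contains that token twice, so the later copy goes.
const removeLater = /(?<!\S)(\S+)(?!\S)(?<=(?:.*(?<!\S)\1(?!\S)){2})/g;

// Hypothetical helper: collapse leftover spaces after a removal.
const tidy = (s) => s.replace(/ +/g, ' ').trim();

const withSpaces = tidy(sentence.replace(needsSpaces, ''));
const keepLast = tidy(sentence.replace(removeEarlier, ''));
const keepFirst = tidy(sentence.replace(removeLater, ''));

console.log(withSpaces); // Hello my friend Hello
console.log(keepLast);   // my friend Hello
console.log(keepFirst);  // Hello my friend
```

The first variant leaves the sentence unchanged because both copies of `Hello` touch a string edge.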