1

In the following example sentence:

Green shirt green hat

Is it possible to use regex to detect 2 identical words and replace the second with "and", so that it becomes:

Green shirt and hat


A more difficult string example. Here the first of the identical words needs to be replaced:

You are an artistically gifted musically gifted individual

Should become:

You are an artistically and musically gifted individual

CyberJunkie
  • 3
    hm, your second phrase is not just another example but widens the scope of your initial statement to *find a regex which can detect and replace the nth word of a sequence of identical words*? – le_m May 30 '16 at 22:52
  • 1
    with your second example you're branching into lexicological parsing which is technically beyond the scope of a regular expression. – Ro Yo Mi May 30 '16 at 22:56
  • @RoYoMi It is however possible with js *regex*: 'You are an artistically gifted musically gifted individual'.replace(/(\b\S+\b)(.+)(\1)\b/gi, 'and$2$1'); – le_m May 31 '16 at 00:25

5 Answers

7

Description

First off, regex isn't the ideal tool for this, but I'm sure you have your reasons for using it.

((\b[a-z]{1,}\b).*?)(\b\2\b)(.*)$

Replace with: \1and\4

Summary

This regex finds the first word that occurs twice in a string and replaces its second occurrence with "and".

Example

Live Demo

https://regex101.com/r/yG3yM6/2

Sample text

Green shirt green hat
Green shirt greenish hat
You are an artistically gifted musically gifted individual

Sample Matches

Green shirt and hat
Green shirt greenish hat
You are an artistically gifted musically and individual

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
      \b                       the boundary between a word char (\w)
                               and something that is not a word char
----------------------------------------------------------------------
      [a-z]{1,}                any character of: 'a' to 'z' (at least
                               1 times (matching the most amount
                               possible))
----------------------------------------------------------------------
      \b                       the boundary between a word char (\w)
                               and something that is not a word char
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
----------------------------------------------------------------------
    \2                       what was matched by capture \2
----------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  $                        before an optional \n, and the end of a
                           "line"
----------------------------------------------------------------------
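
The pattern explained above can be tried directly in JavaScript. This is a minimal sketch, assuming the `i` flag (so that `[a-z]` also matches the capital G) and JavaScript's `$1`/`$4` replacement syntax in place of `\1`/`\4`; the helper name `replaceSecond` is hypothetical.

```javascript
// Hypothetical helper wrapping the regex from this answer.
// Assumption: the `i` flag is set, because the sample text mixes "Green" and "green".
function replaceSecond(line) {
  // $1 = everything up to the second occurrence, $4 = the rest of the line.
  return line.replace(/((\b[a-z]{1,}\b).*?)(\b\2\b)(.*)$/i, '$1and$4');
}

// Run the helper over the sample text, one line at a time.
[
  'Green shirt green hat',
  'Green shirt greenish hat',
  'You are an artistically gifted musically gifted individual'
].forEach(function (line) {
  console.log(replaceSecond(line));
});
```

Each line is passed in separately here, so the `$` anchor always refers to the end of that line.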

Extra credit

Although not addressed in the OP, if the words in question contain characters outside a-z, you could replace [a-z] with [a-z]|[^\x00-\x7F], which also matches non-ASCII characters such as accented letters. Because \b is defined in terms of \w, which by default covers only ASCII word characters, the \b\2\b must then become (?<=\s|^)\2(?=\s|$) to ensure correct matching.

((\b(?:[a-z]|[^\x00-\x7F]){1,}\b).*?)((?<=\s|^)\2(?=\s|$))(.*)$

Live Demo https://regex101.com/r/wD8yF5/2
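
A sketch of the extended pattern in JavaScript, assuming an engine with lookbehind support (ES2018 or later) and the `i` flag. The function name `replaceSecondNonAscii` is made up for illustration.

```javascript
// Hypothetical helper using the extended regex from this answer.
// Assumptions: the engine supports lookbehind, and the `i` flag is set.
function replaceSecondNonAscii(line) {
  var re = /((\b(?:[a-z]|[^\x00-\x7F]){1,}\b).*?)((?<=\s|^)\2(?=\s|$))(.*)$/i;
  // $1 = everything up to the second occurrence, $4 = the rest of the line.
  return line.replace(re, '$1and$4');
}

console.log(replaceSecondNonAscii('Naïve question naïve answer'));
console.log(replaceSecondNonAscii('Green shirt greenish hat'));
```

The lookarounds require whitespace or a string edge on both sides of the repeated word, which is why "greenish" is left alone.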

Ro Yo Mi
  • Thank you for the working code and explanation! What would be more ideal if not regex? – CyberJunkie May 30 '16 at 22:27
  • 1
    anyone that has to ask "can this be done with a regex", probably doesn't have a reason to use a regex... – jmoreno May 30 '16 at 22:27
  • 1
    @jmoreno, on the surface I agree, but I think asking the question "can this be done with regex" is fair because there are a lot of problems that can be solved with regex, although the complexity of the resulting expressions may make it difficult for a layperson to maintain. And those who don't understand Regex are the first to poo-poo their use. – Ro Yo Mi May 30 '16 at 22:33
  • 1
    @CyberJunkie, given short strings like this, a regular expression may work well. However to properly answer your question I'd need to see more examples of your more difficult strings. – Ro Yo Mi May 30 '16 at 22:33
  • @RoYoMi I added one more example to my question. Using the code provided, it would replace the second word which will not read correctly: `You are an artistically gifted musically and individual` – CyberJunkie May 30 '16 at 22:42
  • 1
    @CyberJunkie _"I added one more example to my question."_ Has original Question http://stackoverflow.com/revisions/37534118/1 been resolved? – guest271314 May 30 '16 at 23:05
  • Your regex fails for e.g. *'Green shirt greenish hat'* which it replaces with *'Green shirt andish hat'*. Also *'A green shirt green hat'* becomes *'A green shirt green handt'*. Add word boundary checks to the backreference. – le_m May 31 '16 at 00:42
  • Good find, I've corrected the expression to prevent that edge case. – Ro Yo Mi May 31 '16 at 01:14
  • Still has an issue with non a-z characters in words: *'Naïve question naïve answer'* becomes *'Naïve question andve answer'*. Using \S+ helps but needs further changes to your regex. – le_m May 31 '16 at 20:32
  • It's a good point, however since you're picking at edge cases, then `\S+` would allow numeric, and all symbols... well all characters that are not white space. I've updated the answer to cover non a-z characters. – Ro Yo Mi May 31 '16 at 21:27
2

By modifying this answer, you can do it:

console.log( myFunc("Green shirt green hat") );
console.log( myFunc("Big red eyed rabbits red Ferrari") );

function myFunc(str) {
    return str.replace(/\b(\w+)(.+)(\1)\b/gi, "$1$2and");
}
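
As a hedged variation on the function above, adding `\b` after the first group and before the backreference, and making the middle group lazy (`.+?`), keeps partial-word matches such as the two a's in `alpha` from being replaced and targets the nearest repeat rather than the last one. The name `myFuncStrict` is hypothetical.

```javascript
// Hypothetical stricter variant of myFunc.
function myFuncStrict(str) {
  // \b(\w+)\b : a whole word, captured as group 1
  // (.+?)     : as little text as possible up to the next repeat
  // \b\1\b    : the same word again, as a whole word
  return str.replace(/\b(\w+)\b(.+?)\b\1\b/gi, "$1$2and");
}

console.log(myFuncStrict("Green shirt green hat"));
console.log(myFuncStrict("Big red eyed rabbits red Ferrari"));
console.log(myFuncStrict("alpha"));
```

Because of the `g` flag, matching resumes after each replaced repeat, so a third occurrence that has no later partner is left unchanged.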
blex
  • Better add `\b` after the first capturing group and before the backreference - otherwise see what happens with the string `alpha` – Sebastian Proske May 30 '16 at 22:42
  • Your regex replaces the *last* identical word by 'and', not necessarily the *second*: `'Green shirt green hat green gloves'.replace(/\b(\w+)(.+)(\1)\b/gi, "$1$2and"); // Green shirt green hat and gloves` – le_m May 31 '16 at 00:08
1

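You can use the RegExp /(\bgreen\b)/ig, where green is the word to match, together with String.prototype.replace() and a replacement function that returns "and" whenever its p2 argument is truthy. From the documentation of the replacement function's parameters: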

p1, p2, ... The nth parenthesized submatch string, provided the first argument to replace() was a RegExp object. (Corresponds to $1, $2, etc. above.) For example, if /(\a+)(\b+)/, was given, p1 is the match for \a+, and p2 for \b+.

replace green with and

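var str = "Green shirt green hat green";
// With a single capture group, the callback is called with (match, p1, offset, string),
// so the parameter named p2 here holds the match offset, not a second submatch.
// The offset is 0 (falsy) only for a match at the very start of the string.
var re = function(m, p1, p2, index) {
  return p2 ? "and" : m
}
str = str.replace(/(\bgreen\b)/ig, re);
console.log(str); // "Green shirt and hat and"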
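
One way to avoid depending on the callback's positional arguments is to count matches in a closure. This is only a sketch: `replaceRepeats` and its parameters are hypothetical names, and the word is escaped before being placed in the pattern.

```javascript
// Hypothetical helper: keep the first occurrence of `word`,
// replace every later occurrence with `replacement`.
function replaceRepeats(str, word, replacement) {
  if (replacement === undefined) replacement = 'and';
  // Escape regex metacharacters so `word` is matched literally.
  var escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  var re = new RegExp('\\b' + escaped + '\\b', 'gi');
  var seen = 0; // counter kept in the closure, not in a global
  return str.replace(re, function (match) {
    seen += 1;
    return seen === 1 ? match : replacement;
  });
}

console.log(replaceRepeats('Green shirt green hat green', 'green'));
console.log(replaceRepeats('A green shirt green hat', 'green'));
```

Unlike the offset check, the counter also works when the first occurrence is not at the start of the string.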
guest271314
  • 1
    This only seems to work because the first "Green" is capitalized and thus not matched by "green". Your regex doesn't match the second of two identical words. – le_m May 30 '16 at 22:26
  • @le_m _"This only seems to work because the first "Green" is capitalized and thus not matched by "green". Your regex doesn't match the second of two identical words."_ Well, this is the string provided at original Question. Returns expected results described at OP. Changing string at OP? "Green" and "green" are different words; the Question could also be described as matching the first occurrence of "green" beginning with lowercase "g". What do you suggest? Matching both `"G"` and `"g"`? – guest271314 May 30 '16 at 22:28
  • 3
    OP wants a general regular expression to *"detect 2 identical words and replace the second"*. The given phrase is only an example. Also, /green{1,}/ matches 'green', 'greenn', 'greennn' and so on, probably not what OP wanted. – le_m May 30 '16 at 22:33
  • 3
    @guest271314 given this is a question-and-answer site for programming, I'm sure the OP was looking for something more scalable. With 35k earned points on 2,719+ answers I'm sure you've seen quite a few questions that were ummm... lacking in content or sample text. – Ro Yo Mi May 30 '16 at 22:35
  • @RoYoMi See updated post. _"I'm sure you've seen quite a few questions that were ummm... lacking in content or sample text."_ Yes. Given original string, original stacksnippets met requirement. – guest271314 May 30 '16 at 22:45
  • Your revised answer is good. Perhaps you could move the 'global' i into a closure? – le_m May 30 '16 at 22:48
0

You can use the following:

/(\b([^\s]+)\b.*?)\b\2\b/gi

Test case:

var regex = /(\b([^\s]+)\b.*?)\b\2\b/gi;
'Green shirt green hat with blue shoes blue glasses'.replace(regex, '$1and')
  === 'Green shirt and hat with blue shoes and glasses';
'Orange colored oranges orange belts'.replace(regex, '$1and')
  === 'Orange colored oranges and belts';

Andreas Louv
0

The answer to your first example, which I read as replacing the second occurrence of the first repeated word with 'and', is:

var str = 'Green shirt green hat';

str = str.replace(/(\b\S+\b)(.+?)(\b\1\b)/i, '$1$2and');

console.log(str);

The answer to your second example, which I read as replacing the first occurrence of the first repeated word with 'and', is:

var str = 'You are an artistically gifted musically gifted individual';

str = str.replace(/(\b\S+\b)(.+?)(\b\1\b)/i, 'and$2$1');

console.log(str);
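
Both replacements share the same pattern and differ only in the replacement string, so they can be folded into one function. A minimal sketch; the function name `joinRepeated` and its `which` parameter are assumptions for illustration.

```javascript
// Hypothetical helper: replace either the 'first' or the 'second'
// occurrence of the first repeated word with 'and'.
function joinRepeated(str, which) {
  var re = /(\b\S+\b)(.+?)(\b\1\b)/i;
  // $1 = the repeated word, $2 = the text between the two occurrences.
  var replacement = which === 'first' ? 'and$2$1' : '$1$2and';
  return str.replace(re, replacement);
}

console.log(joinRepeated('Green shirt green hat', 'second'));
console.log(joinRepeated('You are an artistically gifted musically gifted individual', 'first'));
```

Without the `g` flag, only the first repeated word in the string is affected.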
le_m