20

I found this regex in an answer about finding repeated words in a string. But when I use it, it treats `This` and `is` as the same word and deletes the `is`.

Regex

"\\b(\\w+)\\b\\s+\\1"

Any idea why this is happening?

Here is the code that I am using for duplicate removal

public static String RemoveDuplicateWords(String input)
{
    String originalText = input;
    String output = "";
    Pattern p = Pattern.compile("\b(\w+)\b\s+\b\1\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE); 
    //Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\1", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(input);
    if (!m.find())
        output = "No duplicates found, no changes made to data";
    else
    {
        while (m.find())
        {
            if (output == "")
                output = input.replaceFirst(m.group(), m.group(1));
            else
                output = output.replaceAll(m.group(), m.group(1));
        }
        input = output;
        m = p.matcher(input);
        while (m.find())
        {
            output = "";
            if (output == "")
                output = input.replaceAll(m.group(), m.group(1));
            else
                output = output.replaceAll(m.group(), m.group(1));
        }
    }
    return output;
}
Manos Nikolaidis
user1190265
  • I believe it should be: \b(\w+)\b\s+\1\b or else it would think 'ice' and 'icecream' are duplicates. – Niall Byrne Feb 05 '12 at 06:11
  • http://rubular.com/r/Qr3twc03RR (I adjusted it again, it looks like a word boundary problem... \b(\w+)\b\s+\b\1\b ) – Niall Byrne Feb 05 '12 at 06:18
  • Adding another word boundary to the end works perfectly for me. But even without that, your regex should never have matched `This is`. Your problem may lie elsewhere, though I can't imagine where that would be. – Alan Moore Feb 05 '12 at 09:08
  • Though you have your answer, you might consider changing your approach. A basic tokenizer and a Set like structure is more understandable and probably more efficient. – M Platvoet Feb 05 '12 at 10:37
  • The regex is correct now, but you need to double up all those backslashes again. As it is, the code won't even compile. Also, you're doing an amazing amount of unnecessary work. The whole method could be written as `return input.replaceAll("(?i)\\b(\\w+)\\s+\\1\\b", "$1");` – Alan Moore Feb 05 '12 at 10:44
  • @user1190265 : hope the problem is solved... – Fahim Parkar Feb 09 '12 at 13:30

7 Answers

40

Try this one:

String pattern = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
Pattern r = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);

String input = "your string";
Matcher m = r.matcher(input);
while (m.find()) {
    input = input.replaceAll(m.group(), m.group(1));
}
System.out.println(input);

The Java regular expressions are explained very well in the API documentation of the Pattern class. After adding some spaces to indicate the different parts of the regular expression:

"(?i) \\b ([a-z]+) \\b (?: \\s+ \\1 \\b )+"

(?i)     enable case-insensitive matching
\b       match a word boundary
[a-z]+   match a word with one or more characters;
         the parentheses capture the word as a group    
\b       match a word boundary
(?:      indicates a non-capturing group (which starts here)
\s+      match one or more white space characters
\1       is a back reference to the first (captured) group;
         so the word is repeated here
\b       match a word boundary
)+       indicates the end of the non-capturing group and
         allows it to occur one or more times
Freek de Bruijn
Mina Wissa
  • The answer is working perfectly. Though it's been too long, can you please elaborate on the regex part? – Vineet Tyagi Jun 23 '17 at 10:50
  • What exactly is the non-capturing group construct `?:` doing? It seems to make no difference in the results if I delete it. – RichArt Feb 23 '18 at 23:20
  • I don't know why people are voting for other answers: they catch only one repetition, or any alphanumeric string instead of a natural-language word. This is the correct solution. Just a note: there is one unnecessary word boundary \b that increases the cost of the regex, i.e. this `\b([a-z]+)(\s\b\1\b)+` takes about 10% fewer steps in my test: https://regex101.com/r/GVUshn/3 – tuxErrante Jun 19 '18 at 16:59
10

The pattern below matches a duplicated word no matter how many times it is repeated.

Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE); 

For example, "This is is my my my pal pal pal pal pal pal pal pal" will output "This is my pal".

Also, a single "while (m.find())" loop is enough with this pattern.
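A rough sketch of that single loop, assuming the output is rebuilt from the text between matches plus group 1 of each match. The class name `AnyRunDedup` and the method `dedupe` are hypothetical, chosen only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnyRunDedup {
    static String dedupe(String input) {
        Pattern p = Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*",
                Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the previous match
        while (m.find()) {
            // Copy the untouched text before this match, then only the first word of the run.
            out.append(input, last, m.start()).append(m.group(1));
            last = m.end();
        }
        return out.append(input.substring(last)).toString();
    }

    public static void main(String[] args) {
        System.out.println(dedupe("This is is my my my pal pal pal pal pal pal pal pal"));
    }
}
```

Note that the `*` quantifier means every word matches, repeated or not; unrepeated words are simply replaced by themselves.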

9

You should have used \b(\w+)\b\s+\b\1\b.

Hope this is what you want...

Update 1

Well well well, the output that you have is

the final string after removing duplicates

import java.util.regex.*;

public class MyDup {
    public static void main (String args[]) {
    String input="This This is text text another another";
    String originalText = input;
    String output = "";
    Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\b\\1\\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(input);
    System.out.println(m);
    if (!m.find())
        output = "No duplicates found, no changes made to data";
    else
    {
        while (m.find())
        {
            if (output == "") {
                output = input.replaceFirst(m.group(), m.group(1));
            } else {
                output = output.replaceAll(m.group(), m.group(1));
            }
        }
        input = output;
        m = p.matcher(input);
        while (m.find())
        {
            output = "";
            if (output == "") {
                output = input.replaceAll(m.group(), m.group(1));
            } else {
                output = output.replaceAll(m.group(), m.group(1));
            }
        }
    }
    System.out.println("After removing duplicate the final string is " + output);
    }
}

Run this code and see what you get as output... Your queries will be solved...

Note

In output you are replacing each duplicate with a single word, aren't you?

When I put System.out.println(m.group() + " : " + m.group(1)); in the first if condition, I get text text : text as output, i.e. the duplicates are being replaced by a single word.

else
    {
        while (m.find())
        {
            if (output == "") {
                System.out.println(m.group() + " : " + m.group(1));
                output = input.replaceFirst(m.group(), m.group(1));
            } else {

Hope you got now what is going on... :)

Good Luck!!! Cheers!!!
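One possible explanation for the stripped `is`, offered as a guess because the thread never confirms it: the loop passes `m.group()` (for example the text `is is`) to `replaceFirst`/`replaceAll`, which treat it as a new regex with no word boundaries. That regex can match inside `This is`. The sketch below uses a made-up input and hypothetical names (`LiteralReplacePitfall`, `viaMatchedText`, `viaBackReference`) to contrast it with the one-line version from the comments on the question.

```java
final class LiteralReplacePitfall {
    // Mirrors the loop body above: the matched text "is is" is reused as an unanchored regex.
    static String viaMatchedText(String input) {
        return input.replaceFirst("is is", "is");
    }

    // Single pass with a back reference and word boundaries.
    static String viaBackReference(String input) {
        return input.replaceAll("(?i)\\b(\\w+)\\s+\\1\\b", "$1");
    }

    public static void main(String[] args) {
        String input = "This is an is is";
        System.out.println(viaMatchedText(input));   // This an is is
        System.out.println(viaBackReference(input)); // This is an is
    }
}
```

In the first call the match starts in the middle of `This`, so the standalone `is` disappears while the real duplicate at the end survives.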

Manos Nikolaidis
Fahim Parkar
  • Thanks, I will try that...Regex always kicks my @#$ – user1190265 Feb 05 '12 at 07:48
  • Still does not work, I still get the is in This is removed: \nThis is is an example example of duplicate. using the following code: Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\b\\1\\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE); //Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\1", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE); Matcher m = p.matcher(input); – user1190265 Feb 05 '12 at 07:58
  • If i do not use double backslash it gives: 54: illegal escape character Pattern p = Pattern.compile("\b(\w+)\b\s+\b\1\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE); – user1190265 Feb 05 '12 at 08:35
  • The double backslashes are necessary because the regexes are in the form of Java string literals. And please don't try to take the discussion off site like that. Whatever source code we need should be included in the question where everyone can see it. And @OP, code snippets don't belong in comments, either. Edit your question instead and add the code to it. – Alan Moore Feb 05 '12 at 09:03
  • Changed to use "\\b(\\w+)\\b\\s+\\1\\b" works fine now. I will test more to ensure tomorrow. – user1190265 Feb 05 '12 at 10:06
  • It works for a single line, but when i read in my file with multiple lines of text it strips out too much. When Input = "This is is is is is is is an example example example of duplicate.\nThis is is another another example example." then the output is: "This an example of duplicate. This another example." – user1190265 Feb 05 '12 at 23:54
  • Can somebody explain how the second 'while' loop specifically removes the duplicate words that start with a capital letter? E.g. in this example I noticed that the first while loop removes the 'text' and 'another' words, but then it moves to the second while loop to remove 'This'. How are these two while loops differentiated to do these specific tasks? To me both look the same. – GeorgeT Oct 01 '22 at 02:11
5
\b(\w+)(\b\W+\1\b)*

Explanation:

\b : Any word boundary
(\w+) : Capture one or more word characters (letters, digits, underscore)

Once a word is captured, the rest of the pattern looks for its repetitions.

( : Grouping starts
\b : Any word boundary
\W+ : One or more non-word characters
\1 : The same word again (back reference to group 1)
\b : Word boundary; rejects the match if the repeated word is only the start of a longer word
) : Grouping ends; the trailing * lets the group repeat zero or more times
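To see what the last \b buys you, here is a small comparison of the pattern with and without it, using the 'ice' / 'icecream' case mentioned in the comments on the question. The class and method names are invented for this sketch.

```java
final class TrailingBoundaryDemo {
    // Without the final \b the back reference may match the start of a longer word.
    static String withoutTrailingBoundary(String s) {
        return s.replaceAll("\\b(\\w+)(\\b\\W+\\1)*", "$1");
    }

    // With the final \b the repeated word must end where the match ends.
    static String withTrailingBoundary(String s) {
        return s.replaceAll("\\b(\\w+)(\\b\\W+\\1\\b)*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(withoutTrailingBoundary("ice icecream")); // icecream
        System.out.println(withTrailingBoundary("ice icecream"));    // ice icecream
    }
}
```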


m0nhawk
imbond
2

I believe this is the regular expression you should be using to detect two consecutive identical words separated by one or more non-word characters:

Pattern p = Pattern.compile("\\b(\\w+)\\b\\W+\\b\\1\\b", Pattern.CASE_INSENSITIVE);
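As a quick illustration of why \W+ might be preferred over \s+ as the separator, the sketch below runs both variants on a string where the duplicates are separated by a comma. The names `SeparatorDemo`, `dedupeNonWordSeparated` and `dedupeWhitespaceSeparated` are hypothetical.

```java
final class SeparatorDemo {
    // \W+ accepts punctuation as well as whitespace between the two words.
    static String dedupeNonWordSeparated(String s) {
        return s.replaceAll("(?i)\\b(\\w+)\\b\\W+\\b\\1\\b", "$1");
    }

    // \s+ accepts whitespace only.
    static String dedupeWhitespaceSeparated(String s) {
        return s.replaceAll("(?i)\\b(\\w+)\\b\\s+\\b\\1\\b", "$1");
    }

    public static void main(String[] args) {
        System.out.println(dedupeNonWordSeparated("hello, hello world"));    // hello world
        System.out.println(dedupeWhitespaceSeparated("hello, hello world")); // hello, hello world
    }
}
```

Since this pattern matches exactly one pair, a run of three such as "my my my" is reduced to "my my" in one pass; the `(...)*` or `(...)+` forms from the other answers handle longer runs.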
anubhava
1

If Unicode characters matter, then you should use this:

 Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*",
        Pattern.MULTILINE + Pattern.CASE_INSENSITIVE + Pattern.UNICODE_CHARACTER_CLASS)
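A minimal sketch of that flag in use, assuming a word with an accented letter (written as the escape `\u00e9` to avoid source-encoding problems). The class name `UnicodeWordDedup` and the method `dedupe` are made up for the example.

```java
import java.util.regex.Pattern;

final class UnicodeWordDedup {
    // UNICODE_CHARACTER_CLASS makes \w, \W and \b follow Unicode definitions.
    private static final Pattern P = Pattern.compile(
            "\\b(\\w+)(\\b\\W+\\b\\1\\b)*",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);

    static String dedupe(String input) {
        return P.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        // "café café au lait"
        System.out.println(dedupe("caf\u00e9 caf\u00e9 au lait"));
    }
}
```

Combining the flags with `|` instead of `+` gives the same value here and is the more common idiom for bit flags.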
András
0

You can also try this regex, which finds only repeated words:

(?i)\\b(\\w+)(\\b\\W+\\b\\1\\b){1,}
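Because `{1,}` requires at least one repetition, this pattern can be used to list the repeated words rather than remove them. The following is just a sketch of that idea; `RepeatedWordFinder` and `repeatedWords` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedWordFinder {
    // Collects the first occurrence of every word that is immediately repeated.
    static List<String> repeatedWords(String input) {
        Matcher m = Pattern.compile("(?i)\\b(\\w+)(\\b\\W+\\b\\1\\b){1,}").matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(repeatedWords("This is is my my my pal")); // [is, my]
    }
}
```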