13

I need to write a regex that identifies a word with a repeating character set at the end. In the following code fragment, the repeating character set is An. I need a regex that spots this and displays it.

In the following code, \\w matches any word character (a letter, a digit, or the underscore). But I only want to match English letters.

String stringToMatch = "IranAnAn";
Pattern p = Pattern.compile("(\\w)\\1+");
Matcher m = p.matcher(stringToMatch);
if (m.find())
{
    System.out.println("Word contains duplicate characters " + m.group(1));
}

UPDATE

Word contains duplicate characters a
Word contains duplicate characters a
Word contains duplicate characters An
Sharon Watinsan
  • 3
    How many characters are considered "repeating"? Do you want to flag `banana` (although it's a valid word) and `mama` (only repeating sets)? How about `zoo` (repeating a single character) or `tomtom` (repeating three characters)? If you want a match of "just English characters", use `[A-Za-z]` for the character to match. – Floris Jul 22 '13 at 17:47

2 Answers

9

You want the group to capture as many characters as possible, so use (\\w+) instead of (\\w). You also want the sequence to be at the end of the word, so add $. I have removed the + after \\1, which is not needed to detect repetition: one repetition is enough.

Pattern p = Pattern.compile("(\\w+)\\1$");

Your program then outputs An as expected.

Finally, if you only want to capture ASCII letters, you can use [a-zA-Z] instead of \\w:

Pattern p = Pattern.compile("([a-zA-Z]+)\\1$");

And if you want the repeating character set to be at least 2 characters long:

Pattern p = Pattern.compile("([a-zA-Z]{2,})\\1$");
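To see how the three patterns above differ, here is a minimal sketch that runs each of them against a few sample words. The class name `RepeatedSuffixDemo`, the helper `repeatedSuffix`, and the test words `Iranaa` and `Iran` are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedSuffixDemo {
    // Hypothetical helper: returns the repeated unit found at the end of the
    // word (capture group 1), or null when the pattern does not match.
    static String repeatedSuffix(String regex, String word) {
        Matcher m = Pattern.compile(regex).matcher(word);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] patterns = {
            "(\\w+)\\1$",          // any word characters, repeated once at the end
            "([a-zA-Z]+)\\1$",     // ASCII letters only
            "([a-zA-Z]{2,})\\1$"   // ASCII letters, unit of at least 2 characters
        };
        String[] words = {"IranAnAn", "Iranaa", "Iran"};

        for (String regex : patterns) {
            for (String word : words) {
                System.out.println(regex + " on " + word + " -> " + repeatedSuffix(regex, word));
            }
        }
    }
}
```

All three patterns report An for IranAnAn. The first two also report a for Iranaa, while the {2,} version reports nothing for it, which is the behaviour asked for in the comments.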
assylias
  • No, it doesn't work. I have added the output I got. It also detects the `a` as well. I only want to detect consecutive characters. – Sharon Watinsan Jul 22 '13 at 17:53
  • @sharonHwk I'm not sure I understand. I thought that with the input `IranAnAn`, you expected to find `An` - is that not what you meant? In your update, why would `a` be considered as a repeating character? – assylias Jul 22 '13 at 17:56
  • I only want it to output when it detects a repeating `An`. But it outputs when it detects a repeating `a`. – Sharon Watinsan Jul 22 '13 at 17:57
  • 2
    @sharonHwk Maybe instead `+` try using `{2,}` – Pshemo Jul 22 '13 at 17:58
  • Hmm yes, that seemed to work. But can you tell me what `{2,}` means? Thanks – Sharon Watinsan Jul 22 '13 at 17:59
  • @sharonHwk It means the length of the repeating character set must be 2 or more. If you want to ignore single characters that are repeated, then that is the way. – assylias Jul 22 '13 at 18:01
  • 1
    @sharonHwk `+` means that the element can repeat one or more times, `{2}` means exactly two times, `{1,4}` means between one and four times, `{2,}` means two or more times, or in other words at least two times. More info at http://www.regular-expressions.info/repeat.html – Pshemo Jul 22 '13 at 18:01
  • `{2,}` matches at least two characters, while `+` matches at least one character – Michael Lang Jul 22 '13 at 18:01
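The quantifiers described in the comments above can be checked with a small sketch. It uses `Pattern.matches`, which requires the whole input to match. The class name `QuantifierDemo`, the helper `fullMatch`, and the sample inputs are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: true only if the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String[] quantified = {"a+", "a{2}", "a{1,4}", "a{2,}"};
        String[] inputs = {"a", "aa", "aaaaa"};

        // Print, for each quantifier, which inputs it accepts in full.
        for (String regex : quantified) {
            for (String input : inputs) {
                System.out.println(regex + " matches " + input + ": " + fullMatch(regex, input));
            }
        }
    }
}
```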
1

If by "only English characters" you mean A-Z and a-z, the following regex will work:

".*([A-Za-z]{2,})\\1$"
Michael Lang
  • No, it doesn't work. I have added the output I got. It also detects the `a` as well. I only want to detect consecutive characters. – Sharon Watinsan Jul 22 '13 at 17:54
  • 2
    In the [Unicode table](http://unicode-table.com/en/#0060) there are other characters between `A` and `z`, such as `[`, `\`, `]`, `^`, `_` and `` ` ``. `[A-Za-z]` is more precise. – Pshemo Jul 22 '13 at 17:56
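The point in the comment above, that a single range from `A` to `z` also covers six non-letter characters, can be checked with a short sketch. The class name `RangeDemo` and the helper `inClass` are hypothetical, and `[A-z]` is used here only as an example of the broader range.

```java
import java.util.regex.Pattern;

class RangeDemo {
    // Hypothetical helper: true if the single character belongs to the character class.
    static boolean inClass(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        // The code points between 'Z' and 'a' are: [ \ ] ^ _ `
        for (char c = '['; c <= '`'; c++) {
            System.out.println(c + "  [A-z]: " + inClass("[A-z]", c)
                    + "  [A-Za-z]: " + inClass("[A-Za-z]", c));
        }
    }
}
```

Each of the six characters is accepted by `[A-z]` and rejected by `[A-Za-z]`.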