Writing a regex to detect repeat-characters

Question

I need to write a regex that identifies a word that has a repeating character set at the end. In the following code fragment, the repeating character set is An. I need to write a regex so that this is spotted and displayed.

In the following code, \\w matches any word character (a letter, a digit, or the underscore). But I only want to match English letters.

String stringToMatch = "IranAnAn";
Pattern p = Pattern.compile("(\\w)\\1+");
Matcher m = p.matcher(stringToMatch);
if (m.find())
{
    System.out.println("Word contains duplicate characters " + m.group(1));
}

UPDATE

Word contains duplicate characters a
Word contains duplicate characters a
Word contains duplicate characters An

How many characters are considered "repeating"? Do you want to flag `banana` (although it's a valid word) and `mama` (only repeating sets)? How about `zoo` (repeating a single character) or `tomtom` (repeating three characters)? If you want to match "just English characters", use `[A-Za-z]` for the character to match. – Floris, Jul 22 '13 at 17:47

assylias · Accepted Answer · 2013-07-22T18:01:56.867

9

You want the group to capture as many characters as possible, so use (\\w+) instead of (\\w). You also want the sequence to be at the end of the word, so add $. I have removed the + after \\1, which is not needed to detect repetition, since one repetition is enough:

Pattern p = Pattern.compile("(\\w+)\\1$");

Your program then outputs An as expected.
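
A minimal sketch to check that behaviour. The class name `RepeatedSuffixDemo` and the helper `repeatedSuffix` are hypothetical names chosen for illustration, not part of the original code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedSuffixDemo {
    // (\w+) captures a run of word characters, \1 requires the same run
    // to follow immediately, and $ anchors the repetition at the end.
    private static final Pattern REPEATED_SUFFIX = Pattern.compile("(\\w+)\\1$");

    // Hypothetical helper: returns the repeated unit, or null if there is none.
    static String repeatedSuffix(String word) {
        Matcher m = REPEATED_SUFFIX.matcher(word);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(repeatedSuffix("IranAnAn")); // An
        System.out.println(repeatedSuffix("Iran"));     // null
        System.out.println(repeatedSuffix("zoo"));      // o (a single repeated character still counts)
    }
}
```

Note that with this pattern a doubled single character such as the `oo` in `zoo` is also reported.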

Finally, if you only want to match ASCII letters, you can use [a-zA-Z] instead of \\w:

Pattern p = Pattern.compile("([a-zA-Z]+)\\1$");

And if you want the repeated character set to be at least 2 characters long:

Pattern p = Pattern.compile("([a-zA-Z]{2,})\\1$");
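
As a rough illustration of the minimum-length variant, here is a small sketch run against the words raised in the comment on the question. The class name `MinLengthRepeatDemo` and the helper `repeatedUnit` are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MinLengthRepeatDemo {
    // {2,} requires the captured unit to be at least two ASCII letters long,
    // so a doubled single letter no longer counts as a repetition.
    private static final Pattern REPEATED_UNIT =
            Pattern.compile("([a-zA-Z]{2,})\\1$");

    // Hypothetical helper: returns the repeated unit, or null if there is none.
    static String repeatedUnit(String word) {
        Matcher m = REPEATED_UNIT.matcher(word);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(repeatedUnit("zoo"));      // null
        System.out.println(repeatedUnit("tomtom"));   // tom
        System.out.println(repeatedUnit("mama"));     // ma
        System.out.println(repeatedUnit("banana"));   // na
        System.out.println(repeatedUnit("IranAnAn")); // An
    }
}
```

Ordinary words such as `banana` are still flagged, because they do end in a repeated two-letter unit.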

edited Jul 22 '13 at 18:01

answered Jul 22 '13 at 17:45

assylias


No, it doesn't work. I have added the output I get. It also detects the `a`. I only want to detect consecutive characters. – Sharon Watinsan Jul 22 '13 at 17:53
@sharonHwk I'm not sure I understand. I thought that with the input `IranAnAn`, you expected to find `An` - is that not what you meant? In your update, why would `a` be considered as a repeating character? – assylias Jul 22 '13 at 17:56
I only want it to output when it detects a repeating `An`. But it outputs when it detects a repeating `a`. – Sharon Watinsan Jul 22 '13 at 17:57
@sharonHwk Maybe instead `+` try using `{2,}` – Pshemo Jul 22 '13 at 17:58
Hmm yes, that seemed to work. But can you tell me what `{2,}` means? Thanks – Sharon Watinsan Jul 22 '13 at 17:59
@sharonHwk It means the length of the repeating character set must be 2 or more. If you want to ignore single characters that are repeated, then that is the way. – assylias Jul 22 '13 at 18:01
@sharonHwk `+` means the element can repeat one or more times, `{2}` means exactly two times, `{1,4}` means between one and four times, and `{2,}` means two or more times, in other words at least two times. More info at http://www.regular-expressions.info/repeat.html – Pshemo Jul 22 '13 at 18:01
`{2,}` matches at least two characters, while `+` matches at least one character – Michael Lang Jul 22 '13 at 18:01
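
The quantifiers described in these comments can be tried out with a short sketch. It uses `String.matches`, which requires the whole input to match. The class name `QuantifierDemo` and the helper `fullMatch` are hypothetical:

```java
class QuantifierDemo {
    // Hypothetical helper: true only if the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("a+", "a"));         // true:  one or more
        System.out.println(fullMatch("a{2}", "aa"));      // true:  exactly two
        System.out.println(fullMatch("a{2}", "aaa"));     // false: three is not exactly two
        System.out.println(fullMatch("a{1,4}", "aaaa"));  // true:  between one and four
        System.out.println(fullMatch("a{1,4}", "aaaaa")); // false: five is more than four
        System.out.println(fullMatch("a{2,}", "a"));      // false: fewer than two
        System.out.println(fullMatch("a{2,}", "aaaaa"));  // true:  two or more
    }
}
```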

Michael Lang · Answer 2 · 2013-07-22T18:02:21.360

1

If by "only English characters" you mean A-Z and a-z, the following regex will work:

".*([A-Za-z]{2,})\\1$"

edited Jul 22 '13 at 18:02

answered Jul 22 '13 at 17:51

Michael Lang
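
The regex in this answer starts with `.*`, unlike the one in the accepted answer. A minimal sketch of where that prefix makes a difference, assuming the pattern may be used with either `Matcher.matches()` (which must consume the whole input) or `Matcher.find()` (which searches for a match anywhere). The class name `FindVsMatchesDemo` and its helpers are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVsMatchesDemo {
    static final Pattern WITH_PREFIX = Pattern.compile(".*([A-Za-z]{2,})\\1$");
    static final Pattern NO_PREFIX = Pattern.compile("([A-Za-z]{2,})\\1$");

    // Hypothetical helper: does the pattern match the entire word?
    static boolean matchesWhole(Pattern p, String word) {
        return p.matcher(word).matches();
    }

    // Hypothetical helper: does the pattern match somewhere in the word?
    static boolean findsAnywhere(Pattern p, String word) {
        return p.matcher(word).find();
    }

    public static void main(String[] args) {
        String word = "IranAnAn";
        System.out.println(matchesWhole(WITH_PREFIX, word)); // true: .* absorbs "Iran"
        System.out.println(matchesWhole(NO_PREFIX, word));   // false: "Iran" is left unmatched
        System.out.println(findsAnywhere(NO_PREFIX, word));  // true: find() skips ahead to "AnAn"

        Matcher m = WITH_PREFIX.matcher(word);
        if (m.matches()) {
            System.out.println(m.group(1)); // An
        }
    }
}
```

With `find()`, as in the question's code, the prefix is optional; with `matches()` it is needed.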


No, it doesn't work. I have added the output I get. It also detects the `a`. I only want to detect consecutive characters. – Sharon Watinsan Jul 22 '13 at 17:54
In the [Unicode table](http://unicode-table.com/en/#0060), between `A` and `z` there are also other characters, such as `[`, `\`, `]`, `^`, `_` and `` ` ``. `[A-Za-z]` is more precise. – Pshemo Jul 22 '13 at 17:56
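
A small sketch of the point made in this comment, assuming the comparison is between a single `A-z` range and the two ranges `A-Za-z`. The class name `LetterRangeDemo` and its helpers are hypothetical:

```java
class LetterRangeDemo {
    // Hypothetical helper: a single range from 'A' (65) to 'z' (122),
    // which also covers the six punctuation characters in between.
    static boolean inLooseRange(String s) {
        return s.matches("[A-z]");
    }

    // Hypothetical helper: two ranges that cover only the ASCII letters.
    static boolean isAsciiLetter(String s) {
        return s.matches("[A-Za-z]");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"A", "z", "[", "\\", "]", "^", "_", "`"}) {
            System.out.println(s + " loose=" + inLooseRange(s)
                    + " strict=" + isAsciiLetter(s));
        }
        // Only "A" and "z" are accepted by the strict class;
        // the loose range accepts all eight inputs.
    }
}
```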
