
I was looking at a "duplicate words" algorithm and found a solution that used the following regex.

(?i)\\b(\\w+)\\b[\\w\\W]*\\b\\1\\b

I tried to interpret the regex using the following site, http://public.kvalley.com/regex/regex.asp, but I am having a hard time. Can someone break down the regex and explain it to me?

Em Ae
  • Wouldn't it be better to actually come up with your own solution that you can understand and maintain? – millimoose Oct 14 '13 at 22:34
  • Also searching for "explain regexp" would point you to sites like [this one](http://rick.measham.id.au/paste/explain.pl?regex=%28%3Fi%29%5Cb%28%5Cw%2B%29%5Cb%5B%5Cw%5CW%5D*%5Cb%5C1%5Cb), which, honestly, will most likely tell you the same as any answers to this question. (Just make sure to undouble the backslashes, that's a Javaism, not actual RE syntax.) – millimoose Oct 14 '13 at 22:35

1 Answer

 (?i)      - case insensitive flag
 \\b       - word boundary
 (\\w+)    - 1 or more word characters (a-z, A-Z, 0-9, _) in a capturing group
 \\b       - word boundary
 [\\w\\W]* - 0 or more word or non-word characters (i.e. any characters, newlines included)
 \\b       - word boundary
 \\1       - backreference to the text captured by group 1
 \\b       - word boundary
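
To see the pieces working together, here is a minimal sketch that runs the regex from the question through java.util.regex. The class name DuplicateWordDemo, the method firstDuplicate and the sample sentences are made up for illustration; only the pattern itself comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DuplicateWordDemo {
    // The pattern from the question. The backslashes are doubled only because
    // this is a Java string literal; the regex engine sees \b, \w, \W and \1.
    static final Pattern DUPLICATE =
            Pattern.compile("(?i)\\b(\\w+)\\b[\\w\\W]*\\b\\1\\b");

    /** Returns the first word that is repeated later in the text, or null if there is none. */
    static String firstDuplicate(String text) {
        Matcher m = DUPLICATE.matcher(text);
        // group(1) is the first occurrence, i.e. what (\w+) captured.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDuplicate("The quick fox saw the dog")); // The
        System.out.println(firstDuplicate("no repeats here"));           // null
    }
}
```

Note that find() reports only that some word is repeated and which one was found first; it does not list every duplicate.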

You may want to look at the Java tutorials for Regular Expressions. All these are explained there.

Multiple uses of Boundary
The Java tutorial section on Boundary Matchers explains what \\b matches: the position at the start or end of a word. Since this regex is looking for duplicate words, the boundaries make sure that each match is an entire word and not part of a longer word that merely contains it.
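
A small sketch of the difference, assuming a hypothetical variant of the pattern with every \\b removed (that variant is not from the question, and the class and method names are illustrative only):

```java
import java.util.regex.Pattern;

final class BoundaryDemo {
    // Pattern from the question, with word boundaries.
    static final Pattern WITH_BOUNDARIES =
            Pattern.compile("(?i)\\b(\\w+)\\b[\\w\\W]*\\b\\1\\b");
    // Hypothetical variant with the boundaries stripped, for comparison only.
    static final Pattern WITHOUT_BOUNDARIES =
            Pattern.compile("(?i)(\\w+)[\\w\\W]*\\1");

    static boolean hasDuplicate(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "cat concatenate";
        // "cat" only reappears inside "concatenate", so it is not a whole-word duplicate.
        System.out.println(hasDuplicate(WITH_BOUNDARIES, text));    // false
        System.out.println(hasDuplicate(WITHOUT_BOUNDARIES, text)); // true
    }
}
```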

Case Insensitive
As mentioned by Pshemo, this is used so that the \\1 backreference will still match if the case is different, e.g. when the first word of a sentence is repeated later in lowercase.
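
As a rough illustration, the sketch below compares the pattern with and without the (?i) flag. The variant without the flag, the class name and the sample sentence are assumptions made for the example.

```java
import java.util.regex.Pattern;

final class CaseFlagDemo {
    // Pattern from the question: (?i) makes the whole match, including \1, case insensitive.
    static final Pattern IGNORE_CASE =
            Pattern.compile("(?i)\\b(\\w+)\\b[\\w\\W]*\\b\\1\\b");
    // Hypothetical variant without the flag: \1 must repeat the captured text exactly.
    static final Pattern CASE_SENSITIVE =
            Pattern.compile("\\b(\\w+)\\b[\\w\\W]*\\b\\1\\b");

    static boolean hasDuplicate(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "The cat saw the dog";
        System.out.println(hasDuplicate(IGNORE_CASE, text));    // true  ("The" ... "the")
        System.out.println(hasDuplicate(CASE_SENSITIVE, text)); // false
    }
}
```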

Use of [\\w\\W]*
Again as mentioned by Pshemo, this is probably used in place of . so that newline characters are matched. The dot is the regex special character for any character, but by default it is not guaranteed to match newline characters; .* could be used in place of [\\w\\W]* if the dotall flag (?s) was also included. The * quantifier (0 or more) lets this part span anything from just the separator between two adjacent duplicates to any number of words and characters in between.
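
The three options can be compared side by side. This is only a sketch: the two .* variants, the class name and the sample text are assumptions for illustration, and only the [\\w\\W]* pattern comes from the question.

```java
import java.util.regex.Pattern;

final class DotallDemo {
    // Pattern from the question: [\w\W] matches every character, newlines included.
    static final Pattern CHAR_CLASS =
            Pattern.compile("(?i)\\b(\\w+)\\b[\\w\\W]*\\b\\1\\b");
    // Hypothetical variant with a plain dot: by default it stops at line terminators.
    static final Pattern PLAIN_DOT =
            Pattern.compile("(?i)\\b(\\w+)\\b.*\\b\\1\\b");
    // Hypothetical variant with the dotall flag (?s): the dot now matches newlines too.
    static final Pattern DOTALL =
            Pattern.compile("(?is)\\b(\\w+)\\b.*\\b\\1\\b");

    static boolean hasDuplicate(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        // The two occurrences of "hello" are on different lines.
        String text = "hello\nworld hello";
        System.out.println(hasDuplicate(CHAR_CLASS, text)); // true
        System.out.println(hasDuplicate(PLAIN_DOT, text));  // false
        System.out.println(hasDuplicate(DOTALL, text));     // true
    }
}
```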

Java Devil
  • Nice and simple, +1. Also, `[\\w\\W]` was probably used instead of `.` so that newline characters are matched too. And in case someone wonders why `(?i)` is used here: it lets `\\1` match the characters of group one case-insensitively. – Pshemo Oct 14 '13 at 22:39
  • Sometimes in languages I see stuff like `/regex/i` for case insensitive. Would those same languages let you use `/(?i)regEx/` as a synonym? What's the difference between these? – Daniel Kaplan Oct 14 '13 at 22:41
  • Is it possible for you to explain why `\b` has been used multiple times, why `[\w\W]` is used, etc.? – Em Ae Oct 14 '13 at 22:44
  • @tieTYT Which languages? I can't say I've seen that before, but I'm guessing that it would be language dependent. – Java Devil Oct 14 '13 at 22:58
  • @JavaDevil http://stackoverflow.com/a/5744627/61624 – Daniel Kaplan Oct 14 '13 at 23:07
  • @pshemo Instead of the cryptic `[\\w\\W]*` I prefer `(?is)` at the front then simply `.*`, `(?s)` being the "dotall" switch so dot matches newline. – Bohemian Oct 14 '13 at 23:31
  • @tieTYT No, those languages most likely wouldn't let you replace it like that, as it likely wouldn't be recognised by that language's regex parser. Try it in a jsfiddle. – Java Devil Oct 14 '13 at 23:38
  • @JavaDevil Doesn't work in javascript: http://jsfiddle.net/xRXQq/ I don't know much about regex, but I'm getting the impression it's fragmented like browser compatibility is. – Daniel Kaplan Oct 15 '13 at 00:09
  • @Bohemian Yes, I also prefer combining `(?is)` and using `.` instead of `[\\w\\W]`. I assume this regex was created by someone who either didn't know about the `dotall` flag or couldn't use it. For example in JS, if w3schools is not lying, [dot `.`](http://www.w3schools.com/jsref/jsref_regexp_dot.asp) can't match a newline either, but JS regex doesn't have a dotall [flag](http://www.w3schools.com/jsref/jsref_obj_regexp.asp), so it seems that `[\\w\\W]` must be used there. – Pshemo Oct 15 '13 at 00:10