1

Example text: In the park, child plays. Child is tall. Child watches another child at play.

I want to match "child" in the first sentence and "Child" in the second and third sentences, but NOT "child" in the third sentence. In other words, match "Child" or "child", but not if it is preceded by the word "another".

I thought I could do it using a negative lookbehind:

 ((?<!another) [Cc]hild)

but can't seem to get the syntax correct to produce a valid regexp.

Even if I could get the syntax right, I am not sure I can do it in GWT. Here is a snippet from the GWT Javadoc:

Java-specific constructs in the regular expression syntax (e.g. [a-z&&[^bc]], (?<=foo), \A, \Q) work only in the pure Java implementation, not the GWT implementation,...

Any help or insight would be appreciated.

update:

Colin's answer almost works but isn't quite right.

Colin's regex does match "Child" and "child" and not match "another child" like I asked. There are a few problems though.

What I am trying to do is match on "Child" and "child" so they can be replaced with either the child's name or the correct pronoun he/she, depending on the child's gender.

The problem with Colin's regex is that it matches ", child" and ". Child". It also doesn't match "Child" if that is the first word in the text. For example:

"Child went to the park. In the park, child plays. Child is tall. Child watches another child at play."

The first Child does not match. The subsequent matches are on ", child", ". Child", and ". Child".

I worked on the regex that Colin came up with, trying to get it to match just "child" or "Child", but I can't make it work.

nlv

2 Answers

1

The regex support in GWT is at the same level as JavaScript's RegExp, since it just calls through to the native JavaScript classes.

I can't think of a way to reject "another child" directly in the regex, given that JavaScript regex doesn't support lookbehind or possessive quantifiers.

Therefore, I will write a regex so that, if "another" appears before "child", then "another" will definitely be matched; otherwise, only "child" will be matched. You can then filter out the matches that have more than 5 characters.

RegExp.compile("(?:another +)?[Cc]hild", "g")

Note that "child" in the string "some children" will also be matched. And if "another" is embedded inside a longer word, for example "ranother" (see footnote 1), then we will blindly pick up the fragment. To prevent such cases, we need to add a word boundary check \b (see footnote 2):

RegExp.compile("(?:\\banother +)?\\b[Cc]hild\\b", "g")
                   ---           ---        ---
                    |             |          |
            prevent "ranother"  prevent "children"
              from matching        or "nochild"
                                  from matching

You may also allow case-insensitive matching (which is quite reasonable for text) with the i flag. However, I will leave it up to you to decide.

Using the regex above, we will always match "another child" before matching "child". Therefore, when the match only contains "child", we know that "another" does not precede it. Therefore, we can filter away the matches with length > 5, and we are left with only the valid strings.
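The match-then-filter idea above can be sketched as follows. This is only an illustration under some assumptions: it uses java.util.regex as a stand-in so that it runs outside GWT, and the class and method names (ChildMatcher, findStandaloneChildOffsets, replaceStandaloneChild) and the replacement name "Sam" are made up for the example. In GWT the same loop would presumably be written with RegExp.exec and MatchResult.getGroup(0) instead of Matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChildMatcher {
    // Same pattern as above: optionally swallow "another " so that it can be filtered out later.
    private static final Pattern PATTERN =
            Pattern.compile("(?:\\banother +)?\\b[Cc]hild\\b");

    /** Returns the start offset of every "child"/"Child" that is not preceded by "another". */
    public static List<Integer> findStandaloneChildOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = PATTERN.matcher(text);
        while (m.find()) {
            // A match longer than 5 characters includes the "another " prefix, so skip it.
            if (m.group().length() == 5) {
                offsets.add(m.start());
            }
        }
        return offsets;
    }

    /** Replaces each standalone "child"/"Child" and leaves "another child" untouched. */
    public static String replaceStandaloneChild(String text, String replacement) {
        Matcher m = PATTERN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String matched = m.group();
            // Put back the original text when the match carries the "another " prefix.
            String substitute = matched.length() == 5 ? replacement : matched;
            m.appendReplacement(out, Matcher.quoteReplacement(substitute));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "Child went to the park. In the park, child plays. "
                + "Child is tall. Child watches another child at play.";
        System.out.println(replaceStandaloneChild(sample, "Sam"));
    }
}
```

Because the filter runs on each whole match, the punctuation and space before "child" are never part of an accepted match, and a "Child" at the very start of the text is matched as well.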

Footnotes

  1. I use a made-up word as an example. It is perfectly normal in an arbitrary string, but I don't know if there is any word in English with "another" embedded inside.

  2. There is a caveat here. "child4" or "child_something" will not be matched when \b is used, while the "another" in "_another child" or "5another child" will not be picked up by the regex (only "child" is matched, which means you accept the match). It is possible to work around this, and I will do so if you request it.

nhahtdh
  • This does not match at all on the text I gave as an example. It doesn't match on anything. Colin's answer is close but not exactly working the way I need it to. – nlv Feb 16 '13 at 17:01
  • I tested your regex that has the word boundary and it didn't match on anything. I just realized that it isn't working because you escaped the word boundary "\\b" when it should be "\b". It works now on matching Child, child, and "another child". I have to think about whether filtering matches with length>5 would work for me. – nlv Feb 16 '13 at 17:28
  • @nlv: I find it strange that `\b` works, since `\b` in Java string is backspace character. On that reasoning, `\\b` should be interpreted internally as forward slash + `b`, which is then fed to RegExp object. – nhahtdh Feb 16 '13 at 18:10
  • Forward slash + `b`? Don't you mean **backslash** + `b`? And isn't the `compile` method deprecated? Why use that instead of the `RegExp` constructor? – Alan Moore Feb 16 '13 at 20:29
  • @AlanMoore: I confuse the names of the slash, whatever that is for ``\``. And how is it deprecated? http://google-web-toolkit.googlecode.com/svn/javadoc/2.1/com/google/gwt/regexp/shared/RegExp.html – nhahtdh Feb 16 '13 at 20:34
  • I'm going by [this](http://stackoverflow.com/a/884784/20938) and [this](https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Deprecated_and_obsolete_features?redirectlocale=en-US&redirectslug=JavaScript%2FReference%2FDeprecated_Features#RegExp_Properties). – Alan Moore Feb 16 '13 at 21:26
  • 1
    Alan, the Java/GWT Regex class ends up getting compiled out to JavaScript, and the static method being used here ends up just being a call to the constructor you linked - see http://code.google.com/p/google-web-toolkit/source/browse/trunk/user/src/com/google/gwt/regexp/super/com/google/gwt/regexp/shared/RegExp.java?r=7517#33 for how that is implemented. – Colin Alworth Feb 16 '13 at 23:18
-1

match "Child" or "child" but not if preceded by the word "another"

([^(?:another)] [Cc]hild)

This captures a group that doesn't start with another (using the negated character set of the non-capture group), then a space, then the word child, capitalized or not. Is the space a requirement? You had it in your original, and it is present in all four test cases in your example. Making this slightly more useful (what are you actually trying to capture?), starting the only capture group around child:

[^(?:another)] ([Cc]hild)

Using MDN documentation on supported browser regex features: https://developer.mozilla.org/en-US/docs/JavaScript/Guide/Regular_Expressions

Test case:

public void testHomeworkRegex() {
  String sample = "In the park, child plays. Child is tall. Child watches another child at play.";
  RegExp regex = RegExp.compile("[^(?:another)] ([Cc]hild)", "g");//using global flag to match multiple times

  MatchResult result1 = regex.exec(sample);
  assertNotNull(result1);
  assertEquals("child", result1.getGroup(1));

  MatchResult result2 = regex.exec(sample);
  assertNotNull(result2);
  assertEquals("Child", result2.getGroup(1));

  MatchResult result3 = regex.exec(sample);
  assertNotNull(result3);
  assertEquals("Child", result3.getGroup(1));


  MatchResult result4 = regex.exec(sample);
  assertNull(result4);
}
Colin Alworth
  • I don't think you know what you are doing with the regex. Try with the string `Child for child no child`. You will fail to match the last 2. – nhahtdh Feb 16 '13 at 02:18
  • This regex works better than nhahtdh's but there are still problems with it. It is matching on the period or comma and the space in front of child. It also does not match Child if that is the first word in the text. nhahtdh is right that this regex does not match at all on the string he gave. – nlv Feb 16 '13 at 17:12
  • 1
    If this regex seems to work better, it's because you're not testing it rigorously. `[^(?:another)]` matches exactly one character, which can be anything except `(`, `?`, `:`, `a`, `n`, `o`, `t`, `h`, `e`, `r`, or `)`. It only looks at the last character of the preceding word, and if it's one of the characters listed above, the match will fail. – Alan Moore Feb 16 '13 at 21:16
  • Alan Moore, thanks, you're right, I clearly am misremembering how the character set works. I had assumed that because you easily built a FSM that doesn't accept `child` after `another`, that I could use this syntax to describe that same machine in regex. Thanks for pointing this out, I'll hopefully either update this answer or add a note pointing out my error. – Colin Alworth Feb 16 '13 at 23:10
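
The character-class behaviour described in the comments above can be checked with a small sketch. It assumes java.util.regex as a stand-in for the GWT RegExp class, and the class and method names (CharClassDemo, firstMatch) are made up for the example. The pattern is the one from this answer, unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharClassDemo {
    // [^(?:another)] is a negated character class: it matches exactly ONE character
    // that is not any of ( ? : a n o t h e r )
    private static final Pattern PATTERN = Pattern.compile("[^(?:another)] ([Cc]hild)");

    /** Returns the full text of the first match, or null when nothing matches. */
    public static String firstMatch(String text) {
        Matcher m = PATTERN.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The comma is not in the class, so this matches, but the match includes ", ".
        System.out.println(firstMatch("In the park, child plays."));
        // "for" ends in 'r' and "no" ends in 'o', both in the class, so nothing matches.
        System.out.println(firstMatch("Child for child no child"));
        // "another" is only rejected because it happens to end in 'r'.
        System.out.println(firstMatch("another child"));
    }
}
```

So the pattern rejects any word ending in one of the listed characters, not just "another", and it cannot match "Child" at the very start of the text because it always needs one character and a space in front.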