3

I am trying to check whether a string contains exactly one occurrence of a word,

e.g.

String : `jjdhfoobarfoo` , Regex : `foo` --> false

String : `wewwfobarfoo` , Regex : `foo` --> true

String : `jjfffoobarfo` , Regex : `foo` --> true

Multiple occurrences of `foo` may appear anywhere in the string, so they need not be consecutive.

I tested the following regex in Java with the string `foobarfoo`, but it doesn't work; it returns true:

static boolean testRegEx(String str){
    return str.matches(".*(foo)(?!.*foo).*");
}

I know this topic may look like a duplicate, but I am surprised because when I use the regex `(foo)(?!.*foo).*` it works!

Any idea why this happens?

Arian
  • The second regex matches the first input string, that is what happens. Although, it will return `false` for the second input example. – jlordo Jun 28 '13 at 23:19
  • But generally the string may not start with `foo` – Arian Jun 28 '13 at 23:23
  • The question is now edited: `foo` may appear anywhere in the string, and so may the other occurrences of `foo` – Arian Jun 28 '13 at 23:24

5 Answers

2

Use two anchored look-aheads:

static boolean testRegEx(String str){
    return str.matches("^(?=.*foo)(?!.*foo.*foo.*$).*");
}

Two key points: the negative look-ahead that checks for two occurrences of foo is anchored to the start of the input and, importantly, contains an end-of-input anchor.
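
As a quick check, the sketch below runs that pattern against the three strings from the question plus one input with no foo at all. The class name `TwoLookaheadDemo` and the constant name are made up for illustration, and the inputs are assumed to be single-line, since `.` does not match line terminators by default.

```java
import java.util.regex.Pattern;

class TwoLookaheadDemo {
    // Positive look-ahead: at least one "foo". Negative look-ahead: no two "foo"s.
    private static final Pattern EXACTLY_ONE_FOO =
            Pattern.compile("^(?=.*foo)(?!.*foo.*foo.*$).*");

    static boolean testRegEx(String str) {
        return EXACTLY_ONE_FOO.matcher(str).matches();
    }

    public static void main(String[] args) {
        System.out.println(testRegEx("jjdhfoobarfoo")); // false (two occurrences)
        System.out.println(testRegEx("wewwfobarfoo"));  // true
        System.out.println(testRegEx("jjfffoobarfo"));  // true
        System.out.println(testRegEx("bar"));           // false (no occurrence)
    }
}
```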

Bohemian
1

You can use this pattern:

^(?>[^f]++|f(?!oo))*foo(?>[^f]++|f(?!oo))*$

It's a bit long, but it performs well.

The same approach works for a longer word such as ashdflasd:

^(?>[^a]++|a(?!shdflasd))*ashdflasd(?>[^a]++|a(?!shdflasd))*$

details:

(?>               # open an atomic group
    [^f]++        # all characters but f, one or more times (possessive)
  |               # OR
    f(?!oo)       # f not followed by oo
)*                # close the group, zero or more times

The possessive quantifier ++ is like the greedy quantifier + but doesn't allow backtracking.

The atomic group (?>..) is like a non-capturing group (?:..) but likewise doesn't allow backtracking.
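
A small illustration of the difference, using a toy pattern that is not part of the answer itself (the class name `PossessiveDemo` is hypothetical):

```java
class PossessiveDemo {
    public static void main(String[] args) {
        // Greedy: a+ first takes "aaa", then gives one 'a' back so the final 'a' can match.
        System.out.println("aaa".matches("a+a"));     // true
        // Possessive: a++ takes "aaa" and never gives anything back, so the final 'a' fails.
        System.out.println("aaa".matches("a++a"));    // false
        // Atomic group: once (?>a+) has matched, its contents are not re-tried.
        System.out.println("aaa".matches("(?>a+)a")); // false
    }
}
```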

These features are used here for performance (memory and speed), but the subpattern can be replaced by:

(?:[^f]+|f(?!oo))*
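
For an arbitrary literal word, the same pattern can be composed from its first character and the remainder. The following is only a sketch: the helper name `exactlyOnce` is hypothetical, the word is assumed to be non-empty and to start with a character from the Basic Multilingual Plane, and the word is treated as literal text rather than as a regex.

```java
import java.util.regex.Pattern;

class ExactlyOncePattern {
    // Hypothetical helper: builds ^(?>[^F]++|F(?!REST))*WORD(?>[^F]++|F(?!REST))*$
    static Pattern exactlyOnce(String word) {
        if (word.isEmpty()) {
            throw new IllegalArgumentException("word must not be empty");
        }
        // Write the first character as a \uXXXX escape so it is safe inside [^...].
        String first = String.format("\\u%04x", (int) word.charAt(0));
        // Quote the remainder and the whole word so metacharacters are taken literally.
        String rest = Pattern.quote(word.substring(1));
        String filler = "(?>[^" + first + "]++|" + first + "(?!" + rest + "))*";
        return Pattern.compile("^" + filler + Pattern.quote(word) + filler + "$");
    }

    public static void main(String[] args) {
        Pattern foo = exactlyOnce("foo");
        System.out.println(foo.matcher("jjdhfoobarfoo").matches()); // false
        System.out.println(foo.matcher("wewwfobarfoo").matches());  // true
        System.out.println(exactlyOnce("ashdflasd").matcher("xxashdflasdyy").matches()); // true
    }
}
```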
Casimir et Hippolyte
  • So you mean that there's no general way to do for a longer pattern ? like for `ljkashdflasdfkjhasdflkjhasdlfkjhasdlkfjhasdlfjk` you can not do the same , right ? Please note that `foo` is only an example here – Arian Jun 28 '13 at 23:27
  • @ArianHosseinzadeh: You can do this with any string you want. All you need is to split the string after its first letter to compose your pattern dynamically. – Casimir et Hippolyte Jun 28 '13 at 23:28
  • Would you please elaborate on what `++|` is? And why don't you use `.*` anywhere? – Arian Jun 29 '13 at 00:54
1

If you want to check whether a string contains another string exactly once, here are two possible solutions (one with regex, one without):

static boolean containsRegexOnlyOnce(String string, String regex) {
    Matcher matcher = Pattern.compile(regex).matcher(string);
    return matcher.find() && !matcher.find();
}

static boolean containsOnlyOnce(String string, String substring) {
    int index = string.indexOf(substring);
    if (index != -1) {
        return string.indexOf(substring, index + substring.length()) == -1;
    }
    return false;
}

Both work fine. Here's a demo with your examples:

    String str1 = "jjdhfoobarfoo";
    String str2 = "wewwfobarfoo";
    String str3 = "jjfffoobarfo";
    String foo = "foo";
    System.out.println(containsOnlyOnce(str1, foo)); // false
    System.out.println(containsOnlyOnce(str2, foo)); // true
    System.out.println(containsOnlyOnce(str3, foo)); // true
    System.out.println(containsRegexOnlyOnce(str1, foo)); // false
    System.out.println(containsRegexOnlyOnce(str2, foo)); // true
    System.out.println(containsRegexOnlyOnce(str3, foo)); // true
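
Note that `containsRegexOnlyOnce` interprets its second argument as a regex. If the word is meant literally and may contain metacharacters, one option is to quote it first. A minimal sketch, with a hypothetical method name `containsLiteralOnlyOnce`:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralOnce {
    // Same find-twice idea as above, but the word is quoted so it is matched literally.
    static boolean containsLiteralOnlyOnce(String string, String literal) {
        Matcher matcher = Pattern.compile(Pattern.quote(literal)).matcher(string);
        return matcher.find() && !matcher.find();
    }

    public static void main(String[] args) {
        // Unquoted, the regex a.b would also match "axb" and report two occurrences.
        System.out.println(containsLiteralOnlyOnce("a.b and axb", "a.b")); // true
        System.out.println(containsLiteralOnlyOnce("a.b and a.b", "a.b")); // false
    }
}
```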
jlordo
1

The problem with your regex is that the first .* initially consumes the whole string, then backs off until it finds a spot where the rest of the regex can match. That means, if there's more than one foo in the string, your regex will always match the last one. And from that position, the lookahead will always succeed as well.
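
This can be observed by asking the matcher where the captured foo starts. The sketch below uses the regex from the question; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns the start index of the captured "foo", or -1 if the input does not match.
    static int capturedFooStart(String input) {
        Matcher m = Pattern.compile(".*(foo)(?!.*foo).*").matcher(input);
        return m.matches() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        // The leading .* swallows the first foo, so group 1 is the foo at index 6.
        System.out.println(capturedFooStart("foobarfoo")); // 6
    }
}
```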

Regexes that you use for validating have to be more precise than the ones you use for matching. Your regex is failing because the .* can match the sentinel string, 'foo'. You need to actively prevent matches of foo before and after the one you're trying to match. Casimir's answer shows one way to do that; here's another:

"^(?>(?!foo).)*+foo(?>(?!foo).)*+$"

It's not quite as efficient, but I think it's a lot easier to read. In fact, you could probably use this regex:

"^(?=.*foo)(?!.*foo.*foo).+$"

It's a great deal more inefficient, but a complete regex n00b would probably figure out what it does.

Finally, notice that none of these regexes (mine or Casimir's) uses lookbehinds. I know a lookbehind seems like the perfect tool for the job, but no. In fact, lookbehind should never be the first tool you reach for. And not just in Java. Whatever regex flavor you use, it's almost always easier to match the whole string in the normal way than it is to use lookbehinds. And usually much more efficient, too.

Community
Alan Moore
-1

Someone answered the question but then deleted the answer.

The following short code works correctly:

static boolean testRegEx(String str){
    return !str.matches("(.*?foo.*){0}|(.*?foo.*){2,}");
}

Any idea how to invert the result inside the regex itself?

Arian
  • 1
    What's the `{0}` for? It doesn't prevent a match with `foo` in it, if that's what you're thinking. In fact, it basically turns the first alternative into a no-op. There are legitimate uses for `{0}`, but this is not one of them. As for inverting the regex, you can wrap it in a negative lookahead, but I don't recommend it: `^(?!(?:(.*?foo.*){0}|(.*?foo.*){2,})$).+$` – Alan Moore Jun 29 '13 at 13:06
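
The point about `{0}` in the comment above can be checked directly. In this sketch (the class name `NegatedMatchDemo` is hypothetical), `testRegEx` is the method from this answer; with `matches()`, the `{0}` alternative only ever matches the empty string, so a non-empty input without any foo is still reported as true.

```java
class NegatedMatchDemo {
    // The method from the answer above, unchanged.
    static boolean testRegEx(String str) {
        return !str.matches("(.*?foo.*){0}|(.*?foo.*){2,}");
    }

    public static void main(String[] args) {
        System.out.println("".matches("(.*?foo.*){0}"));    // true: zero repetitions match only ""
        System.out.println("foo".matches("(.*?foo.*){0}")); // false
        System.out.println(testRegEx("foobarfoo"));         // false (two occurrences)
        System.out.println(testRegEx("wewwfobarfoo"));      // true  (one occurrence)
        System.out.println(testRegEx("bar"));               // true, although "bar" has no foo
    }
}
```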