1

Problem

I am trying to extract words from this input:

Pacific Gas & Electric (PG&E), San Diego Gas & Electric (SDG&E), Salt River Project (SRP), Southern California Edison (SCE)

I tried doing that online and my pattern (\w\s?&?\s?\(?\)?) seems to work.

But when I use the same pattern in my Java program, it does not match:

private static void findWords() {
    final Pattern PATTERN = Pattern.compile("(\\w\\s?&?\\s?\\(?\\)?)");
    final String INPUT = "Pacific Gas & Electric (PG&E), San Diego Gas & Electric (SDG&E), Salt River Project (SRP), Southern California Edison (SCE)";

    final Matcher matcher = PATTERN.matcher(INPUT);
    System.out.println(matcher.matches());
}

It prints false.

Question

  1. Why is there a mismatch? It seems my understanding is poor here.
  2. How can I get the words out as groups, meaning Pacific Gas & Electric (PG&E) as match group 1, and so on?
daydreamer
  • 1
    FYI: The differences between the [`java.util.regex.Matcher`](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html) functions ([`matches()`](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#matches()), [`find()`](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#find()), and [`lookingAt()`](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html)) are listed under "Flavor-Specific Information" in the [Stack Overflow Regular Expressions FAQ](http://stackoverflow.com/a/22944075/2736496). – aliteralmind Apr 10 '14 at 16:57

4 Answers

4

If you use the Matcher#find() method instead of Matcher#matches(), you'll get true as the outcome. The reason is that matches() behaves as if the pattern had implicit anchors, a caret (^) at the start and a dollar ($) at the end, so the regex has to match the entire string. If it does not, matches() returns false.
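To make the difference concrete, here is a minimal sketch that runs the pattern from the question through both methods. The class and method names (`MatchesVsFind`, `wholeInputMatches`, `someSubstringMatches`, `firstMatch`) are made up for illustration; only `Pattern`, `Matcher`, `matches()`, `find()` and `group()` come from `java.util.regex`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the question.
class MatchesVsFind {
    static final Pattern PATTERN = Pattern.compile("(\\w\\s?&?\\s?\\(?\\)?)");
    static final String INPUT = "Pacific Gas & Electric (PG&E), San Diego Gas & Electric (SDG&E), Salt River Project (SRP), Southern California Edison (SCE)";

    // matches() succeeds only if the pattern consumes the entire input.
    static boolean wholeInputMatches() {
        return PATTERN.matcher(INPUT).matches();
    }

    // find() succeeds if any substring of the input matches the pattern.
    static boolean someSubstringMatches() {
        return PATTERN.matcher(INPUT).find();
    }

    // Returns the first substring that find() locates, or null if there is none.
    static String firstMatch() {
        Matcher m = PATTERN.matcher(INPUT);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(wholeInputMatches());    // false
        System.out.println(someSubstringMatches()); // true
        System.out.println(firstMatch());           // P
    }
}
```

The first match is just the single character P, which already hints that the pattern describes one character at a time rather than a whole company name.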

Rohit Jain
3

You might want to re-evaluate the output you're getting from rubular.

From the documentation:

The matches method attempts to match the entire input sequence against the pattern.

What you have there in rubular finds a bunch of matches because just about every character is a match.

Nowhere in your Rubular result does it say that the entire string is a match, though. I'd re-evaluate the results you're seeing there.


A regular expression to match words can be very simple.

You can use

\b\S*\b 

http://rubular.com/r/ljYs1xO1Qh

or simply

\S*

http://rubular.com/r/xgEuGse1lc

depending on your needs
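As a rough sketch of how such a token pattern could be used from Java: because `\S*` can also match the empty string, the example below swaps in `\S+` (an assumption, not the pattern given above) so that every match contains at least one character. The class name `TokenDemo` and the method `tokens` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class TokenDemo {
    // Assumption: \S+ (one or more non-whitespace characters) instead of \S*,
    // so that empty matches are never reported.
    static final Pattern TOKEN = Pattern.compile("\\S+");

    // Collects every whitespace-delimited token found in the input.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Salt, River, Project, (SRP)]
        System.out.println(tokens("Salt River Project (SRP)"));
    }
}
```

Note that this yields individual whitespace-separated tokens, with punctuation such as parentheses still attached, not whole company names.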

2

Matcher#matches returns true only if the whole string matches the regular expression.

As you can see in your online matcher, your regex does not match the whole string; each match is a single character (sometimes a bit more). So your regex matches "P" and "a" and "c" and "i" and so on. You should fix your regex first and then use Matcher#find() and Matcher#group() to get the matching groups.
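The character-by-character behaviour can be reproduced with a small find()/group() loop. This is only an illustrative sketch: the class name `SingleCharMatches`, the method `allMatches`, and the shortened input "Gas & E" are assumptions chosen to keep the output readable; the pattern is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class SingleCharMatches {
    static final Pattern PATTERN = Pattern.compile("(\\w\\s?&?\\s?\\(?\\)?)");

    // Repeatedly calls find() and records group(1) for each successive match.
    static List<String> allMatches(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PATTERN.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Each match is one word character plus optional trailing
        // whitespace, '&', '(' or ')'.
        // prints [G, a, s & , E]
        System.out.println(allMatches("Gas & E"));
    }
}
```

Each element is a separate match, which is why an online tester highlights almost the entire input even though no single match spans a whole name.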

atamanroman
0

If you want to get the matches out of your string, here is what you can try:

final String INPUT = "Pacific Gas & Electric (PG&E), San Diego Gas & Electric (SDG&E), Salt River Project (SRP), Southern California Edison (SCE)";
Pattern pattern = Pattern.compile("(.*?\\([^)]+\\))(?:,\\s*|$)");
Matcher m = pattern.matcher(INPUT);
while (m.find()) {
    System.out.println(m.group(1));
}

Alternatively, you can do INPUT.split("\\s*,\\s*"); if the names don't contain any commas.
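A minimal sketch of the split approach, assuming (as stated above) that no company name contains a comma. The class name `SplitDemo` and the method `names` are hypothetical; `String.split` is the standard library call.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of the original answer.
class SplitDemo {
    static final String INPUT = "Pacific Gas & Electric (PG&E), San Diego Gas & Electric (SDG&E), Salt River Project (SRP), Southern California Edison (SCE)";

    // Splits on a comma, swallowing any whitespace around it.
    static String[] names() {
        return INPUT.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        // Prints one company name per line, four lines in total.
        Arrays.stream(names()).forEach(System.out::println);
    }
}
```

This avoids capture groups entirely, at the cost of breaking if a name ever contains a comma.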

Now, coming to the question "Why is there a mismatch, seems like my understanding is poor here": because Matcher#matches() (like String#matches()) attempts to match the whole string against the pattern.

Sabuj Hassan