
I am trying to parse standard input, extract every string that matches a specific pattern, count the number of occurrences of each match, and print the results alphabetically. This problem seems like a good fit for the Streams API, but I can't find a concise way to create a stream of matches from a Matcher.

I worked around this problem by implementing an iterator over the matches and wrapping it into a Stream, but the result is not very readable. How can I create a stream of regex matches without introducing additional classes?

public class PatternCounter
{
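    static private class MatcherIterator implements Iterator<String> {
        private final Matcher matcher;
        private boolean ready; // true when a match has been found but not yet returned
        public MatcherIterator(Matcher matcher) {
            this.matcher = matcher;
        }
        public boolean hasNext() {
            // find() advances the matcher, so call it at most once per element
            if (!ready) ready = matcher.find();
            return ready;
        }
        public String next() {
            if (!hasNext()) throw new NoSuchElementException();
            ready = false;
            return matcher.group(0);
        }
    }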

    static public void main(String[] args) throws Throwable {
        Pattern pattern = Pattern.compile("[a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)");

        new TreeMap<String, Long>(new BufferedReader(new InputStreamReader(System.in))
            .lines().map(line -> {
                Matcher matcher = pattern.matcher(line);
                return StreamSupport.stream(
                        Spliterators.spliteratorUnknownSize(new MatcherIterator(matcher), Spliterator.ORDERED), false);
            }).reduce(Stream.empty(), Stream::concat).collect(groupingBy(o -> o, counting()))
        ).forEach((k, v) -> {
            System.out.printf("%s\t%s\n",k,v);
        });
    }
}
Jeffrey Bosboom
Alfredo Diaz
  • In Java 9, there will be a method Matcher.results. See http://download.java.net/jdk9/docs/api/java/util/regex/Matcher.html#results-- – user140547 Dec 14 '15 at 11:44
  • looks like the [Java 9 URI has changed](http://download.java.net/java/jdk9/docs/api/java/util/regex/Matcher.html#results--) – Gary Feb 09 '17 at 19:56

3 Answers


Well, in Java 8, there is Pattern.splitAsStream, which provides a stream of items split by a delimiter pattern, but unfortunately there is no corresponding method for getting a stream of matches.
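To make the difference concrete, here is a minimal sketch of what splitAsStream does offer. The class name SplitDemo and the comma delimiter are made up for this illustration; only Pattern.splitAsStream itself comes from the JDK.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of the question or the answer.
final class SplitDemo {
    // Splits the input around the delimiter pattern; the pattern describes
    // what separates the items, not the items themselves.
    static List<String> split(String input) {
        return Pattern.compile(",\\s*")
                      .splitAsStream(input)
                      .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("a, b,c")); // prints [a, b, c]
    }
}
```

So the pattern has to describe the gaps between the items, which is the opposite of what the question needs.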

If you are going to implement such a Stream, I recommend implementing Spliterator directly rather than implementing and wrapping an Iterator. You may be more familiar with Iterator, but implementing a simple Spliterator is straightforward:

final class MatchItr extends Spliterators.AbstractSpliterator<String> {
    private final Matcher matcher;
    MatchItr(Matcher m) {
        super(m.regionEnd()-m.regionStart(), ORDERED|NONNULL);
        matcher=m;
    }
    public boolean tryAdvance(Consumer<? super String> action) {
        if(!matcher.find()) return false;
        action.accept(matcher.group());
        return true;
    }
}

You may consider overriding forEachRemaining with a straightforward loop, though.
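A sketch of what that override might look like, assuming the same spliterator as above. The names MatchItrWithLoop and MatchItrDemo are made up here so the example compiles on its own.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

// Same idea as MatchItr, plus a bulk-traversal override (hypothetical name).
final class MatchItrWithLoop extends Spliterators.AbstractSpliterator<String> {
    private final Matcher matcher;

    MatchItrWithLoop(Matcher m) {
        // The region length is only a rough size estimate.
        super(m.regionEnd() - m.regionStart(), Spliterator.ORDERED | Spliterator.NONNULL);
        matcher = m;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        if (!matcher.find()) return false;
        action.accept(matcher.group());
        return true;
    }

    @Override
    public void forEachRemaining(Consumer<? super String> action) {
        // Plain loop over all remaining matches, no per-element tryAdvance call.
        while (matcher.find()) action.accept(matcher.group());
    }
}

// Hypothetical helper used only to exercise the spliterator.
final class MatchItrDemo {
    static List<String> words(String input) {
        Matcher m = Pattern.compile("\\w+").matcher(input);
        return StreamSupport.stream(new MatchItrWithLoop(m), false)
                            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(words("foo bar baz")); // prints [foo, bar, baz]
    }
}
```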


If I understand your attempt correctly, the solution should look more like:

Pattern pattern = Pattern.compile(
                 "[a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)");

// System.in also works when input is piped; System.console() returns null in that case
try(BufferedReader br=new BufferedReader(new InputStreamReader(System.in))) {

    br.lines()
      .flatMap(line -> StreamSupport.stream(new MatchItr(pattern.matcher(line)), false))
      .collect(Collectors.groupingBy(o->o, TreeMap::new, Collectors.counting()))
      .forEach((k, v) -> System.out.printf("%s\t%s\n",k,v));
}

Java 9 provides a method Stream<MatchResult> results() directly on the Matcher. But for finding matches within a stream, there’s an even more convenient method on Scanner. With that, the implementation simplifies to

try(Scanner s = new Scanner(System.in)) { // System.console() would be null for piped input
    s.findAll(pattern)
     .collect(Collectors.groupingBy(MatchResult::group,TreeMap::new,Collectors.counting()))
     .forEach((k, v) -> System.out.printf("%s\t%s\n",k,v));
}

This answer contains a back-port of Scanner.findAll that can be used with Java 8.
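For completeness, a minimal sketch of the line-based variant using Matcher.results() (Java 9 or later). The class name ResultsDemo and the method countMatches are made up for this example, and the lines are passed in as a Stream so the sketch does not depend on standard input.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class; requires Java 9+ for Matcher.results().
final class ResultsDemo {
    // Counts every match of the pattern across all lines, sorted by match text.
    static Map<String, Long> countMatches(Pattern pattern, Stream<String> lines) {
        return lines
            .flatMap(line -> pattern.matcher(line).results())
            .collect(Collectors.groupingBy(
                MatchResult::group, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\w+@\\w+");
        countMatches(p, Stream.of("a@b c@d", "a@b"))
            .forEach((k, v) -> System.out.printf("%s\t%s%n", k, v));
    }
}
```

For the original task, the lines would come from a BufferedReader over standard input, as in the Java 8 variant above.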

Holger
  • You can add the NONNULL characteristic as well. I'm not sure if you can add IMMUTABLE or not; the Matcher documentation is not clear if modifying the underlying CharSequence object (which may be StringBuilder) during the match results in defined behavior. – Jeffrey Bosboom Mar 12 '16 at 00:16
  • @Jeffrey: indeed, `NONNULL` is possible, `IMMUTABLE` could be specified if the source is a `String` *and* you have full control over the `Matcher` as the `Matcher`’s properties must not be changed as well (most notably its source), but specifying these flags is not that important as currently, no-one makes use of these flags… – Holger Mar 14 '16 at 09:23
  • "unfortunately no support method for getting a stream of matches." I have _never_ understood this omission. The Java designers must have something against this, but who knows what it is. Splitting is not the same, as empty strings in the beginning of the match array are common. Sigh. – Ray Toal Apr 02 '16 at 05:34
  • @Ray Toal: [there will be](http://download.java.net/jdk9/docs/api/java/util/regex/Matcher.html#results--) in Java 9… – Holger Apr 02 '16 at 07:26
  • @Holger looks like the [Java 9 URI has changed](http://download.java.net/java/jdk9/docs/api/java/util/regex/Matcher.html#results--) – Gary Feb 09 '17 at 19:55
  • @Gary: I integrated it into the answer, so it’s easier to find. Comments can’t be updated after such a long time, unfortunately. – Holger Feb 10 '17 at 11:17
  • Thanks a lot, I find it incredibly clear and useful ! – MMacphail Apr 17 '17 at 09:30
  • @Gary The URL has changed again. They moved it [to here](https://docs.oracle.com/javase/9/docs/api/java/util/regex/Matcher.html#results--). – MC Emperor Jun 06 '19 at 10:49
  • @MCEmperor that’s the one I edited into my answer 1½ years ago… – Holger Jun 06 '19 at 11:17

Going off of Holger's solution, we can support arbitrary match operations (such as getting the nth group) by returning a Stream<MatchResult> and letting the caller map each result as needed. We can also hide the Spliterator as an implementation detail, so that callers can just work with the Stream directly. As a rule of thumb, StreamSupport should be used by library code rather than by users.

public class MatcherStream {
  private MatcherStream() {}

  public static Stream<String> find(Pattern pattern, CharSequence input) {
    return findMatches(pattern, input).map(MatchResult::group);
  }

  public static Stream<MatchResult> findMatches(
      Pattern pattern, CharSequence input) {
    Matcher matcher = pattern.matcher(input);

    Spliterator<MatchResult> spliterator = new Spliterators.AbstractSpliterator<MatchResult>(
        Long.MAX_VALUE, Spliterator.ORDERED|Spliterator.NONNULL) {
      @Override
      public boolean tryAdvance(Consumer<? super MatchResult> action) {
        if(!matcher.find()) return false;
        action.accept(matcher.toMatchResult());
        return true;
      }};

    return StreamSupport.stream(spliterator, false);
  }
}

You can then use it like so:

MatcherStream.find(Pattern.compile("\\w+"), "foo bar baz").forEach(System.out::println);

Or for your specific task (borrowing again from Holger):

// System.in also works when input is piped; System.console() returns null in that case
try(BufferedReader br = new BufferedReader(new InputStreamReader(System.in))) {
  br.lines()
    .flatMap(line -> MatcherStream.find(pattern, line))
    .collect(Collectors.groupingBy(o->o, TreeMap::new, Collectors.counting()))
    .forEach((k, v) -> System.out.printf("%s\t%s\n", k, v));
}
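The point about arbitrary match operations, such as picking the nth group, could be sketched like this. GroupDemo and its domains method are made-up names, and findMatches is repeated in compact form only so the example compiles on its own.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical demo class for illustration only.
final class GroupDemo {
    // Compact copy of MatcherStream.findMatches from above.
    static Stream<MatchResult> findMatches(Pattern pattern, CharSequence input) {
        Matcher matcher = pattern.matcher(input);
        return StreamSupport.stream(
            new Spliterators.AbstractSpliterator<MatchResult>(
                    Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
                @Override
                public boolean tryAdvance(Consumer<? super MatchResult> action) {
                    if (!matcher.find()) return false;
                    // toMatchResult() takes a snapshot that stays valid after the next find()
                    action.accept(matcher.toMatchResult());
                    return true;
                }
            }, false);
    }

    // Extracts capture group 2 (the part after '@') from every match.
    static List<String> domains(String input) {
        Pattern p = Pattern.compile("(\\w+)@(\\w+)");
        return findMatches(p, input)
            .map(r -> r.group(2))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(domains("alice@example bob@test")); // prints [example, test]
    }
}
```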
dimo414

If you want to use a Scanner with regular expressions, you can also turn the matches into a stream of strings with the findWithinHorizon method. Here a Stream.Builder collects the matches, which is convenient inside a conventional while loop.

Here is an example:

private Stream<String> extractRulesFrom(String text, Pattern pattern, int group) {
    Stream.Builder<String> builder = Stream.builder();
    try(Scanner scanner = new Scanner(text)) {
        while (scanner.findWithinHorizon(pattern, 0) != null) {
            builder.accept(scanner.match().group(group));
        }
    }
    return builder.build();
} 
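A possible way to call it, shown as a self-contained sketch. The class name ScannerMatchDemo, the static method extractGroup, and the key=value sample input are made up for this example. Note that the builder collects all matches before the stream is returned, so this variant is eager rather than lazy.

```java
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class; extractGroup mirrors extractRulesFrom above.
final class ScannerMatchDemo {
    static Stream<String> extractGroup(String text, Pattern pattern, int group) {
        Stream.Builder<String> builder = Stream.builder();
        try (Scanner scanner = new Scanner(text)) {
            // A horizon of 0 means the search is not bounded.
            while (scanner.findWithinHorizon(pattern, 0) != null) {
                builder.accept(scanner.match().group(group));
            }
        }
        return builder.build();
    }

    public static void main(String[] args) {
        Pattern keyValue = Pattern.compile("(\\w+)=(\\w+)");
        List<String> values = extractGroup("k1=v1; k2=v2", keyValue, 2)
            .collect(Collectors.toList());
        System.out.println(values); // prints [v1, v2]
    }
}
```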
gil.fernandes