7

I am trying to count the number of matches of a regex pattern with a simple Java 8 lambdas/streams based solution. For example, for this pattern/matcher:

final Pattern pattern = Pattern.compile("\\d+");
final Matcher matcher = pattern.matcher("1,2,3,4");

There is the method splitAsStream, which splits the text on the given pattern instead of matching the pattern. Although it's elegant and preserves immutability, it's not always correct:

// count is 4, correct
final long count = pattern.splitAsStream("1,2,3,4").count();

// count is 0, wrong
final long count = pattern.splitAsStream("1").count();

I also tried (ab)using an IntStream. The problem is that I have to guess how many times I should call matcher.find(), instead of calling it until it returns false.

final long count = IntStream
        .iterate(0, i -> matcher.find() ? 1 : 0)
        .limit(100)
        .sum();

I am familiar with the traditional solution while (matcher.find()) count++; where count is mutable. Is there a simple way to do that with Java 8 lambdas/streams?

Manos Nikolaidis
  • 1
    Try to look into `takeWhile`: http://stackoverflow.com/a/20765715/1743880 – Tunaki Dec 30 '15 at 14:58
  • 3
    Splitting != matching. That's why you're getting odd numbers. You should negate your Pattern in order to retrieve the numbers and get what you want. – Flown Dec 30 '15 at 15:05
  • @Tunaki `takeWhile` looks quite interesting. But it will be available in Java 9 apparently, not Java 8. – Manos Nikolaidis Dec 30 '15 at 15:08
  • @Flown I know what `splitAsStream` does and why it doesn't work the way I use it. I just tried your suggestion to negate the pattern and I was surprised to see correct results both for `"1,2,3,4"` and `"1"`. Would you like to post an answer ? – Manos Nikolaidis Dec 30 '15 at 15:19
  • 3
    In Java-9: `matcher.results().count();` – Tagir Valeev Dec 31 '15 at 05:29
  • @Tagir That would be perfect but it's Java 9. I am stuck with while loops until then as I can't get Flown's solution to work for every case – Manos Nikolaidis Dec 31 '15 at 09:56

5 Answers

4

To use Pattern::splitAsStream properly you have to invert your regex. That means that instead of \\d+ (which splits on every number) you should use \\D+. This gives you every number in your String.

final Pattern pattern = Pattern.compile("\\D+");
// count is 4
long count = pattern.splitAsStream("1,2,3,4").count();
// count is 1
count = pattern.splitAsStream("1").count();
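One case to watch with the inverted pattern: if the input starts with non-digits (for example "a1b2c3d4e"), the split yields an empty leading element, so the raw count is one too high. A minimal sketch that filters out empty tokens follows; the class and method names (`InvertedSplitCount`, `countNumbers`) are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class InvertedSplitCount {
    // Split on runs of non-digits, as in the answer above.
    private static final Pattern NON_DIGITS = Pattern.compile("\\D+");

    static long countNumbers(String input) {
        return NON_DIGITS.splitAsStream(input)
                // Drop the empty element produced when the input starts with a delimiter.
                .filter(token -> !token.isEmpty())
                .count();
    }
}
```

With this filter, `countNumbers("a1b2c3d4e")` and `countNumbers("1,2,3,4")` should both give 4, and an input without any digits should give 0.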
Thorbjørn Ravn Andersen
Flown
3

The rather contrived language in the javadoc of Pattern.splitAsStream is probably to blame.

The stream returned by this method contains each substring of the input sequence that is terminated by another subsequence that matches this pattern or is terminated by the end of the input sequence.

If you print out all of the elements the stream produces for 1,2,3,4 you may be surprised to notice that it is actually returning the commas, not the numbers.

    System.out.println("[" + pattern.splitAsStream("1,2,3,4")
            .collect(Collectors.joining("!")) + "]");

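prints [!,!,!,]. The count is 4 rather than 3 because the match at the very start of the input produces an empty leading substring, which is then followed by the three commas.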

This also explains why "1" gives 0: there is no text around the number, so the split leaves only empty strings, and trailing empty strings are discarded.
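To see the individual elements rather than a joined string, the stream can be collected into a list. This is only a small sketch; `SplitElements` and `elements` are hypothetical names chosen for the example.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, named only for this example.
class SplitElements {
    // Returns the pieces that splitAsStream produces for the given regex and input.
    static List<String> elements(String regex, String input) {
        return Pattern.compile(regex)
                .splitAsStream(input)
                .collect(Collectors.toList());
    }
}
```

For `elements("\\d+", "1,2,3,4")` the list holds four entries: one empty string and three commas. For `elements("\\d+", "1")` the list is empty.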

A quick demo:

private void test(Pattern pattern, String s) {
    System.out.println(s + "-[" + pattern.splitAsStream(s)
            .collect(Collectors.joining("!")) + "]");
}

public void test() {
    final Pattern pattern = Pattern.compile("\\d+");
    test(pattern, "1,2,3,4");
    test(pattern, "a1b2c3d4e");
    test(pattern, "1");
}

prints

1,2,3,4-[!,!,!,]
a1b2c3d4e-[a!b!c!d!e]
1-[]
OldCurmudgeon
  • Thanks. I actually know what `splitAsStream` does and why it doesn't work the way I use it. I still don't know how to count matches. Nevertheless, your answer is quite informative and well written so you get a +1. – Manos Nikolaidis Dec 30 '15 at 15:27
3

You can extend AbstractSpliterator to solve this:

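import java.util.Spliterator;
import java.util.Spliterators.AbstractSpliterator;
import java.util.function.Consumer;
import java.util.regex.Matcher;

// Declare this as a nested class of your own class.
static class SpliterMatcher extends AbstractSpliterator<Integer> {
    private final Matcher m;

    public SpliterMatcher(Matcher m) {
        // The number of matches is unknown up front, so report Long.MAX_VALUE.
        super(Long.MAX_VALUE, Spliterator.NONNULL | Spliterator.IMMUTABLE);
        this.m = m;
    }

    @Override
    public boolean tryAdvance(Consumer<? super Integer> action) {
        // Emit one element per successful find(); stop once no match is left.
        boolean found = m.find();
        if (found)
            action.accept(m.groupCount());
        return found;
    }
}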

final Pattern pattern = Pattern.compile("\\d+");

Matcher matcher = pattern.matcher("1");
long count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 1

matcher = pattern.matcher("1,2,3,4");
count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 4


matcher = pattern.matcher("foobar");
count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 0
anubhava
  • I just tried that and it does produce correct results. It's also very informative! I am not sure it qualifies as a "*simple*" solution! Then I guess, I only have to write `SpliterMatcher` once and reuse it with different matchers. – Manos Nikolaidis Dec 30 '15 at 15:40
  • 1
    There is nothing wrong with creating a new spliterator for each stream; that's what always happens behind the scenes anyway. It's also the straightforward way of implementing a not yet existing kind of stream and in this regard, it *is* simple: it consists of a single class containing one concrete method and a single delegate object. How much simpler can it be? But when you stream over integers instead of `MatchResult`s, it's more efficient to implement `Spliterator.OfInt` instead of `Spliterator` and create an `IntStream`. And to ensure reusability, it should report `ORDERED`…
  • And I recommend overriding `forEachRemaining`, if there is a simple, straight-forward implementation possible (as it is the case here). – Holger Dec 31 '15 at 11:39
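The suggestions in the comments above (report `ORDERED`, override `forEachRemaining`, and stream `MatchResult`s) could be combined roughly as follows. This is a sketch under those assumptions, not the answer's original code; `MatchStreams`, `matches` and `countMatches` are hypothetical names.

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper class, named only for this example.
class MatchStreams {

    // Streams a snapshot of every match the given matcher finds, in input order.
    static Stream<MatchResult> matches(Matcher m) {
        Spliterator<MatchResult> sp = new Spliterators.AbstractSpliterator<MatchResult>(
                Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
            @Override
            public boolean tryAdvance(Consumer<? super MatchResult> action) {
                if (!m.find()) {
                    return false;
                }
                // toMatchResult() copies the current match state.
                action.accept(m.toMatchResult());
                return true;
            }

            @Override
            public void forEachRemaining(Consumer<? super MatchResult> action) {
                // Plain loop for bulk traversal, as recommended in the comment.
                while (m.find()) {
                    action.accept(m.toMatchResult());
                }
            }
        };
        return StreamSupport.stream(sp, false);
    }

    // Counts the matches of the pattern in the input.
    static long countMatches(Pattern p, CharSequence input) {
        return matches(p.matcher(input)).count();
    }
}
```

Usage would look like `MatchStreams.countMatches(Pattern.compile("\\d+"), "1,2,3,4")`. Streaming `MatchResult` instead of an integer also lets the same helper be reused for things like `map(MatchResult::group)`.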
1

In short, you have a stream of strings and a regex pattern: how many of those strings match the pattern?

final String myString = "1,2,3,4";
long count = Arrays.stream(myString.split(","))
      .filter(str -> str.matches("\\d+"))
      .count();

The first line could be replaced by any other source of strings, such as a List<String> and its stream() method.

Am I wrong?

Esta
  • This requires 2 different regex patterns. 1 for the delimiter and 1 to match data. I would like to avoid that. Otherwise it produces correct results. – Manos Nikolaidis Dec 30 '15 at 15:00
0

Java 9

You may use Matcher#results() to get hold of all matches:

Stream<MatchResult> results()
Returns a stream of match results for each subsequence of the input sequence that matches the pattern. The match results occur in the same order as the matching subsequences in the input sequence.
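Applied to the counting problem from the question, a minimal sketch could look like this (it needs Java 9 or later; `ResultsCount` and `countMatches` are hypothetical names used only for illustration):

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class ResultsCount {
    static long countMatches(String regex, String input) {
        // Matcher.results() is available since Java 9.
        return Pattern.compile(regex)
                .matcher(input)
                .results()
                .count();
    }
}
```

For example, `ResultsCount.countMatches("\\d+", "1,2,3,4")` should give 4 and `ResultsCount.countMatches("\\d+", "1")` should give 1.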

Java 8 and lower

Another simple solution based on using a reverse pattern:

String pattern = "\\D+";
System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1

Here, all non-digits are removed from the start and end of the string, and then the string is split on non-digit sequences without reporting any trailing empty elements (since 0 is passed as the limit argument to split).

See this demo:

String pattern = "\\D+";
System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);    // => 1
System.out.println("1,2,3".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);// => 3
System.out.println("hz 1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("1 hz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("xxx 1 223 zzz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);//=>2
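One edge case is worth checking with this approach: when the input contains no digits at all (or is empty), the replaceAll step leaves an empty string, and splitting an empty string returns a one-element array containing "", so the length is 1 rather than 0. A sketch that guards against this follows; `SplitCount` and `countDigitRuns` are hypothetical names for illustration.

```java
// Hypothetical helper class, named only for this example.
class SplitCount {
    static int countDigitRuns(String input) {
        String pattern = "\\D+";
        // Strip non-digits from both ends, as in the demo above.
        String trimmed = input.replaceAll("^" + pattern + "|" + pattern + "$", "");
        // "".split(...) yields a single empty element, so treat that case as zero matches.
        return trimmed.isEmpty() ? 0 : trimmed.split(pattern, 0).length;
    }
}
```

With the guard, `countDigitRuns("foobar")` gives 0 while the inputs from the demo keep their results.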
Community
Wiktor Stribiżew