
This is not a duplicate of "Java String split removed empty values", which deals with the split() method returning a new array. In this case I would like to avoid the array.

I solved this problem with a workaround, which I am posting below as a possible solution to my question.

My goal is to process all lines, including empty strings, as in the following example:

String input = "foo\nbar\n\n\nzul\n\n\n";
Pattern NEWLINE = Pattern.compile("\\R");
int[] count = {1};
NEWLINE
    .splitAsStream(input)
    .forEach(line -> System.out.println(count[0]++ + ": " + line));

which produces:

1: foo
2: bar
3: 
4: 
5: zul

Yet, it is missing:

6: 
7:

How can I include the trailing empty lines?
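For reference, the behaviour can be checked directly. The sketch below only counts elements; the class and method names are hypothetical. It relies on the documented rule that splitAsStream() discards trailing empty strings (as split() does with a limit of 0), whereas a negative limit keeps them.

```java
import java.util.regex.Pattern;

class TrailingEmptyDemo {
    static final Pattern NEWLINE = Pattern.compile("\\R");

    // Number of elements produced by splitAsStream (trailing empty strings are dropped).
    static long streamCount(String input) {
        return NEWLINE.splitAsStream(input).count();
    }

    // Number of elements produced by split with the given limit.
    static int splitCount(String input, int limit) {
        return NEWLINE.split(input, limit).length;
    }

    public static void main(String[] args) {
        String input = "foo\nbar\n\n\nzul\n\n\n";
        System.out.println(streamCount(input));    // 5
        System.out.println(splitCount(input, 0));  // 5
        System.out.println(splitCount(input, -1)); // 8
    }
}
```

Note that the negative limit yields 8 elements rather than 7, because it also reports the empty string after the final line terminator.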

rustyx
Miguel Gamboa

3 Answers


You can use a lookahead, (?=(\\R)), so that the \\R delimiter is not consumed by the split, and then remove it yourself with String.trim(). Note that trim() also strips any other leading and trailing whitespace from each line.

String input = "foo\nbar\n\n\nzul\n\n\n";
Pattern NEWLINE = Pattern.compile("(?=(\\R))");
int[] count = {1};
NEWLINE.splitAsStream(input)
       .map(String::trim)
       .forEach(line -> System.out.println(count[0]++ + ": " + line));

It will, however, produce one extra empty element at the end: the final \n is left over as a segment of its own, which trim() reduces to "".

1: foo
2: bar
3: 
4: 
5: zul
6: 
7: 
8: 
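If indentation inside the lines matters, a variant that strips only the leading line terminator instead of calling trim() may be preferable. This is a minimal sketch, assuming the input uses single-character terminators such as \n (with \r\n the lookahead would also match between the two characters); the class and helper names are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LookaheadSplit {
    // Zero-width split point in front of every line terminator.
    static final Pattern BEFORE_NEWLINE = Pattern.compile("(?=\\R)");
    // The terminator that is left at the start of every segment but the first.
    static final Pattern LEADING_NEWLINE = Pattern.compile("^\\R");

    static List<String> lines(String input) {
        return BEFORE_NEWLINE.splitAsStream(input)
            // Remove only the leading terminator, keeping other whitespace intact.
            .map(s -> LEADING_NEWLINE.matcher(s).replaceFirst(""))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        int[] count = {1};
        lines("foo\n  bar\n\n")
            .forEach(line -> System.out.println(count[0]++ + ": " + line));
    }
}
```

As with the trim() version, the last element is the empty string left over from the final terminator.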
Karol Dowbecki
  • There is an extra line (i.e. 8) which I would like to avoid. Maybe using a `reduce` rather than `forEach`, like `reduce((prev, next) -> .../* use prev */ )`, solves it. Is that reasonable? – Miguel Gamboa Jan 15 '19 at 11:50
  • Your "extra" line 8 is the same as lines 6 and 7, a zero-length match. There is nothing special about line 8 and there is no way to remove it quickly with a `Pattern` as far as I know. You could do `split(regex, 7)` if you know there are `7` entries. – Karol Dowbecki Jan 15 '19 at 11:53
  • Yes, but with the `reduce((prev, next) -> ...` and ignoring `next` we will avoid the last one. – Miguel Gamboa Jan 15 '19 at 11:55
  • `String.split(pattern, -1)` creates a new array, which I mentioned in the OP that I would like to avoid, so I would not include that option in your answer. – Miguel Gamboa Jan 15 '19 at 11:56

As an alternative you can make your own implementation of an equivalent method to splitAsStream(), which includes trailing empty strings and still avoids the instantiation of an array, such as:

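import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

static Stream<String> splitAsStream(Pattern p, CharSequence input) {
    // The number of elements is unknown up front, so SIZED must not be reported.
    Spliterator<String> iter = new Spliterators.AbstractSpliterator<String>(
        Long.MAX_VALUE,
        Spliterator.ORDERED | Spliterator.NONNULL
    ) {
        int index = 0; // start of the segment that has not been emitted yet
        final Matcher m = p.matcher(input);

        @Override
        public boolean tryAdvance(Consumer<? super String> action) {
            while (m.find()) {
                // Skip only a zero-width match at the very beginning of the input.
                if (index != 0 || index != m.start() || m.start() != m.end()) {
                    action.accept(input.subSequence(index, m.start()).toString());
                    index = m.end();
                    return true;
                }
            }
            if (index < input.length()) {
                // Emit the remaining unterminated segment
                action.accept(input.subSequence(index, input.length()).toString());
                index = input.length();
                return true;
            } else {
                return false;
            }
        }
    };
    return StreamSupport.stream(iter, false);
}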
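To try the method on the input from the question, a self-contained harness might look like the sketch below. The method body is repeated so that the example compiles on its own; the class name and the lines() helper are hypothetical, and this copy reports ORDERED | NONNULL rather than SIZED, since the element count is not known in advance.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class SplitKeepingTrailing {
    static Stream<String> splitAsStream(Pattern p, CharSequence input) {
        Spliterator<String> iter = new Spliterators.AbstractSpliterator<String>(
            Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL
        ) {
            int index = 0; // start of the segment not yet emitted
            final Matcher m = p.matcher(input);

            @Override
            public boolean tryAdvance(Consumer<? super String> action) {
                while (m.find()) {
                    // Skip only a zero-width match at the very beginning.
                    if (index != 0 || index != m.start() || m.start() != m.end()) {
                        action.accept(input.subSequence(index, m.start()).toString());
                        index = m.end();
                        return true;
                    }
                }
                if (index < input.length()) {
                    // Emit the remaining unterminated segment.
                    action.accept(input.subSequence(index, input.length()).toString());
                    index = input.length();
                    return true;
                }
                return false;
            }
        };
        return StreamSupport.stream(iter, false);
    }

    // Hypothetical helper: collect the stream so the result is easy to inspect.
    static List<String> lines(String input) {
        return splitAsStream(Pattern.compile("\\R"), input).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        int[] count = {1};
        lines("foo\nbar\n\n\nzul\n\n\n")
            .forEach(line -> System.out.println(count[0]++ + ": " + line));
    }
}
```

For the input from the question this prints lines 1 to 7, with 3, 4, 6 and 7 empty, and no array is allocated for the split itself.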
Miguel Gamboa

Since Java 9 one can use Matcher.results(), which yields a Stream<MatchResult>:

Pattern.compile("(.*)\\R").matcher(input)
    .results()
    .forEach(mr -> System.out.println(count[0]++ + ": " + mr.group(1)));

This only yields lines that end with a line terminator, so the final "line" must be terminated too: for an input such as "....\nabc" the trailing abc is discarded.
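A self-contained sketch of this approach, assuming Java 9 or later; the class and method names are hypothetical. It collects the lines into a list so that both the terminated case and the discarded trailing segment can be checked.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ResultsLines {
    // Group 1 is the line content; the terminator itself is matched but not captured.
    static final Pattern LINE = Pattern.compile("(.*)\\R");

    static List<String> terminatedLines(String input) {
        return LINE.matcher(input)
                   .results()                 // Stream<MatchResult>, Java 9+
                   .map(mr -> mr.group(1))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        int[] count = {1};
        terminatedLines("foo\nbar\n\n\nzul\n\n\n")
            .forEach(line -> System.out.println(count[0]++ + ": " + line));
    }
}
```

With the input from the question this produces seven lines; with "x\nabc" it produces only "x".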

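For that I think the following should work: a second alternative matches a final unterminated line and requires at least one character before the end of input, so that no extra empty match is produced (note that the line is then in group 1 or group 2).

Pattern.compile("(.*)\\R|(.+)\\z").matcher(input)
    .results()
    .map(mr -> mr.group(1) != null ? mr.group(1) : mr.group(2))
    .forEach(line -> System.out.println(count[0]++ + ": " + line));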

A split with -1 and a check on the last entry seems a bit more readable.
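That alternative could be sketched as follows. The class and method names are hypothetical, and unlike the stream-based approaches it does allocate an array, which the question wanted to avoid.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SplitMinusOne {
    static final Pattern NEWLINE = Pattern.compile("\\R");

    static List<String> lines(String input) {
        // A negative limit keeps trailing empty strings.
        String[] parts = NEWLINE.split(input, -1);
        int n = parts.length;
        // The last entry is empty when the input ends with a terminator
        // (or is itself empty), so it is dropped in that case.
        if (n > 0 && parts[n - 1].isEmpty()) {
            n--;
        }
        return Arrays.asList(parts).subList(0, n);
    }

    public static void main(String[] args) {
        int[] count = {1};
        lines("foo\nbar\n\n\nzul\n\n\n")
            .forEach(line -> System.out.println(count[0]++ + ": " + line));
    }
}
```

A final unterminated segment such as the abc in "x\nabc" is kept, because the last entry is then non-empty.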

Joop Eggen
  • Your first approach is correct if you replace `results(...)` by `results().forEach(...)`. The method `results()` is parameter-less – Miguel Gamboa Jan 15 '19 at 13:55