
I had a document that looked like the following:

data.txt

100, "some text"
101, "more text"
102, "even more text"

I processed it with a regex and turned each line into an object, as follows:

Stream<String> lines = Files.lines(Paths.get("data.txt"));
Pattern regex = Pattern.compile("(\\d{1,3}),\\s*(.*)");

List<MyClass> result =
  lines.map(regex::matcher)
       .filter(Matcher::find)
       .map(m -> new MyClass(Integer.parseInt(m.group(1)), m.group(2))) // MyClass(int id, String text)
       .collect(Collectors.toList());

This returns a list of MyClass objects, one per line. It can run in parallel and everything works.

The problem is that I now have this:

data2.txt

101, "some text
the text continues in the next line
and maybe in the next"
102, "for a random
number
of lines"
103, "until the new pattern of new id comma appears"

So I somehow need to join the lines read from the stream until a new match appears (something like a buffer?).

I tried to collect the strings first and then collect MyClass objects, but without success, because I cannot actually split streams.

Reduce comes to mind for concatenating lines, but it would concatenate all the lines into a single value; I cannot reduce and still get a new stream of joined lines.

Any ideas how to solve this with Java 8 streams?

Stefan Zobel
Daiquiri
  • It seems to me that you need some kind of primitive parser for your input and you could handle not only line-breaks but also escaping of quotes. – Nándor Előd Fekete Oct 27 '16 at 23:33
  • You only have 1 group in your regex. Also, how do you know if the next line is a new ID or part of the previous string? Do they all have quotes? What if the string contains quotes? You might want to use a CSV parser for this. – shmosel Oct 27 '16 at 23:34
  • Strings may contain quotes, for example: 101, "some "te xt and more " text" 102, "this is the next document". Do I need to somehow buffer or accumulate lines using lambdas? – Daiquiri Oct 27 '16 at 23:37
  • Looks like your input might be a CSV file. Have you considered using a CSV parser? – dnault Oct 27 '16 at 23:41
  • Thank you for the suggestions. I will try to use https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVParser.html – Daiquiri Oct 27 '16 at 23:45
  • See also [CSV API for Java](http://stackoverflow.com/questions/101100/csv-api-for-java?rq=1) (does not produce a `Stream` though) – Didier L Oct 28 '16 at 13:54

1 Answer


This is a job for java.util.Scanner. With the upcoming Java 9, you would write:

List<MyClass> result;
try(Scanner s = new Scanner(Paths.get("data.txt"))) {
    result = s.findAll("(\\d{1,3}),\\s*\"([^\"]*)\"")
        // MyClass(int id, String text)
        .map(m -> new MyClass(Integer.parseInt(m.group(1)), m.group(2)))
        .collect(Collectors.toList());
}
result.forEach(System.out::println);

but since the Stream-producing findAll method does not exist in Java 8, we need a helper method:

private static Stream<MatchResult> matches(Scanner s, String pattern) {
    Pattern compiled=Pattern.compile(pattern);
    return StreamSupport.stream(
        new Spliterators.AbstractSpliterator<MatchResult>(1000,
                         Spliterator.ORDERED|Spliterator.NONNULL) {
        @Override
        public boolean tryAdvance(Consumer<? super MatchResult> action) {
            if(s.findWithinHorizon(compiled, 0)==null) return false;
            action.accept(s.match());
            return true;
        }
    }, false);
}

Replacing findAll with this helper method, we get

List<MyClass> result;
try(Scanner s = new Scanner(Paths.get("data.txt"))) {
    result = matches(s, "(\\d{1,3}),\\s*\"([^\"]*)\"")
        // MyClass(int id, String text)
        .map(m -> new MyClass(Integer.parseInt(m.group(1)), m.group(2)))
        .collect(Collectors.toList());
}
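
To try the Java 8 variant without a file on disk, the pieces above can be assembled into one class. This is only a sketch under some assumptions: the class name MultiLineRecords, the parse method, and the nested MyClass stand-in are made up for illustration (the question only gives the constructor shape MyClass(int id, String text)), and the Scanner reads from a String instead of data.txt. It also passes Long.MAX_VALUE as the size estimate, the conventional value for an unknown size, where the helper above uses 1000.

```java
import java.util.List;
import java.util.Scanner;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class MultiLineRecords {

    // Hypothetical stand-in for the question's MyClass(int id, String text).
    static final class MyClass {
        final int id;
        final String text;

        MyClass(int id, String text) {
            this.id = id;
            this.text = text;
        }

        @Override
        public String toString() {
            return id + ": " + text;
        }
    }

    // Java 8 replacement for Scanner.findAll: one stream element per match.
    // The stream is sequential; each match is consumed before the next search.
    static Stream<MatchResult> matches(Scanner s, String pattern) {
        Pattern compiled = Pattern.compile(pattern);
        return StreamSupport.stream(
            new Spliterators.AbstractSpliterator<MatchResult>(Long.MAX_VALUE,
                    Spliterator.ORDERED | Spliterator.NONNULL) {
                @Override
                public boolean tryAdvance(Consumer<? super MatchResult> action) {
                    // Horizon 0 means "search the rest of the input".
                    if (s.findWithinHorizon(compiled, 0) == null) return false;
                    action.accept(s.match());
                    return true;
                }
            }, false);
    }

    // Parses records of the form: id, "text that may span several lines"
    static List<MyClass> parse(String input) {
        try (Scanner s = new Scanner(input)) {
            return matches(s, "(\\d{1,3}),\\s*\"([^\"]*)\"")
                .map(m -> new MyClass(Integer.parseInt(m.group(1)), m.group(2)))
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        String input = "101, \"some text\n"
            + "the text continues in the next line\n"
            + "and maybe in the next\"\n"
            + "102, \"for a random\nnumber\nof lines\"\n"
            + "103, \"until the new pattern of new id comma appears\"\n";
        parse(input).forEach(System.out::println);
    }
}
```

Because [^"]* matches line breaks as well, a record's text may span any number of lines. The same character class stops at the first quote, though, so text containing embedded quotes (as mentioned in the comments) would need a different pattern or a real CSV parser.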
Holger