I'm trying to read a huge file, extract the text within "quotes" from each line, collect the extracted text into a set, and write the contents of the set to a file, all using Java 8 streams.

public class DataMiner {

    private static final Pattern quoteRegex = Pattern.compile("\"([^\"]*)\"");

    public static void main(String[] args) {

        String fileName = "c://exec.log";
        try (Stream<String> stream = Files.lines(Paths.get(fileName))) {
            Set<String> dataSet = stream
                    // How do I perform the pattern match here?
                    .collect(Collectors.toSet());
            Files.write(Paths.get(fileName), dataSet);

        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}

Please help me. Thanks!

EDIT: Answers to the questions in the comments:

  1. No, there are no multiple quoted texts.
  2. I could have used a simple loop, but I want to use Java 8 streams.
Damien-Amen
  • Use [`.map(...)`](https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#map-java.util.function.Function-) – khelwood May 25 '16 at 16:06
  • 1) Can quoted text span multiple lines? 2) If a line has `abc "def" ghi "jkl" mno`, what should be collected? – Andreas May 25 '16 at 16:07
  • The docs are always a good place to start. E.g. [`Stream#map(Function)`](http://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#map-java.util.function.Function-) –  May 25 '16 at 16:12
  • @khelwood I am not sure this will work for what he is asking. Let's assume he uses `.map(...)`: he would be able to split the strings, but the results would be grouped in an array or some other structure, whereas the stream he is working on expects a string. Do you have an example of how he would do this? – Mr00Anderson May 25 '16 at 16:16
  • Any reason you want to use Java 8 streams instead of a simple loop? Also, what is the point of creating a temporary Set that stores all the results? You could write each result to the file directly as it is found (assuming it is not the same file you are reading from). – Pshemo May 25 '16 at 16:16
  • @Pshemo this is exactly what I was thinking. He could solve the same problem in a couple of lines. He might even be able to just use `String.split(...)` and `Arrays.asList(arrayOfStrings);` – Mr00Anderson May 25 '16 at 16:18
  • Just show me your sample exec.log file – Noor Nawaz May 25 '16 at 16:27
  • @Underbalanced I had in mind something like `.map(str -> quoteRegex.matcher(str).find().group(1))` but obviously it depends on the undisclosed details of his/her vague requirements. – khelwood May 26 '16 at 08:46
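
For the case described in the question's edit (at most one quoted text per line), the `.map(...)` idea from the comments can be sketched roughly as follows. Note that `Matcher.find()` returns a `boolean`, so `group(1)` has to be called on the matcher itself after a successful `find()`. The class and method names here (`FirstQuoteExample`, `firstQuoted`, `collectQuoted`) are made up for illustration, and the sketch assumes that lines without a quoted text should simply be skipped.

```java
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, not part of the original question or answer.
class FirstQuoteExample {
    private static final Pattern QUOTE = Pattern.compile("\"([^\"]*)\"");

    // Returns the first quoted text on the line, or an empty Optional if there is none.
    static Optional<String> firstQuoted(String line) {
        Matcher m = QUOTE.matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Maps each line to its first quoted text and drops lines without one.
    static Set<String> collectQuoted(Stream<String> lines) {
        return lines.map(FirstQuoteExample::firstQuoted)
                    .filter(Optional::isPresent)
                    .map(Optional::get)
                    .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // In-memory lines stand in for Files.lines(...) here.
        Stream<String> lines = Stream.of("a \"one\" b", "no quotes", "c \"two\" d");
        System.out.println(collectQuoted(lines));
    }
}
```

This only picks up the first match per line; any further quoted text on the same line is ignored.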

1 Answer

Unfortunately, the Java 8 regular expression classes don't provide a stream of match results. There is only Pattern.splitAsStream(), which streams the text between the matches rather than the matches themselves, so it is not what you want here.

Note: It has been added in Java 9 as Matcher.results().
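
As a rough sketch of what that looks like, assuming Java 9 or later is available: the class and method names (`Java9Results`, `collectQuoted`) are invented for this example, while the regex is the one from the question.

```java
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical example class; requires Java 9+ for Matcher.results().
class Java9Results {
    private static final Pattern QUOTE = Pattern.compile("\"([^\"]*)\"");

    // Collects every quoted text from every line into a set.
    static Set<String> collectQuoted(Stream<String> lines) {
        return lines.flatMap(line -> QUOTE.matcher(line).results()) // one MatchResult per match
                    .map(r -> r.group(1))                           // text between the quotes
                    .collect(Collectors.toSet());
    }
}
```

On Java 8 this method is not available.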

You can, however, create a generic helper class for it yourself:

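import java.util.ArrayList;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Streams all matches of a regex in an input (Java 8 stand-in for Matcher.results())
public final class PatternStreamer {
    private final Pattern pattern;
    public PatternStreamer(String regex) {
        this.pattern = Pattern.compile(regex);
    }
    public Stream<MatchResult> results(CharSequence input) {
        List<MatchResult> list = new ArrayList<>();
        // Snapshot each match, because the Matcher itself is mutable
        for (Matcher m = this.pattern.matcher(input); m.find(); )
            list.add(m.toMatchResult());
        return list.stream();
    }
}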

Then your code becomes easy by using flatMap():

private static final PatternStreamer quoteRegex = new PatternStreamer("\"([^\"]*)\"");
public static void main(String[] args) throws Exception {
    String inFileName = "c:\\exec.log";
    String outFileName = "c:\\exec_quoted.txt";
    try (Stream<String> stream = Files.lines(Paths.get(inFileName))) {
        Set<String> dataSet = stream.flatMap(quoteRegex::results)
                                    .map(r -> r.group(1))
                                    .collect(Collectors.toSet());
        Files.write(Paths.get(outFileName), dataSet);
    }
}

Since you only process one line at a time, the temporary List is fine. If the input string is very long and contains a lot of matches, a Spliterator would be a better choice, because it can produce the matches lazily instead of collecting them all up front. See "How do I create a Stream of regex matches?"
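
A minimal sketch of the Spliterator approach, built on `Spliterators.AbstractSpliterator` from the JDK. This is one possible way to do it, not necessarily the one shown in the linked question; the names `LazyMatchStream` and `results` are made up for this example.

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper: streams regex matches lazily, without a temporary List.
class LazyMatchStream {
    static Stream<MatchResult> results(Pattern pattern, CharSequence input) {
        Matcher m = pattern.matcher(input);
        // The number of matches is unknown up front, so report Long.MAX_VALUE as the estimate.
        Spliterator<MatchResult> sp = new Spliterators.AbstractSpliterator<MatchResult>(
                Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
            @Override
            public boolean tryAdvance(Consumer<? super MatchResult> action) {
                if (!m.find()) {
                    return false;                 // no more matches: end of stream
                }
                action.accept(m.toMatchResult()); // immutable snapshot of the current match
                return true;
            }
        };
        return StreamSupport.stream(sp, false);   // sequential; the Matcher is not thread-safe
    }
}
```

With this variant, the next match is only searched for when the stream asks for another element, so a short-circuiting operation such as `findFirst()` stops scanning after the first match.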

Andreas
  • Perfect! Exactly what I wanted to know/learn. Thanks a lot @Andreas – Damien-Amen May 25 '16 at 17:40
  • That is pretty cool that this was added in Java 9. Must have overlooked it when I was looking at the changes. – Mr00Anderson May 26 '16 at 12:12
  • If you create a stream via a temporary data structure instead of on-the-fly, like in [this answer](https://stackoverflow.com/a/28150956/2711488), it’s recommended to use [`Stream.Builder`](https://docs.oracle.com/javase/8/docs/api/?java/util/stream/Stream.Builder.html) (see also [`Stream.builder()`](https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#builder--)) instead of `ArrayList`, as this builder is especially optimized for this use case. – Holger Sep 11 '17 at 11:38
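
The `Stream.Builder` suggestion from the last comment could look roughly like this. It is a sketch that mirrors the answer's helper class; the name `BuilderPatternStreamer` is invented here to keep it apart from the original `PatternStreamer`.

```java
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical variant of the answer's helper, using Stream.Builder instead of ArrayList.
final class BuilderPatternStreamer {
    private final Pattern pattern;

    BuilderPatternStreamer(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    Stream<MatchResult> results(CharSequence input) {
        Stream.Builder<MatchResult> builder = Stream.builder();
        for (Matcher m = this.pattern.matcher(input); m.find(); ) {
            builder.add(m.toMatchResult()); // snapshot of the current match
        }
        return builder.build();             // the builder cannot be reused after build()
    }
}
```

It can be used with `flatMap()` in the same way as the answer's class, e.g. `stream.flatMap(quoteRegex::results)`.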