I want to apply more than one regex, as shown below. How can I add them to the flatMap step so that all matching values of each line are collected into a List during a single read of the stream?

static String reTimeStamp="((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";
static String reHostName="host=(\\\")((?:[a-z][a-z\\.\\d\\-]+)\\.(?:[a-z][a-z\\-]+))(?![\\w\\.])(\\\")";
static String reServiceTime="service=(\\d+)ms";

private static final PatternStreamer quoteRegex1 = new PatternStreamer(reTimeStamp);
private static final PatternStreamer quoteRegex2 = new PatternStreamer(reHostName);
private static final PatternStreamer quoteRegex3 = new PatternStreamer(reServiceTime);


public static void main(String[] args) throws Exception {
    String inFileName = "Sample.log";
    String outFileName = "Sample_output.log";
    try (Stream<String> stream = Files.lines(Paths.get(inFileName))) {
        //stream.forEach(System.out::println);
        List<String> timeStamp = stream.flatMap(quoteRegex1::results)
                                    .map(r -> r.group(1))
                                    .collect(Collectors.toList());

        timeStamp.forEach(System.out::println);
        //Files.write(Paths.get(outFileName), dataSet);
    }
}

This question is an extension of "Match a pattern and write the stream to a file using Java 8 Stream".

Holger
Shan

1 Answer

You can simply concatenate the streams:

String inFileName = "Sample.log";
String outFileName = "Sample_output.log";
try (Stream<String> stream = Files.lines(Paths.get(inFileName))) {
    List<String> timeStamp = stream
        .flatMap(s -> Stream.concat(quoteRegex1.results(s),
                        Stream.concat(quoteRegex2.results(s), quoteRegex3.results(s))))
        .map(r -> r.group(1))
        .collect(Collectors.toList());

    timeStamp.forEach(System.out::println);
    //Files.write(Paths.get(outFileName), dataSet);
}

but note that this will perform three individual searches through each line, which might not only imply lower performance, but also that the order of the matches within one line will not reflect their actual occurrence. It doesn’t seem to be an issue with your patterns, but individual searches even imply possible overlapping matches.
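The ordering effect can be reproduced without PatternStreamer. The following is a minimal sketch, assuming two simplified stand-in patterns (TIME and SERVICE are hypothetical, not the patterns from the question). It collects the matches pattern by pattern, the way the concatenated streams do:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConcatOrderDemo {
    // Simplified stand-ins for the question's patterns (assumptions for illustration).
    static final Pattern TIME = Pattern.compile("(\\d{2}:\\d{2}:\\d{2})");
    static final Pattern SERVICE = Pattern.compile("service=(\\d+)ms");

    // Appends group 1 of every match of p in line to out.
    static void collect(Pattern p, String line, List<String> out) {
        Matcher m = p.matcher(line);
        while (m.find()) out.add(m.group(1));
    }

    // Mimics the concatenation: all matches of TIME first, then all matches of SERVICE.
    public static List<String> perPatternOrder(String line) {
        List<String> out = new ArrayList<>();
        collect(TIME, line, out);
        collect(SERVICE, line, out);
        return out;
    }

    public static void main(String[] args) {
        // "service=12ms" occurs first in the line, yet the time stamp is reported first.
        System.out.println(perPatternOrder("service=12ms at 10:00:00")); // [10:00:00, 12]
    }
}
```

The result order follows the order of the patterns, not the positions of the matches within the line.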

The PatternStreamer of that linked answer also eagerly collects all matches of one string into an ArrayList before creating a stream. A Spliterator-based solution like in this answer is preferable.
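To illustrate the difference, here is a rough sketch of a lazy single-pattern stream. It is not the code of the linked answer; the class name LazyMatches and its method of are made up for this example. Each match is searched for only when the stream asks for the next element, so no intermediate list is built. (On Java 9 and later, Matcher.results() provides a comparable stream out of the box.)

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class LazyMatches {
    // Returns a stream that calls Matcher.find() only when the next element is requested.
    public static Stream<MatchResult> of(Pattern pattern, CharSequence input) {
        Matcher m = pattern.matcher(input);
        return StreamSupport.stream(
            new Spliterators.AbstractSpliterator<MatchResult>(
                    input.length(), Spliterator.ORDERED | Spliterator.NONNULL) {
                @Override
                public boolean tryAdvance(Consumer<? super MatchResult> action) {
                    if (!m.find()) return false;
                    action.accept(m.toMatchResult()); // immutable snapshot of the current match
                    return true;
                }
            }, false);
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("(\\d+)");
        // limit(2) stops the search after the second match; "333" is never looked for.
        LazyMatches.of(digits, "a1 b22 c333")
            .map(r -> r.group(1))
            .limit(2)
            .forEach(System.out::println);
    }
}
```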

Since numerical group references preclude simply combining the patterns in a (pattern1|pattern2|pattern3) manner, truly streaming over the matches of multiple different patterns is a bit more elaborate:
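The group numbering problem can be seen in a small sketch, again assuming two simplified patterns (SERVICE and HOST are hypothetical stand-ins). On its own, each pattern exposes its value as group 1, but after joining them with |, the group of the second pattern becomes group 2, so a uniform group(1) lookup no longer works:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CombinedGroupsDemo {
    // Two simplified patterns (assumptions); each uses group 1 when compiled on its own.
    static final String SERVICE = "service=(\\d+)ms";
    static final String HOST = "host=(\\w+)";

    // Naive combination: the capturing group of HOST is now group 2.
    static final Pattern COMBINED = Pattern.compile(SERVICE + "|" + HOST);

    // Collects group(1) of every match of the combined pattern, as the stream code would.
    public static List<String> groupOneOfEachMatch(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = COMBINED.matcher(line);
        while (m.find()) out.add(m.group(1));
        return out;
    }

    public static void main(String[] args) {
        // The host match contributes null, because its value sits in group 2.
        System.out.println(groupOneOfEachMatch("host=web1 service=12ms")); // [null, 12]
    }
}
```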

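import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public final class MultiPatternSpliterator
extends Spliterators.AbstractSpliterator<MatchResult> {
    // Convenience overload; compiles the patterns on every call.
    public static Stream<MatchResult> matches(String input, String... patterns) {
        return matches(input, Arrays.stream(patterns)
                .map(Pattern::compile).toArray(Pattern[]::new));
    }
    public static Stream<MatchResult> matches(String input, Pattern... patterns) {
        return StreamSupport.stream(new MultiPatternSpliterator(patterns,input), false);
    }
    private final Pattern[] pattern;
    private final String input;
    private int pos; // end of the last reported match
    private PriorityQueue<Matcher> pendingMatches;

    MultiPatternSpliterator(Pattern[] p, String inputString) {
        super(inputString.length(), ORDERED|NONNULL);
        pattern = p;
        input = inputString;
    }

    @Override
    public boolean tryAdvance(Consumer<? super MatchResult> action) {
        if(pendingMatches == null) {
            // First call: one matcher per pattern, ordered by the start of its current match
            pendingMatches = new PriorityQueue<>(
                pattern.length, Comparator.comparingInt(MatchResult::start));
            for(Pattern p: pattern) {
                Matcher m = p.matcher(input);
                if(m.find()) pendingMatches.add(m);
            }
        }
        MatchResult mr = null;
        do {
            Matcher m = pendingMatches.poll();
            if(m == null) return false;
            // Report the match only if it does not overlap the previously reported one
            if(m.start() >= pos) {
                mr = m.toMatchResult();
                pos = mr.end();
            }
            // Continue this matcher's search behind the last reported match
            if(m.region(pos, m.regionEnd()).find()) pendingMatches.add(m);
        } while(mr == null);
        action.accept(mr);
        return true;
    }
}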

This facility allows matching multiple patterns in a (pattern1|pattern2|pattern3) fashion while still retaining the original groups of each pattern. So when searching for hell and llo in hello, it will find hell and not llo. One difference is that there is no guaranteed order if more than one pattern matches at the same position.
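For comparison, the same non-overlapping behavior can be observed with a plain alternation, while separate searches report both matches. A small sketch (the helper findAll is made up for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlapDemo {
    // Returns the text of every match of p in input, in the order find() reports them.
    public static List<String> findAll(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) out.add(m.group());
        return out;
    }

    public static void main(String[] args) {
        // One search over the alternation: "llo" overlaps the earlier "hell" and is skipped.
        System.out.println(findAll(Pattern.compile("hell|llo"), "hello")); // [hell]
        // Separate searches: each pattern matches on its own, so the results overlap.
        System.out.println(findAll(Pattern.compile("hell"), "hello"));     // [hell]
        System.out.println(findAll(Pattern.compile("llo"), "hello"));      // [llo]
    }
}
```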

This can be used like

Pattern[] p = Stream.of(reTimeStamp, reHostName, reServiceTime)
        .map(Pattern::compile)
        .toArray(Pattern[]::new);
try (Stream<String> stream = Files.lines(Paths.get(inFileName))) {
    List<String> timeStamp = stream
        .flatMap(s -> MultiPatternSpliterator.matches(s, p))
        .map(r -> r.group(1))
        .collect(Collectors.toList());

    timeStamp.forEach(System.out::println);
    //Files.write(Paths.get(outFileName), dataSet);
}

While the overloaded method would allow calling MultiPatternSpliterator.matches(s, reTimeStamp, reHostName, reServiceTime) with the pattern strings to create a stream, this should be avoided within a flatMap operation, because it would recompile every regex for every input line. That’s why the code above compiles all patterns into an array first. Your original code does the same by instantiating the PatternStreamers outside the stream operation.

Holger
  • Nice explanation. – holi-java Sep 11 '17 at 13:56
  • I also noticed strange behaviour when reading a big file (5GB). If I match only 2 patterns (e.g., Stream.of(reTimeStamp, reHostName)), the stream reads the whole file flawlessly within 10 minutes and prints the output. The moment I add the 3rd pattern, as in Stream.of(reTimeStamp, reHostName, reServiceTime), and run again on the same file, the Java process just hangs, keeping the file in memory forever (monitored through VisualVM) without crashing with any error. It was the same with Stream.concat(regex1,regex2), which works, and Stream.concat(regex1,regex2,regex3), where the Java process hangs. – Shan Sep 13 '17 at 08:18
  • That might depend on the actual regex pattern more than the number of patterns, but anyway it seems to be worth [asking a new question](https://stackoverflow.com/questions/ask)… – Holger Sep 13 '17 at 08:22
  • I ran the code with the heap space increased from 5G to -Xmx10G -Xms10G; with this I am able to parse the big file with all the regexes. I will use this as a temporary solution, but any better solution that handles this in the code itself and is optimal in performance would be welcome. Thanks @Holger for helping through. – Shan Sep 13 '17 at 23:22