6

I have a text file that contains multiple reports. Each report starts with the literal "REPORT ID" followed by a specific value, e.g. ABCD. In the simple case, I want to extract the data of only those reports whose value is ABCD. In the more complex case, I want to extract the data of only those reports whose TAG1 value (2nd line) is 1000375351 and whose report value is ABCD.

I have done it the traditional way; my decideAndExtract(String line) function contains the required logic. But how can I use the Java 9 stream methods takeWhile and dropWhile to deal with it efficiently?

try (Stream<String> lines = Files.lines(filePath)) {
    lines.forEach(this::decideAndExtract);
}

Sample text file data:

REPORT ID: ABCD    
TAG1: 1000375351 PR
DATA1: 7399910002 T
DATA2: 4754400002 B
DATA3     : 1000640
Some Lines Here    
REPORT ID: WXYZ    
TAG1: 1000375351 PR
DATA1: 7399910002 T
DATA2: 4754400002 B
DATA3     : 1000640
Some Lines Here    
REPORT ID: ABCD    
TAG1: 1000375351 PR
DATA1: 7399910002 T
DATA2: 4754400002 B
DATA3     : 1000640
Some Lines Here
Stefan Zobel
Tishy Tash
  • I would split the file into sections (using a regex match for the "REPORT ID" delimiter), and then use a simple `filter` to process only those elements you're interested in. – IronMan Aug 02 '19 at 19:56
  • Isn't `REPORT ID: ABCD TAG1: 1000375351 PR DATA1: 7399910002 T DATA2: 4754400002 B DATA3 : 1000640` worth classifying as an entity. (of course with some parsing over the strings)? – Naman Aug 03 '19 at 03:50
  • What are the other lines? Would there be TAG2 followed by DATA1 and DATA2, then TAG3 followed by DATA1 and DATA2.... It's important to provide a general solution. – WJS Aug 05 '19 at 15:32
  • Streams have two main advantages. 1) Reduced intermediate data structures as containers, and 2) expressing the solution in terms of the statement. Unless you are creating a lot of intermediate data structures (e.g. lists and maps) I doubt that a stream approach would be more efficient. Simply reading in the file and doing conditional processing may be the best way to go. That is essentially what streams do but with more overhead to permit more generalized solutions. – WJS Aug 05 '19 at 15:47
  • @WJS From `REPORT` to `DATA3` the structure is constant. However, the positions of TAG1 and DATA3 are interchangeable. After DATA3 there are some random lines. – Tishy Tash Aug 05 '19 at 16:31

3 Answers

6

It seems to be a common anti-pattern to go for Files.lines, whenever a Stream over a file is needed, regardless of whether processing individual lines is actually needed.

The first tool of your choice, when pattern matching over a file is needed, should be Scanner:

Pattern p = Pattern.compile(
    "REPORT ID: ABCD\\s*\\R"
   +"TAG1\\s*:\\s*(.*?)\\R"
   +"DATA1\\s*:\\s*(.*?)\\R"
   +"DATA2\\s*:\\s*(.*?)\\R"
   +"DATA3\\s*:\\s*(.*?)\\R"); // you can keep this in a static final field

try(Scanner sc = new Scanner(filePath, StandardCharsets.UTF_8);
    Stream<MatchResult> st = sc.findAll(p)) {

    st.forEach(mr -> System.out.println("found tag1: " + mr.group(1)
        + ", data: "+String.join(", ", mr.group(2), mr.group(3), mr.group(4))));
}

It's easy to adapt the pattern, e.g. use

Pattern p = Pattern.compile(
    "REPORT ID: ABCD\\s*\\R"
   +"TAG1: (1000375351 PR)\\R"
   +"DATA1\\s*:\\s*(.*?)\\R"
   +"DATA2\\s*:\\s*(.*?)\\R"
   +"DATA3\\s*:\\s*(.*?)\\R"); // you can keep this in a static final field

as the pattern to fulfill your more complex criteria.

But you could also provide arbitrary filter conditions in the Stream:

Pattern p = Pattern.compile(
    "REPORT ID: (.*?)\\s*\\R"
   +"TAG1: (.*?)\\R"
   +"DATA1\\s*:\\s*(.*?)\\R"
   +"DATA2\\s*:\\s*(.*?)\\R"
   +"DATA3\\s*:\\s*(.*?)\\R"); // you can keep this in a static final field

try(Scanner sc = new Scanner(filePath, StandardCharsets.UTF_8);
    Stream<MatchResult> st = sc.findAll(p)) {

    st.filter(mr -> mr.group(1).equals("ABCD") && mr.group(2).equals("1000375351 PR"))
      .forEach(mr -> System.out.println(
          "found data: " + String.join(", ", mr.group(3), mr.group(4), mr.group(5))));
}

allowing more complex constructs than the equals calls of the example. (Note that the group numbers changed for this example.)

E.g., to support a variable order of the data items after the “REPORT ID”, you can use

Pattern p = Pattern.compile("REPORT ID: (.*?)\\s*\\R(((TAG1|DATA[1-3])\\s*:.*?\\R){4})");
Pattern nl = Pattern.compile("\\R"), sep = Pattern.compile("\\s*:\\s*");

try(Scanner sc = new Scanner(filePath, StandardCharsets.UTF_8);
    Stream<MatchResult> st = sc.findAll(p)) {

    st.filter(mr -> mr.group(1).equals("ABCD"))
      .map(mr -> nl.splitAsStream(mr.group(2))
          .map(s -> sep.split(s, 2))
          .collect(Collectors.toMap(a -> a[0], a -> a[1])))
      .filter(map -> "1000375351 PR".equals(map.get("TAG1")))
      .forEach(map -> System.out.println("found data: " + map));
}

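findAll is available in Java 9, but if you have to support Java 8, you can use the findAll implementation of this answer. Note that the Scanner(Path, Charset) constructor used in the examples requires Java 10; on Java 9, pass the charset name as a String instead, i.e. new Scanner(filePath, "UTF-8").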
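The linked Java 8 implementation is not reproduced here, so the following is only a minimal sketch of one possible fallback, not that implementation. It assumes the whole file fits in memory as a String and collects the matches with a plain Matcher loop. The class name FindAllFallback, the helper findAll, and the in-memory SAMPLE are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindAllFallback {

    // Same shape as the first pattern of this answer: only ABCD reports match.
    static final Pattern ABCD_REPORT = Pattern.compile(
        "REPORT ID: ABCD\\s*\\R"
       +"TAG1\\s*:\\s*(.*?)\\R"
       +"DATA1\\s*:\\s*(.*?)\\R"
       +"DATA2\\s*:\\s*(.*?)\\R"
       +"DATA3\\s*:\\s*(.*?)\\R");

    // In-memory stand-in for the file content (assumption: taken from the question's sample).
    static final String SAMPLE = String.join("\n",
        "REPORT ID: ABCD    ",
        "TAG1: 1000375351 PR",
        "DATA1: 7399910002 T",
        "DATA2: 4754400002 B",
        "DATA3     : 1000640",
        "Some Lines Here    ",
        "REPORT ID: WXYZ    ",
        "TAG1: 1000375351 PR",
        "DATA1: 7399910002 T",
        "DATA2: 4754400002 B",
        "DATA3     : 1000640",
        "Some Lines Here    ",
        "REPORT ID: ABCD    ",
        "TAG1: 1000375351 PR",
        "DATA1: 7399910002 T",
        "DATA2: 4754400002 B",
        "DATA3     : 1000640",
        "Some Lines Here");

    // Hypothetical helper: collects every match of p in text, in order of appearance.
    static List<MatchResult> findAll(Pattern p, CharSequence text) {
        List<MatchResult> results = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            results.add(m.toMatchResult()); // snapshot that is unaffected by later find() calls
        }
        return results;
    }

    public static void main(String[] args) {
        for (MatchResult mr : findAll(ABCD_REPORT, SAMPLE)) {
            System.out.println("found tag1: " + mr.group(1)
                + ", data: " + String.join(", ", mr.group(2), mr.group(3), mr.group(4)));
        }
    }
}
```

With a real file, the text could be obtained via new String(Files.readAllBytes(filePath), StandardCharsets.UTF_8). Unlike Scanner.findAll, this sketch loads the entire file at once, so it is only suitable for files of moderate size.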

Holger
  • Well, I don't see `findAll` in the Stream interface. And I am using java 10. Am I missing something? – WJS Aug 05 '19 at 16:35
  • @WJS it's not in the Stream, but `Scanner`. – Eugene Aug 05 '19 at 17:14
  • Unfortunately, with the [dynamic tag positions](https://stackoverflow.com/questions/57332614/java-9-takewhile-and-dropwhile-to-read-and-skip-certain-lines/57334025#comment101211772_57332614), creating a regex for each format could get very gnarly very quickly, especially if you wanted to extend it any further. – Avi Aug 05 '19 at 19:12
  • @Avi such requirements should not be buried in comments. But anyway, I already showed that you can combine a simpler regex with a more elaborated filter predicate. Then, it’s a trade off between regex complexity and post-filter work. I added an example for dealing with dynamic tag positions. If that regex still looks too complicated, the extreme end would be to use `"REPORT ID: ((.*?)\\R){4}"` as pattern and do any other filtering in the stream operation. But there’s no way around using a tool with genuine multi-line processing support, like `Scanner`. – Holger Aug 06 '19 at 07:04
1

dropWhile and takeWhile don't work the way you expect. On an ordered stream, they keep dropping (dropWhile) or passing on (takeWhile) elements only until the first element that does not satisfy the condition; after that single element, the condition is no longer checked.

If you need to check a condition on all elements and choose only some of them, you should use Stream.filter instead.
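To make the difference concrete, here is a small sketch that applies all three operations to the same three header lines. The class name TakeWhileDemo and the sample input are assumptions made for illustration; it requires Java 9 or later.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class TakeWhileDemo {

    // Hypothetical sample input: the three report headers from the question.
    static Stream<String> sample() {
        return Stream.of("REPORT ID: ABCD", "REPORT ID: WXYZ", "REPORT ID: ABCD");
    }

    static boolean isAbcd(String line) {
        return line.endsWith("ABCD");
    }

    // takeWhile stops at the first non-matching element, so the second ABCD header is lost.
    static List<String> withTakeWhile() {
        return sample().takeWhile(TakeWhileDemo::isAbcd).collect(Collectors.toList());
    }

    // dropWhile discards only the leading matches and then lets everything else through.
    static List<String> withDropWhile() {
        return sample().dropWhile(TakeWhileDemo::isAbcd).collect(Collectors.toList());
    }

    // filter tests every element independently of its position.
    static List<String> withFilter() {
        return sample().filter(TakeWhileDemo::isAbcd).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(withTakeWhile()); // [REPORT ID: ABCD]
        System.out.println(withDropWhile()); // [REPORT ID: WXYZ, REPORT ID: ABCD]
        System.out.println(withFilter());    // [REPORT ID: ABCD, REPORT ID: ABCD]
    }
}
```

Only filter returns both ABCD headers, which is why it is the fitting operation when matching elements are scattered throughout the stream.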

fps
0

You can do the search in two steps:

First, create a list of all reports as a List of String. The code below uses an indicator string to split the report entries.

String newReportIndicator = "=====";
List<String> reports = Arrays.asList(lines
    .reduce("", (a, l) -> {
      return a +
          ((l.startsWith("REPORT ID: ")) ? newReportIndicator : "") +
          l + System.lineSeparator();
    }).split(newReportIndicator));

After that, execute the filtering according to your conditions.

The main filtering step:

List<String> reportsToFind = reports
    .stream().filter(r -> {
      List<String> list = Arrays.asList(r.split(System.lineSeparator()));
      String header = list.get(0).trim();
      return (header.endsWith("ABCD")
          && list.stream().filter(l ->
          l.startsWith("TAG1:") && l.endsWith("1000375351 PR")
      ).count() == 1
      );
    })
    .collect(Collectors.toList());
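As a possible variant of the same two-step idea, the reports could be collected as lists of lines instead of being concatenated into one String and split again, which avoids the repeated string concatenation in reduce. This is only a sketch under that assumption; the class ReportGrouping and its methods group and select are hypothetical names, not part of the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class ReportGrouping {

    // Step 1: a new report starts at each "REPORT ID: " line.
    // Lines that appear before the first header are ignored.
    static List<List<String>> group(List<String> lines) {
        List<List<String>> reports = new ArrayList<>();
        List<String> current = null;
        for (String line : lines) {
            if (line.startsWith("REPORT ID: ")) {
                current = new ArrayList<>();
                reports.add(current);
            }
            if (current != null) {
                current.add(line);
            }
        }
        return reports;
    }

    // Step 2: keep reports whose header ends with id and that contain a matching TAG1 line.
    static List<List<String>> select(List<List<String>> reports, String id, String tag1) {
        return reports.stream()
            .filter(r -> r.get(0).trim().endsWith(id))
            .filter(r -> r.stream()
                .map(String::trim)
                .anyMatch(l -> l.startsWith("TAG1:") && l.endsWith(tag1)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Shortened sample input, assumed for illustration.
        List<String> lines = Arrays.asList(
            "REPORT ID: ABCD    ", "TAG1: 1000375351 PR", "DATA1: 7399910002 T",
            "REPORT ID: WXYZ    ", "TAG1: 1000375351 PR", "DATA1: 7399910002 T");
        select(group(lines), "ABCD", "1000375351 PR").forEach(System.out::println);
    }
}
```

With a real file, the input list could come from Files.readAllLines(filePath), or from the lines stream via collect(Collectors.toList()).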

lczapski