8

I am attempting to filter/reduce a stream of data that has some duplicated entries in it.

In essence, I attempting to find a better solution to filtering a set of data than what I implemented. We have data that, at its base, are something like this:

Action | Date         | Detail
15     | 2016-03-15   | 
5      | 2016-03-15   | D1
5      | 2016-09-25   | D2      <--
5      | 2016-09-25   | D3      <-- same day, different detail
4      | 2017-02-08   | D4
4      | 2017-02-08   | D5
5      | 2017-03-01   | D6      <--
5      | 2017-03-05   | D6      <-- different day, same detail; need earliest
5      | 2017-03-08   | D7
5      | 2017-03-10   | D8
...

I need to extract the details such that:

  • Only action 5 is selected
  • If a detail is the same (e.g, D6 appears twice on different days), the earliest date is selected

These data are loaded into Objects (one instance for each "record"), and there are other fields on the Object but they are not relevant for this filtering. The Detail is stored as a String, the Date as a ZonedDateTime, and the Action is an int (well, actually an enum, but here shown as an int). The Objects are given in a List<Entry> in chronological order.

I was able to get a working, but what I consider to be suboptimal, solution by doing:

  List<Entry> entries = getEntries(); // retrieved from a server

  final Set<String> update = new HashSet<>();
  List<Entry> updates =
  entries.stream()
    .filter(e -> e.getType() == 5)
    .filter(e -> pass(e, update))
    .collect(Collectors.toList());


private boolean pass(Entry ehe, Set<String> update)
   {
     final String val =  ehe.getDetail();
     if (update.contains(val)) { return false; }
     update.add(val);
     return true;
   }

But the issue is I had to use this pass() method and in it checking a Set<String> to maintain whether a given Detail had alreay been processed. While this approach works, it seems like it should be possible to avoid an external reference.

I tried to use a groupingBy on the Detail, and it would allow extracting the earliest entry from the list, the problem was I no longer had a date ordering and I had to process the resultant Map<String,List<Entry>>.

It seems like some reduce operation (if I used that term correctly) here without the use of the pass() method should be possible, but I am struggling to get a better implementation.

What would be a better approach such that the .filter(e -> pass(e, update)) could be removed?

Thank you!

Alexis C.
  • 91,686
  • 21
  • 171
  • 177
KevinO
  • 4,303
  • 4
  • 27
  • 36
  • It’s so close to being a duplicate of [Java Stream: get latest version of user records](http://stackoverflow.com/questions/42671078/java-stream-get-latest-version-of-user-records).See if you can solve it using [my answer there](http://stackoverflow.com/a/42671696/5772882). Or one of the other answers. – Ole V.V. May 04 '17 at 11:33
  • @OleV.V., I will take a look. I didn't see this Q&A when I did some searching. – KevinO May 04 '17 at 11:36
  • 6
    While using this `pass` method in Streams is indeed discouraged (and you have an answer with a better solution), just a general note for dealing with `Set`s: `Set.add` is already defined as “add if not present” and it will return whether the value has been added, hence, instead of `if (update.contains(val)) { return false; } update.add(val); return true;`, performing two hash lookups, you can just use `return update.add(val);`, being shorter and more efficient. – Holger May 04 '17 at 12:34
  • @Holger, good point about the `Set` and the `update.add()`. Clearly an issue of my focusing on the stream issue without thinking clearly about the `Set`. I appreciate the detailed critique! – KevinO May 04 '17 at 12:38

4 Answers4

10

Two solutions in this answer of which the second is significantly faster.

Solution 1

An adaptation of the answer by Ole V.V. on another question:

Collection<Entry> result = 
 entries.stream().filter(e -> e.getAction() == 5)
  .collect(Collectors.groupingBy(Entry::getDetail, Collectors.collectingAndThen(Collectors.minBy(Comparator.comparing(Entry::getDate)), Optional::get)))
  .values();

With your example dataset you end up with (I picked GMT+0 as timezone):

Entry [action=5, date=2017-03-01T00:00Z[GMT], detail=D6]
Entry [action=5, date=2017-03-08T00:00Z[GMT], detail=D7]
Entry [action=5, date=2017-03-10T00:00Z[GMT], detail=D8]
Entry [action=5, date=2016-03-15T00:00Z[GMT], detail=D1]
Entry [action=5, date=2016-09-25T00:00Z[GMT], detail=D2]
Entry [action=5, date=2016-09-25T00:00Z[GMT], detail=D3]

If you insist on getting a List back:

List<Entry> result = new ArrayList<>(entries.stream() ..... .values());

If you want to get your original order back use the 3-parameter groupingBy:

...groupingBy(Entry::getDetail, LinkedHashMap::new, Collectors.collectingAndThen(...))

Solution 2

Using toMap, which is easier to read and is faster (see comment by holi-java on this answer, and next 'section'):

List<Entry> col = new ArrayList<>(
  entries.stream().filter(e -> e.getAction() == 5)
  .collect(Collectors.toMap(Entry::getDetail, Function.identity(), (a,b) -> a.getDate().compareTo(b.getDate()) >= 0 ? b : a))
  .values());

where (a,b) -> a.getDate().compareTo(b.getDate()) >= 0 ? b : a can be substituted with:

BinaryOperator.minBy(Comparator.comparing(Entry::getDate))

If you want to get your original order back in this solution use the 4-parameter toMap:

...toMap(Entry::getDetail, Function.identity(), (a,b) -> a.getDate().compareTo(b.getDate()) >= 0 ? b : a, LinkedHashMap::new)

Performance

With the testdata I created for testing my solutions, I checked the runtime of both solutions. The first solution takes on average 67 ms (ran it only 20 times, so don't trust the numbers!), the second solution took 2 ms on average. If anyone would like to do a proper performance comparison, put the results in the comments and I'll add it here.

Community
  • 1
  • 1
Robin Topper
  • 2,295
  • 1
  • 17
  • 25
  • 2
    awesome! much better then my answer collecting first to a `Map`, and you are not relying on order (which might not be). one plus – Eugene May 04 '17 at 12:13
  • 1
    Thanks for reusing my answer from the other question, that’s great. If you are using a `java.time` class for the date, I recommend `LocalDate` over `ZonedDateTime` for this task. – Ole V.V. May 04 '17 at 14:51
  • 5
    No sarcasm here, I am serious. I enjoy when something I’ve written seems to be of help to someone, that’s probably the main reason I’m here. I think that my contribution and yours combined into a higher value. – Ole V.V. May 04 '17 at 15:05
  • @Eugene Hi, Eugene. In fact, your way is better, but you solve the problem with the wrong order. and maybe you forget the `minBy` method of `BinaryOperator`. – holi-java May 10 '17 at 18:48
  • @holi-java Since your proposed solution was already in my answer, I've moved some pieces around. Is the second solution faster? It is definitely easier to read. – Robin Topper May 10 '17 at 18:51
  • @RobinTopper Hi, the `toMap` method is more faster & simper than `groupingBy` method. and there is a `minBy` method in `BinaryOperator` that @Eugene maybe has forgot in his answer. – holi-java May 10 '17 at 18:52
  • @holi-java without tests, *saying* that something is faster( or slower) will not make it as such. Do you have any proof for that? – Eugene May 10 '17 at 19:34
  • @Eugene Hi, I think it needn't any test for test performance since `groupingBy` needs more transformation than `toMap` that we can see. – holi-java May 10 '17 at 19:42
  • @holi-java that is not a legitimate argument. How about writing some `jmh` tests for example. Until the, sorry, Im not convinced at all... – Eugene May 10 '17 at 19:44
  • @Eugene In fact, both `groupingBy` and `toMap` are using `CollectorImpl`. so the performance estimation is their process routes. – holi-java May 10 '17 at 19:47
  • @Eugene and holi-java Just added a really rough estimate of the difference in performance. The second solution is much faster indeed – Robin Topper May 10 '17 at 19:49
  • @RobinTopper `20 times` is proof of nothing still... http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java – Eugene May 10 '17 at 21:11
  • @Eugene thanks for your feedback. and in turn ask yourself, how do you proof the performance `toMap` is not faster than `groupingBy` in this answer? after discussion, I have wrote some tests which has proof `toMap` is more than 3 times faster than `groupingBy`. and `groupingBy` will using more memory size than `toMap`. – holi-java May 10 '17 at 22:28
3

If I understood correctly...

 List<Entry> result = list.stream().collect(Collectors.toMap(
            Entry::getDetail,
            Function.identity(),
            (left, right) -> {
                return left.getDate().compareTo(right.getDate()) > 0 ? right : left;
            }, LinkedHashMap::new))
            .values()
            .stream()
            .filter(e -> e.getAction() == 5)
            .collect(Collectors.toList());
Eugene
  • 117,005
  • 15
  • 201
  • 306
  • I might have not implemented this suggestion correctly, but as written it seems to lose the date order by using the `HashMap`. I can get the order back either by using a `LinkedHashMap` (as suggested by Manos in another answer) or by adding a `.sorted(Comparator.comparing(Entry::getDate)`. Did I misunderstand the implementation? – KevinO May 04 '17 at 12:23
  • 1
    @KevinO u did not miss anything, I indeed had a type, it should really be `LinkedHashMap` – Eugene May 04 '17 at 12:51
2

The stream interface provides the distinct method for that purpose. It will sort out duplicates based on equals().

Therefore one option would be, to implement your Entry's equals* method accordingly or the other would be to define a Wrapper class, that checks equality based on a specific criteria (i.e. getDetail())

class Wrapper {
   final Entity entity;
   Wrapper(Entity entity){
     this.entity = entity;
   }
   Entity getEntity(){
      return this.entity;
   }
   public boolean equals(Object o){
       if(o instanceof Entity) {
           return entity.getDetail().equals(((Wrapper) o).getEntity().getDetail());
       }
       return false;
   }
    public int hashCode() {

        return entity != null ? entity.getDetail().hashCode() : 0;
    }
}

And than wrap, distinct and unmap your entities:

entries.stream()
       .map(Wrapper::new)
       .distinct()
       .map(Wrapper::getEntity)
       .collect(Collectors.toList());

If the stream is ordered, always the first matchin entry is used. The stream of a list is always ordered.

*) I tried it first without implementing hashCode() and that fails. The reason is, the internals of java.util.stream.DistinctOps uses a HashSet to keep track of elements that have already been processed and it checks for contains, which relies on the hashCode as well as the equals method. So just implementing equals is not enough.

Gerald Mücke
  • 10,724
  • 2
  • 50
  • 67
  • Are you sure to get the *earliest* entry for each detail this way? Doesn’t seem obvious to me. – Ole V.V. May 04 '17 at 11:45
  • 2
    according to doc: `For ordered streams, the selection of distinct elements is stable (for duplicated elements, the element appearing first in the encounter order is preserved.) For unordered streams, no stability guarantees are made` – Gerald Mücke May 04 '17 at 11:45
  • 1
    Thanks. I trust the code to work. I still don’t like that your `equals()` declares objects equal that really are different, though. – Ole V.V. May 04 '17 at 11:47
  • 1
    @GeraldMücke `first` in encounter order `might not be` the one you actually want... – Eugene May 04 '17 at 11:47
  • 1
    The asker takes the first in encounter order in the code that he says is working, so it’s probably good enough. I too would probably prefer taking the earliest date. – Ole V.V. May 04 '17 at 11:49
2

You can create a LinkedHashMap with groupingBy that will preserve the insertion order unlike HashMap. You are saying that the list is already in chronlogical order so preserving the order is enough. It is then straightforward to aggregate the lists in the values of this map. E.g (add static imports):

List<Entry> selected = objs.stream()
        .filter(e -> e.getType() == 5)
        .collect(groupingBy(Entry::getDetail, LinkedHashMap::new, reducing((a, b) -> a)))
        .values().stream()
        .filter(Optional::isPresent)
        .map(Optional::get)
        .collect(toList());

The reducing part will keep the first of 1 or more occurences. Here is the documentation for LinkedHashMap and the specific groupingBy I am using.

Manos Nikolaidis
  • 21,608
  • 12
  • 74
  • 82