I am attempting to filter/reduce a stream of data that has some duplicated entries in it.
In essence, I am attempting to find a better way to filter a set of data than the one I implemented. We have data that, at its base, looks something like this:
Action | Date | Detail
15 | 2016-03-15 |
5 | 2016-03-15 | D1
5 | 2016-09-25 | D2 <--
5 | 2016-09-25 | D3 <-- same day, different detail
4 | 2017-02-08 | D4
4 | 2017-02-08 | D5
5 | 2017-03-01 | D6 <--
5 | 2017-03-05 | D6 <-- different day, same detail; need earliest
5 | 2017-03-08 | D7
5 | 2017-03-10 | D8
...
I need to extract the details such that:
- Only action 5 is selected
- If a detail is the same (e.g., D6 appears twice on different days), the entry with the earliest date is selected
These data are loaded into Objects (one instance for each "record"), and there are other fields on the Object but they are not relevant for this filtering. The Detail is stored as a String, the Date as a ZonedDateTime, and the Action is an int (well, actually an enum, but here shown as an int). The Objects are given in a List<Entry> in chronological order.
I was able to get a working, but what I consider to be suboptimal, solution by doing:
    List<Entry> entries = getEntries(); // retrieved from a server
    final Set<String> update = new HashSet<>();
    List<Entry> updates =
        entries.stream()
               .filter(e -> e.getType() == 5)
               .filter(e -> pass(e, update))
               .collect(Collectors.toList());

    private boolean pass(Entry ehe, Set<String> update)
    {
        final String val = ehe.getDetail();
        if (update.contains(val)) { return false; }
        update.add(val);
        return true;
    }
But the issue is that I had to use this pass() method, which checks a Set<String> to track whether a given Detail has already been processed. While this approach works, it seems like it should be possible to avoid an external reference.
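For illustration, the check-then-add inside pass() can be folded into a single call, because Set.add returns false when the element is already present. This is only a minimal sketch: the Entry class below is a hypothetical stand-in with just the two fields used here, and the class and method names are made up for the example. It still relies on the external set, so it shortens the code without removing the state, and it assumes a sequential stream.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class SeenFilterSketch {

    // Hypothetical stand-in for the real Entry class (only the fields used here).
    static final class Entry {
        private final int type;
        private final String detail;

        Entry(int type, String detail) {
            this.type = type;
            this.detail = detail;
        }

        int getType() { return type; }
        String getDetail() { return detail; }
    }

    // Keeps the first action-5 entry for each detail, in encounter order.
    // Set.add returns false for a detail that was already seen.
    // Assumption: the stream is sequential; the predicate is stateful.
    static List<Entry> firstPerDetail(List<Entry> entries) {
        final Set<String> seen = new HashSet<>();
        return entries.stream()
                .filter(e -> e.getType() == 5)
                .filter(e -> seen.add(e.getDetail()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry(5, "D6"),
                new Entry(4, "D4"),
                new Entry(5, "D6"),
                new Entry(5, "D7"));
        firstPerDetail(entries).forEach(e -> System.out.println(e.getDetail()));
    }
}
```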
I tried to use a groupingBy on the Detail, which would allow extracting the earliest entry from each list; the problem was that I no longer had a date ordering and I had to process the resultant Map<String,List<Entry>>.
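One way the grouping idea might keep the date ordering is Collectors.toMap with a merge function and a LinkedHashMap supplier instead of groupingBy. The sketch below is a possible approach, not a definitive one: Entry is a hypothetical stand-in for the real class, the class and method names are invented for the example, and the merge function keeping the first value is only correct under the stated assumption that the input list is already chronological.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class EarliestPerDetailSketch {

    // Hypothetical stand-in for the real Entry class (only the relevant fields).
    static final class Entry {
        private final int type;
        private final ZonedDateTime date;
        private final String detail;

        Entry(int type, String isoDay, String detail) {
            this.type = type;
            this.date = LocalDate.parse(isoDay).atStartOfDay(ZoneOffset.UTC);
            this.detail = detail;
        }

        int getType() { return type; }
        ZonedDateTime getDate() { return date; }
        String getDetail() { return detail; }
    }

    // Collects action-5 entries into a map keyed by detail.
    // The merge function keeps the value already in the map, which is the
    // earliest one only because the input is assumed to be chronological.
    // LinkedHashMap preserves the order in which details were first seen.
    static List<Entry> earliestPerDetail(List<Entry> entries) {
        Map<String, Entry> byDetail = entries.stream()
                .filter(e -> e.getType() == 5)
                .collect(Collectors.toMap(
                        Entry::getDetail,
                        e -> e,
                        (first, later) -> first,
                        LinkedHashMap::new));
        return new ArrayList<>(byDetail.values());
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry(5, "2016-09-25", "D2"),
                new Entry(5, "2016-09-25", "D3"),
                new Entry(4, "2017-02-08", "D4"),
                new Entry(5, "2017-03-01", "D6"),
                new Entry(5, "2017-03-05", "D6"),
                new Entry(5, "2017-03-08", "D7"));
        for (Entry e : earliestPerDetail(entries)) {
            System.out.println(e.getDetail() + " " + e.getDate().toLocalDate());
        }
    }
}
```

If the input were not guaranteed to be sorted, the merge function would have to compare the two dates instead of simply keeping the first value.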
It seems like some reduce operation (if I am using that term correctly), without the use of the pass() method, should be possible here, but I am struggling to get a better implementation.
What would be a better approach such that the .filter(e -> pass(e, update)) could be removed?
Thank you!