How to identify duplicate records in a list?

Question

I have the following problem:

I want to remove duplicate entries from a list of MyVo objects, where two entries count as duplicates if their registered field has the same value. Below is the solution I am trying. This is the data in the list I am building:

List<MyVo> dataList = new ArrayList<MyVo>();

MyVo  data1 = new MyVo();
data1.setValidated(1);
data1.setName("Fernando");
data1.setRegistered("008982");

MyVo data2 = new MyVo();
data2.setValidated(0);
data2.setName("Orlando");
data2.setRegistered("008986");

MyVo data3 = new MyVo();
data3.setValidated(1);
data3.setName("Magda");
data3.setRegistered("008982");


MyVo data4 = new MyVo();
data4.setValidated(1);
data4.setName("Jess");
data4.setRegistered("006782");

dataList.add(data1);
dataList.add(data2);
dataList.add(data3);
dataList.add(data4);

The first thing I have to do is split it into two different lists depending on whether the data is validated or not, using the value of the validated field.

List<MyVo> registeredBusinesses = new ArrayList<MyVo>();
List<MyVo> unregisteredBusinesses = new ArrayList<MyVo>();

for (MyVo map : dataList) {
    if (map.getValidated() == 0) {
        unregisteredBusinesses.add(map);
    }else {
        registeredBusinesses.add(map);
    }
}

Now, from the list of registered businesses, I want to take out the entries that share the same value in their registered field and put them in a new list. This is what I tried, but it doesn't work correctly:

List<MyVo> duplicateList = registeredBusinesses.stream().filter(distictByRegistered(MyVo::getRegistered)).collect(Collectors.toList());


public static <T> Predicate<T> distictByRegistered(Function<? super T, ?> keyExtractor) {
    Set<Object> seen = ConcurrentHashMap.newKeySet();
    return t -> seen.add(keyExtractor.apply(t));
}

however using this method I get the following output:

{["validated":1,"name":"Fernando","registered":"008982"], ["validated":1,"name":"Jess","registered":"006782"]}

the output I want to obtain is the following:

the unregisteredBusinesses list:

{["validated":0,"name":"Orlando","registered":"008986"]}

the registeredBusinesses list:

{["validated":1,"name":"Jess","registered":"006782"]}

the registeredDuplicateBusinesses list:

{["validated":1,"name":"Fernando","registered":"008982"], 
["validated":1,"name":"Magda","registered":"008982"]}

I don't know how to do it; could you help me? I would also like to use lambdas to shorten the code, for example in the first for loop where I split the data into two lists.

I fail to understand: you want to filter out all duplicates when `validated` is equal to 1, based on the `registered` value? Why are you not using `MyVo::getRegistered` instead of `Function.identity` as the key? Then use a merger that throws away the duplicate (the default merger will throw an exception): https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function-java.util.stream.Collector- — NoDataFound, May 17 '21 at 13:31
Hello, I already tried what you suggested but it still does not work. I have edited the question. I do not know where you tell it to compare a specific field to decide whether an entry is repeated or not — Sebastian Ruiz, May 17 '21 at 16:34

score 1 · Answer 1 · answered May 17 '21 at 13:26

Your approach looks almost correct. Grouping by Function.identity() will properly flag duplicates (based on the equals() implementation!); you could also group by a unique property/id of your object if you have one. What you're missing is manipulating the resulting map to get a list with all duplicates. I've added comments describing what's happening here.

List<MyVo> duplicateList = registeredBusinesses.stream()
    .collect(Collectors.groupingBy(Function.identity()))
    .entrySet()
    .stream()
    .filter(e -> e.getValue().size() > 1) // stream of Map.Entry<MyVo, List<MyVo>>; keep entries whose value has size > 1
    .map(Map.Entry::getValue) // convert this into a Stream<List<MyVo>>
    .flatMap(Collection::stream) // flatten all duplicates into one stream using Collection::stream
    .collect(Collectors.toList()); // at this stage we have a Stream<MyVo> with all duplicates, so we collect it to a list
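
If overriding equals() is not an option, the same pipeline can group by the registered field directly. The following is only a sketch under assumptions: the MyVo class here is a reduced, hypothetical stand-in for the one in the question, and the class name DuplicatesByKey and the method duplicatesByRegistered are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.stream.Collectors;

public class DuplicatesByKey {

    // Hypothetical stand-in for the question's MyVo, reduced to the fields used here.
    static class MyVo {
        private final String name;
        private final String registered;

        MyVo(String name, String registered) {
            this.name = name;
            this.registered = registered;
        }

        String getName() { return name; }
        String getRegistered() { return registered; }
    }

    // Returns every element whose registered value occurs more than once.
    static List<MyVo> duplicatesByRegistered(List<MyVo> input) {
        return input.stream()
                // LinkedHashMap keeps the groups in first-seen order
                .collect(Collectors.groupingBy(MyVo::getRegistered, LinkedHashMap::new, Collectors.toList()))
                .values().stream()
                .filter(group -> group.size() > 1) // keep only keys shared by several elements
                .flatMap(List::stream)             // flatten the groups back into single elements
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<MyVo> registeredBusinesses = Arrays.asList(
                new MyVo("Fernando", "008982"),
                new MyVo("Magda", "008982"),
                new MyVo("Jess", "006782"));

        duplicatesByRegistered(registeredBusinesses)
                .forEach(vo -> System.out.println(vo.getName() + " " + vo.getRegistered()));
        // prints "Fernando 008982" then "Magda 008982"
    }
}
```

Grouping by a key extractor leaves equals() and hashCode() of MyVo untouched, which may matter if other code relies on their default behaviour.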

Additionally, you could also use stream API to split dataList into registered and unRegistered.

First we create a method isUnregistered in MyVo

public boolean isUnregistered() {
  return getValidated() == 0;
}

Then

Map<Boolean, List<MyVo>> registeredMap = dataList.stream().collect(Collectors.groupingBy(MyVo::isUnregistered));

Where registeredMap.get(true) will be unregisteredBusinesses and registeredMap.get(false) will be registeredBusinesses.

Hello, I already tried what you suggested but it still does not work. I have edited the question. I do not know where you tell it to compare a specific field to decide whether an entry is repeated or not — Sebastian Ruiz, May 17 '21 at 16:34
Function.identity() compares the entries by using itself, this is by checking the equals() method, you need to override equals() in your MyVo object and make it compare whatever means for you "unique". — Yayotrón, May 17 '21 at 18:29

score 1 · Accepted Answer · answered Jul 30 '21 at 01:48

You are looking for both registered and unregistered businesses. Instead of using 0 and 1, you could implement the attribute as a boolean isRegistered, where 0 maps to false and 1 maps to true. Your existing if-else code could then be rewritten as:

Map<Boolean, List<MyVo>> partitionBasedOnRegistered = dataList.stream()
         .collect(Collectors.partitioningBy(MyVo::isRegistered));
List<MyVo> unregisteredBusinesses = partitionBasedOnRegistered.get(Boolean.FALSE); // here
List<MyVo> registeredBusinesses = partitionBasedOnRegistered.get(Boolean.TRUE);
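
If MyVo has to keep its int validated field, a lambda predicate can stand in for the MyVo::isRegistered method reference. This is a minimal sketch, assuming getValidated() returns an int as in the question; the reduced MyVo class, the class name PartitionByValidated and the method partition are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PartitionByValidated {

    // Hypothetical stand-in for the question's MyVo, reduced to the fields used here.
    static class MyVo {
        private final int validated;
        private final String name;

        MyVo(int validated, String name) {
            this.validated = validated;
            this.name = name;
        }

        int getValidated() { return validated; }
        String getName() { return name; }
    }

    // true  -> validated != 0 (registered), false -> validated == 0 (unregistered),
    // mirroring the if/else in the question.
    static Map<Boolean, List<MyVo>> partition(List<MyVo> dataList) {
        return dataList.stream()
                .collect(Collectors.partitioningBy(vo -> vo.getValidated() != 0));
    }

    public static void main(String[] args) {
        List<MyVo> dataList = Arrays.asList(
                new MyVo(1, "Fernando"),
                new MyVo(0, "Orlando"),
                new MyVo(1, "Magda"),
                new MyVo(1, "Jess"));

        Map<Boolean, List<MyVo>> parts = partition(dataList);
        System.out.println("unregistered: " + parts.get(false).size()); // prints "unregistered: 1"
        System.out.println("registered: " + parts.get(true).size());    // prints "registered: 3"
    }
}
```

Unlike groupingBy, the map returned by partitioningBy always contains both the true and the false key, so neither get call returns null even when one side is empty.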

score 0 · Answer 3 · answered May 17 '21 at 14:41

Familiarizing yourself with Collectors.partitioningBy should help you solve this. There are two places in your current requirement where it could be applied.

You are looking for both registered and unregistered businesses. Instead of using 0 and 1, you could implement the attribute as a boolean isRegistered, where 0 maps to false and 1 maps to true. Your existing if-else code could then be rewritten as:

Map<Boolean, List<MyVo>> partitionBasedOnRegistered = dataList.stream()
         .collect(Collectors.partitioningBy(MyVo::isRegistered));
List<MyVo> unregisteredBusinesses = partitionBasedOnRegistered.get(Boolean.FALSE); // here
List<MyVo> registeredBusinesses = partitionBasedOnRegistered.get(Boolean.TRUE);

Next, when you group the registered businesses by registration number (instead of by identity), you need both the duplicate elements and the unique ones. Effectively that is all entries, but again partitioned into two buckets: one where the group size == 1 and one where it is > 1. Since grouping guarantees at least one element for each key, you can collect the required output with an additional mapping.
```
// group registered businesses by number
Map<String, List<MyVo>> groupByRegistrationNumber = registeredBusinesses.stream()
         .collect(Collectors.groupingBy(MyVo::getRegistered));

Map<Boolean, List<List<MyVo>>> partitionBasedOnDuplicates = groupByRegistrationNumber
         .entrySet().stream()
         .collect(Collectors.partitioningBy(e -> e.getValue().size() > 1,
                 Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
```
If you access the FALSE values of the above map, you get groupedRegisteredUniqueBusiness; the values against the TRUE key give you groupedRegisteredDuplicateBusiness.
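
Putting both partitioning steps together, the three lists from the question could be produced roughly as follows. This is a sketch under assumptions, not a definitive implementation: the reduced MyVo class, the class name SplitBusinesses, the helper flatten and the string keys of the result map are all hypothetical, and the code sticks to Java 8 APIs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SplitBusinesses {

    // Hypothetical stand-in for the question's MyVo.
    static class MyVo {
        private final int validated;
        private final String name;
        private final String registered;

        MyVo(int validated, String name, String registered) {
            this.validated = validated;
            this.name = name;
            this.registered = registered;
        }

        int getValidated() { return validated; }
        String getName() { return name; }
        String getRegistered() { return registered; }
    }

    // Flattens a list of groups into a single list.
    static List<MyVo> flatten(List<List<MyVo>> groups) {
        return groups.stream().flatMap(List::stream).collect(Collectors.toList());
    }

    // Returns the three lists under keys named after the lists in the question.
    static Map<String, List<MyVo>> split(List<MyVo> dataList) {
        // Step 1: validated == 0 goes to false, everything else to true.
        Map<Boolean, List<MyVo>> byValidated = dataList.stream()
                .collect(Collectors.partitioningBy(vo -> vo.getValidated() != 0));

        // Step 2: group the registered ones by number, then partition the groups by size.
        Map<Boolean, List<List<MyVo>>> byDuplicate = byValidated.get(true).stream()
                .collect(Collectors.groupingBy(MyVo::getRegistered))
                .values().stream()
                .collect(Collectors.partitioningBy(group -> group.size() > 1));

        Map<String, List<MyVo>> result = new LinkedHashMap<>();
        result.put("unregisteredBusinesses", byValidated.get(false));
        result.put("registeredBusinesses", flatten(byDuplicate.get(false)));
        result.put("registeredDuplicateBusinesses", flatten(byDuplicate.get(true)));
        return result;
    }

    public static void main(String[] args) {
        List<MyVo> dataList = Arrays.asList(
                new MyVo(1, "Fernando", "008982"),
                new MyVo(0, "Orlando", "008986"),
                new MyVo(1, "Magda", "008982"),
                new MyVo(1, "Jess", "006782"));

        split(dataList).forEach((label, list) -> System.out.println(
                label + ": " + list.stream().map(MyVo::getName).collect(Collectors.toList())));
        // unregisteredBusinesses: [Orlando]
        // registeredBusinesses: [Jess]
        // registeredDuplicateBusinesses: [Fernando, Magda]
    }
}
```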

Do note that if you were to flatten this List<List<MyVo>> in order to get a List<MyVo> as output, you could also use the flatMapping collector, which is built into the JDK from Java 9 onwards.
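
A possible shape for that flatMapping variant is sketched below. It requires Java 9 or later; the reduced MyVo class, the class name FlatMappingDuplicates and the method partitionByDuplicate are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FlatMappingDuplicates {

    // Hypothetical stand-in for the question's MyVo, reduced to the fields used here.
    static class MyVo {
        private final String name;
        private final String registered;

        MyVo(String name, String registered) {
            this.name = name;
            this.registered = registered;
        }

        String getName() { return name; }
        String getRegistered() { return registered; }
    }

    // true -> elements whose registered value is shared, false -> elements with a unique value.
    static Map<Boolean, List<MyVo>> partitionByDuplicate(List<MyVo> registeredBusinesses) {
        Map<String, List<MyVo>> byNumber = registeredBusinesses.stream()
                .collect(Collectors.groupingBy(MyVo::getRegistered));

        return byNumber.values().stream()
                .collect(Collectors.partitioningBy(
                        group -> group.size() > 1,
                        // flatMapping (Java 9+) flattens each group while collecting
                        Collectors.flatMapping((List<MyVo> group) -> group.stream(),
                                Collectors.toList())));
    }

    public static void main(String[] args) {
        List<MyVo> registeredBusinesses = Arrays.asList(
                new MyVo("Fernando", "008982"),
                new MyVo("Magda", "008982"),
                new MyVo("Jess", "006782"));

        Map<Boolean, List<MyVo>> parts = partitionByDuplicate(registeredBusinesses);
        System.out.println("duplicates: " + parts.get(true).size()); // prints "duplicates: 2"
        System.out.println("unique: " + parts.get(false).size());    // prints "unique: 1"
    }
}
```

The result is a Map<Boolean, List<MyVo>> rather than a Map<Boolean, List<List<MyVo>>>, so no separate flattening pass is needed afterwards.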

Hello, I already tried what you suggested but it still does not work. I have edited the question. I do not know where you tell it to compare a specific field to decide whether an entry is repeated or not — Sebastian Ruiz, May 17 '21 at 16:35
@SebastianRuiz `groupByRegistrationNumber` would be specific to a field such as `...collect(Collectors.groupingBy(MyVo::registrationNumber)`, and the count of values after grouping would convey if they are duplicates or not, that is where `e.getValue().size() > 1` is used. — Naman, May 18 '21 at 02:25
