
I am reading data from an Excel file using Apache POI and transforming it into a list of objects. Now I want to extract any duplicates, based on certain rules, into another list of that object, and also get the non-duplicate list.
Condition to check for a duplicate

  1. name
  2. email
  3. phone number
  4. gst number

Any of these properties can result in a duplicate, which means an OR, not an AND.
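
For illustration, this is the pairwise rule in code (a hypothetical helper, just to make the OR semantics concrete; `isDuplicatePair` is not part of my actual code):

import java.util.Objects;

// Hypothetical helper: two parties are duplicates of each other if ANY
// of the four properties match (an OR), not only if all of them match
static boolean isDuplicatePair(Party a, Party b) {
    return Objects.equals(a.getName(), b.getName())
            || Objects.equals(a.getEmail(), b.getEmail())
            || Objects.equals(a.getNumber(), b.getNumber())
            || Objects.equals(a.getGstNumber(), b.getGstNumber());
}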

Party Class

import java.math.BigDecimal;
import java.time.LocalDateTime;

public class Party {

    private String name;

    private Long number;

    private String email;

    private String address;

    private BigDecimal openingBalance;

    private LocalDateTime openingDate;

    private String gstNumber;
    // Getters and setters skipped
}

Let's say this is the list returned by the Excel-parsing logic so far:

    var firstParty = new Party();
    firstParty.setName("Valid Party");
    firstParty.setAddress("Valid");
    firstParty.setEmail("Valid");
    firstParty.setGstNumber("Valid");
    firstParty.setNumber(1234567890L);
    firstParty.setOpeningBalance(BigDecimal.ZERO);
    firstParty.setOpeningDate(DateUtil.getDDMMDateFromString("01/01/2020"));

    var secondParty = new Party();
    secondParty.setName("Valid Party");
    secondParty.setAddress("Valid Address");
    secondParty.setEmail("Valid Email");
    secondParty.setGstNumber("Valid GST");
    secondParty.setNumber(7593612247L);
    secondParty.setOpeningBalance(BigDecimal.ZERO);
    secondParty.setOpeningDate(DateUtil.getDDMMDateFromString("01/01/2020"));

    var thirdParty = new Party();
    thirdParty.setName("Valid Party 1");
    thirdParty.setAddress("address");
    thirdParty.setEmail("email");
    thirdParty.setGstNumber("gst");
    thirdParty.setNumber(7593612888L);
    thirdParty.setOpeningBalance(BigDecimal.ZERO);
    thirdParty.setOpeningDate(DateUtil.getDDMMDateFromString("01/01/2020"));

    var validParties = List.of(firstParty, secondParty, thirdParty);

What I have attempted so far:


// Count how many times each party name occurs
var partyNameOccurrenceMap = validParties.parallelStream()
        .map(Party::getName)
        .collect(Collectors.groupingBy(Function.identity(), HashMap::new, Collectors.counting()));

// Keep an untouched copy of the counts for building the non-duplicate list
var partyNameOccurrenceMapCopy = SerializationUtils.clone(partyNameOccurrenceMap);

// All but the last occurrence of every repeated name go to the duplicate list
var duplicateParties = validParties.stream()
        .filter(party-> {
            var occurrence = partyNameOccurrenceMap.get(party.getName());
            if (occurrence > 1) {
                partyNameOccurrenceMap.put(party.getName(), occurrence - 1);
                return true;
            }
            return false;
        })
        .toList();
// Only the last occurrence of each name (including unique names) is kept here
var nonDuplicateParties = validParties.stream()
        .filter(party -> {
            var occurrence = partyNameOccurrenceMapCopy.get(party.getName());
            if (occurrence > 1) {
                partyNameOccurrenceMapCopy.put(party.getName(), occurrence - 1);
                return false;
            }
            return true;
        })
        .toList();

The above code only checks for the party name, but we also need to check for email, phone number and GST number.

The code written above works just fine, but readability, conciseness and performance might be a concern, as the data set is fairly large (around 10k rows in the Excel file).

Yash garg
  • Can you override `equals()` (and `hashCode()`) of your Party class? In this case you can use `distinct()`. More info here: https://stackoverflow.com/questions/27911406/java8-streams-remove-duplicates-with-stream-distinct – Kaiak Aug 03 '22 at 15:30
  • By non-duplicate list do you mean a list of the items which are never duplicated? Or do you mean a list of all items but skipping the duplicate copies? If the latter, if B is a duplicate of A and C is a duplicate of B (but different field), do you want to just get A? Or get A and C because C is not a duplicate of anything in the list. – btilly Aug 03 '22 at 15:39
  • @AlexanderIvanchenko `Function.identity()` is used for String not for the Party class itself. I have mapped it to name property. – Yash garg Aug 03 '22 at 16:46
  • @Kaiak I can override `equals()/hashCode()` but the fields should be compared using a `||` not using `&&` – Yash garg Aug 03 '22 at 16:48
  • @btilly In the situation where B is a duplicate of A and C is a duplicate of B, I want to get both A and C, because the list then contains two completely different elements. – Yash garg Aug 03 '22 at 16:51
  • @Yashgarg No, sorry, it wouldn't work. I've explained in the answer why this approach of using a broken `equals/hashCode` implementation is **not** viable. – Alexander Ivanchenko Aug 03 '22 at 20:02

2 Answers


Never ignore the equals/hashCode contract

name, email, number, gstNumber

Any of these properties can result in a duplicate, which means an OR

Your definition of a duplicate implies that any of these properties should match, whilst others might not.

It means that it's impossible to provide an implementation of equals/hashCode that matches the given definition and doesn't violate the hashCode contract.

If two objects are equal according to the equals method, then calling the hashCode method on each of the two objects must produce the same integer result.

I.e. if you implement equals in such a way that any (not all) of these properties: name, email, number, gstNumber could match, and that would be enough to consider the two objects equal, then there's no way to implement hashCode correctly.

And as a consequence, you can't use an object with a broken equals/hashCode implementation with a hash-based Collection, because equal objects might end up in different buckets (since they can produce different hashes). I.e. HashMap would not be able to recognize the duplicated keys, hence groupingBy() with Function.identity() as a classifier function would not work properly.
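
To make the violation concrete, here's a small illustration (hypothetical values, using the setters of the Party class from the question): an OR-based equality isn't even transitive, so no hashCode can agree with it.

// Hypothetical example: OR-based "equality" is not transitive
var a = new Party();
a.setName("shared-name");
a.setEmail("a@example.com");

var b = new Party();
b.setName("shared-name");         // b "equals" a by name
b.setEmail("shared@example.com");

var c = new Party();
c.setName("other-name");
c.setEmail("shared@example.com"); // c "equals" b by email

// a ~ b and b ~ c, but a and c share no property, so a !~ c.
// Transitivity is broken, and no hashCode could consistently place
// a, b and c into the same bucket.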

Therefore, to address this problem, you would need to implement equals() based on all 4 properties: name, email, number, gstNumber (i.e. all these values have to be equal), and similarly all these values must contribute to the hash code.

How to determine Duplicates

There's no easy way to determine duplicates by multiple criteria. The solution you've provided is not viable, since we can't rely on the equals/hashCode.

The only way is to generate a HashMap separately for each and every attribute (i.e. in this case we need 4 maps). But can we avoid repeating the same steps for each map and hard-coding the logic?

Yes, we can.

We can create a custom generic accumulation type (it would be suitable for any class - no hard-coded logic) that would encapsulate all the logic of determining duplicates and maintain an arbitrary number of maps under the hood. After consuming all the elements from the given collection, this custom object would be aware of all the duplicates in it.

That's how it can be implemented.

A custom accumulation type that would be used as the container of a custom Collector. Its constructor expects varargs of functions; each function corresponds to a property that should be taken into account while checking whether an object is a duplicate.

public static class DuplicateChecker<T> implements Consumer<T> {
    
    private List<DuplicateHandler<T>> handles;
    private Set<T> duplicates;

    @SafeVarargs
    public DuplicateChecker(Function<T, ?>... keyExtractors) {
        this.handles = Arrays.stream(keyExtractors)
            .map(DuplicateHandler::new)
            .toList();
    }

    @Override
    public void accept(T t) {
        handles.forEach(h -> h.accept(t));
    }
    
    public DuplicateChecker<T> merge(DuplicateChecker<T> other) {
        // Merge pairwise: the i-th handler of each checker tracks the same
        // property, so only matching handlers may be merged with each other
        for (int i = 0; i < handles.size(); i++) {
            handles.get(i).merge(other.handles.get(i));
        }

        return this;
    }

    public DuplicateChecker<T> finish() {
        duplicates = handles.stream()
            .flatMap(handler -> handler.getDuplicates().stream())
            .flatMap(Set::stream)
            .collect(Collectors.toSet());
        
        return this;
    }
    
    public boolean isDuplicate(T t) {
        return duplicates.contains(t);
    }
}

A helper class representing a single criterion (like name, email, etc.), which encapsulates a HashMap. keyExtractor is used to obtain a key from an object of type T.

public static class DuplicateHandler<T> implements Consumer<T> {
    private Map<Object, Set<T>> itemByKey = new HashMap<>();
    private Function<T, ?> keyExtractor;

    public DuplicateHandler(Function<T, ?> keyExtractor) {
        this.keyExtractor = keyExtractor;
    }

    @Override
    public void accept(T t) {
        itemByKey.computeIfAbsent(keyExtractor.apply(t), k -> new HashSet<>()).add(t);
    }

    public void merge(DuplicateHandler<T> other) {
        other.itemByKey.forEach((k, v) ->
            itemByKey.merge(k,v,(oldV, newV) -> { oldV.addAll(newV); return oldV; }));
    }
    
    public Collection<Set<T>> getDuplicates() {
        // values() is a view backed by the map, so removeIf() also
        // removes the corresponding entries from itemByKey
        Collection<Set<T>> duplicates = itemByKey.values();
        duplicates.removeIf(set -> set.size() == 1); // a single-element set means the object is unique by this property

        return duplicates;
    }
}

And this is the method responsible for generating the map of duplicates, which would be used by the client code. The given collection would be partitioned into two parts: duplicates mapped to the key true, unique objects mapped to the key false.

@SafeVarargs
public static <T> Map<Boolean, List<T>> partitionByProperties(Collection<T> parties,
                                                              Function<T, ?>... keyExtractors) {
    
    DuplicateChecker<T> duplicateChecker = parties.stream()
        .collect(Collector.of(
            () -> new DuplicateChecker<>(keyExtractors),
            DuplicateChecker::accept,
            DuplicateChecker::merge,
            DuplicateChecker::finish
        ));
    
    return parties.stream()
        .collect(Collectors.partitioningBy(duplicateChecker::isDuplicate));
}

And that's how you can apply it to your particular case.

main()

public static void main(String[] args) {
    List<Party> parties = // initializing the list of parties
    
    Map<Boolean, List<Party>> isDuplicate = partitionByProperties(parties,
            Party::getName, Party::getNumber,
            Party::getEmail, Party::getGstNumber);
}
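
The two resulting lists can then be taken from the partition map (partitioningBy() always populates both keys):

List<Party> duplicates = isDuplicate.get(true);
List<Party> uniqueParties = isDuplicate.get(false);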
Alexander Ivanchenko
  • Perfecttttt!!!! Thank you so much. It's functional and way cleaner and more concise. Just a small change according to my requirement: instead of using a Set in the `DuplicateHandler` class, we can use a `List` to preserve the order of duplicate occurrences, and then, when calling the `finish()` method, instead of `.flatMap(Set::stream)` we can use `.flatMap(values -> values.stream().skip(1L))` so that the first occurrence stays out of the duplicate list. – Yash garg Aug 03 '22 at 20:02
  • @Yashgarg `get the duplicates from the individual DuplicateHandler class` - all the objects that are proved to be duplicates while checking only a particular property. Sure, that's doable. You can expose `handles` via a getter. But note that it's encapsulated on purpose; from a clean-coding perspective it would be better to introduce a method for a specific use case, for instance one that returns an unmodifiable `Set` mapped to a specific property, i.e. corresponding to a function like `Party::getEmail`. – Alexander Ivanchenko Aug 03 '22 at 20:16
  • @Yashgarg `i.e instead of using set in DuplicateHandler Class we can use a List to preserve the occurrence` - `List` will do if there is a guarantee that you have no entries in your file having identical name, email, number and gstNumber (which might happen due to mere human mistake). – Alexander Ivanchenko Aug 03 '22 at 20:26
  • That's what we need to achieve: suppose we have 10 records having the same name, then we need to keep just the first record and move the remaining 9 records to the duplicate list. – Yash garg Aug 03 '22 at 20:39
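
For reference, a minimal sketch of the variant discussed in this comment thread (keep the first occurrence per property, treat the rest as duplicates). The class name and the use of a List to preserve insertion order are assumptions, not part of the answer above:

// Hypothetical variant of DuplicateHandler: a List preserves insertion
// order, so everything after the first occurrence per key is a duplicate
public static class FirstKeepingHandler<T> implements Consumer<T> {
    private final Map<Object, List<T>> itemsByKey = new HashMap<>();
    private final Function<T, ?> keyExtractor;

    public FirstKeepingHandler(Function<T, ?> keyExtractor) {
        this.keyExtractor = keyExtractor;
    }

    @Override
    public void accept(T t) {
        itemsByKey.computeIfAbsent(keyExtractor.apply(t), k -> new ArrayList<>()).add(t);
    }

    // All occurrences after the first one for a given key are duplicates
    public Stream<T> duplicates() {
        return itemsByKey.values().stream()
                .filter(items -> items.size() > 1)
                .flatMap(items -> items.stream().skip(1));
    }
}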

I would create a map for each property, where

  • key is the property we want to check for duplicates,
  • value is a Set containing the indexes of all elements in the list with the same key.

Then we can

  1. filter values in the map with more than 1 index (i.e. duplicate indexes),
  2. union all the duplicate indexes,
  3. determine whether each element is duplicate or unique by using the duplicate indexes.

The time complexity is roughly O(n).

public class UniquePerEachProperty {
    private static void separate(List<Party> partyList) {
        Map<String, Set<Integer>> nameToIndexesMap = new HashMap<>();
        Map<String, Set<Integer>> emailToIndexesMap = new HashMap<>();
        Map<Long, Set<Integer>> numberToIndexesMap = new HashMap<>();
        Map<String, Set<Integer>> gstNumberToIndexesMap = new HashMap<>();
        for (int i = 0; i < partyList.size(); i++) {
            Party party = partyList.get(i);
            nameToIndexesMap.putIfAbsent(party.getName(), new HashSet<>());
            nameToIndexesMap.get(party.getName()).add(i);

            emailToIndexesMap.putIfAbsent(party.getEmail(), new HashSet<>());
            emailToIndexesMap.get(party.getEmail()).add(i);

            numberToIndexesMap.putIfAbsent(party.getNumber(), new HashSet<>());
            numberToIndexesMap.get(party.getNumber()).add(i);

            gstNumberToIndexesMap.putIfAbsent(party.getGstNumber(), new HashSet<>());
            gstNumberToIndexesMap.get(party.getGstNumber()).add(i);
        }
        Set<Integer> duplicatedIndexes = Stream.of(
                        nameToIndexesMap.values(),
                        emailToIndexesMap.values(),
                        numberToIndexesMap.values(),
                        gstNumberToIndexesMap.values()
                ).flatMap(Collection::stream).filter(indexes -> indexes.size() > 1)
                .flatMap(Set::stream).collect(Collectors.toSet());
        List<Party> duplicatedList = new ArrayList<>();
        List<Party> uniqueList = new ArrayList<>();
        for (int i = 0; i < partyList.size(); i++) {
            Party party = partyList.get(i);
            if (duplicatedIndexes.contains(i)) {
                duplicatedList.add(party);
            } else {
                uniqueList.add(party);
            }
        }
        System.out.println("duplicated:" + duplicatedList);
        System.out.println("unique:" + uniqueList);
    }

    public static void main(String[] args) {
        separate(List.of(
                // name duplicate
                new Party("name1", 1L, "email1", "gstNumber1"),
                new Party("name1", 2L, "email2", "gstNumber2"),
                // number duplicate
                new Party("name3", 3L, "email3", "gstNumber3"),
                new Party("name4", 3L, "email4", "gstNumber4"),
                // email duplicate
                new Party("name5", 5L, "email5", "gstNumber5"),
                new Party("name6", 6L, "email5", "gstNumber6"),
                // gstNumber duplicate
                new Party("name7", 7L, "email7", "gstNumber7"),
                new Party("name8", 8L, "email8", "gstNumber7"),
                // unique
                new Party("name9", 9L, "email9", "gstNumber9")
        ));
    }
}

Assume Party has the below constructor and toString() (for testing):

public class Party {
    public Party(String name, Long number, String email, String gstNumber) {
        this.name = name;
        this.number = number;
        this.email = email;
        this.address = "";
        this.openingBalance = BigDecimal.ZERO;
        this.openingDate = LocalDateTime.MIN;
        this.gstNumber = gstNumber;
    }

    @Override
    public String toString() {
        return "Party{" +
                "name='" + name + '\'' +
                ", number=" + number +
                ", email='" + email + '\'' +
                ", gstNumber='" + gstNumber + '\'' +
                '}';
    }
    ...
}
samabcde
  • Thanks @samabcde, your logic pretty much covers everything, but it excludes both of the duplicate values, whereas we need to preserve one of them. – Yash garg Aug 03 '22 at 17:35
  • I wanted to preserve the first duplicate and discard the remaining duplicates; that's why I used `List` instead of `Set` to preserve their order of occurrence in the list, and then used `flatMap(indexes -> indexes.stream().skip(1L))` after the `.filter(indexes -> indexes.size() > 1)`, and removed `flatMap(Set::stream)` of course. – Yash garg Aug 03 '22 at 17:48
  • And is there a more functional way of writing this using streams? Anyone? – Yash garg Aug 03 '22 at 17:52
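
A minimal sketch of the change described in the comments above; it assumes the four maps are switched to hold `List<Integer>` values (e.g. backed by `ArrayList`) so that insertion order is preserved:

// With insertion-ordered List<Integer> values, skipping the first index
// per key keeps the first occurrence out of the duplicated set
Set<Integer> duplicatedIndexes = Stream.of(
                nameToIndexesMap.values(),
                emailToIndexesMap.values(),
                numberToIndexesMap.values(),
                gstNumberToIndexesMap.values()
        ).flatMap(Collection::stream)
        .filter(indexes -> indexes.size() > 1)
        .flatMap(indexes -> indexes.stream().skip(1L))
        .collect(Collectors.toSet());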