HashSet not removing all duplicate entries

Question

I am attempting to use a HashSet to make sure data I read in from a .txt file are unique.

Below is the sample data:

999990  bummer
999990  bummer
999990  bummer
999990  bummer
99999   bummer
999990  bummerr

The data is read in using java.io.File and java.util.Scanner, and each line is stored as a Term object.

Reading in terms:

while (rawTerms.hasNextLine()){
    String[] tokens = rawTerms.nextLine().trim().split(delimiter);
    if (tokens.length == 2) {               
        uniqueSet.add(new Term(Double.parseDouble(tokens[0]), tokens[1])); //add the term to set
    }
    else {
      rawTerms.close();
      throw new Exception("Invalid member length: "+ tokens.length);
    }           
}

allTerms = new ArrayList<>(uniqueSet); // Convert the set into an ArrayList

Term class using Guava:

public Term(double weight, String theTerm){
    this.weight = weight;
    this.theTerm = theTerm;
}


@Override
public boolean equals(final Object obj) {
    if (obj instanceof Term){
        final Term other = (Term) obj;
        return Objects.equal(this.weight, other.weight)
                && Objects.equal(this.theTerm, other.theTerm);
    }
    else {
        return false;
    }
}

@Override
public String toString(){
    return toStringHelper(this).addValue(weight)
            .addValue(theTerm).toString();

}

@Override  
public int hashCode() {  
    return Objects.hashCode(this.weight, this.theTerm);  
}

However, when I run a test to check the size of the list the entries are stored in, I get 3 entries instead of the 1 I am aiming for. I would like any new entry with either the same weight or the same term as a previously added entry to be considered a duplicate.

All help is appreciated!

Matt

Your formatting is very ... erratic. Please use the autoformatter in your IDE before posting, and ensure that your formatting is consistent. Also note that Egyptian brackets are preferred for Java. Finally, if you have a `return` you have no need for an `else`. — Boris the Spider, Nov 02 '16 at 14:18
@BoristheSpider I've never been to Egypt, nor do I import their brackets, but I've been programming for ages now. Kidding aside, I think curly brackets would invoke the desired idea in other's minds before Egyptian ones. :) — Edwin Buck, Nov 02 '16 at 14:21
Egyptian brackets _are_ curly brackets. It's about their placement. — Marko Topolnik, Nov 02 '16 at 14:24
@EdwinBuck curly brackets are the type of brackets (`{}`). [Egyptian brackets](https://blog.codinghorror.com/new-programming-jargon/) (3 in the link) are a style of formatting code blocks, preferred in Java over [K&R C style brackets](https://en.wikipedia.org/wiki/Indent_style#K.26R_style). — Boris the Spider, Nov 02 '16 at 14:24
@BoristheSpider Thank you for the explanation, but this is very new slang. There's also different definitions for the same item. Sounds like someone's trying to smith a word, and according to http://www.dodgycoder.net/2011/11/yoda-conditions-pokemon-exception.html Egyptian brackets are K&R style brackets. I'd say we already have K&R brackets to describe them. — Edwin Buck, Nov 02 '16 at 14:33
@EdwinBuck K&R uses both Egyptian and braces on their own line. It's a mixed style. — Marko Topolnik, Nov 02 '16 at 15:03
@BoristheSpider Thank you for your feedback on my formatting. Belphegor has been kind enough to help make it more readable. — Mathana Sreedaran, Nov 02 '16 at 15:51
Possible duplicate of [What issues should be considered when overriding equals and hashCode in Java?](http://stackoverflow.com/questions/27581/what-issues-should-be-considered-when-overriding-equals-and-hashcode-in-java) — Raedwald, Nov 07 '16 at 07:57

Jon Skeet · Accepted Answer · 2016-11-02T14:22:44.967

11

"I would like any new entry with either the same weight or term as previously added entries to be considered a duplicate."

That's not how equality works. Equality has to be transitive - so if x.equals(y) returns true, and y.equals(z) returns true, then x.equals(z) has to return true.

That's not the case in your desired relation.
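To see why the desired relation cannot serve as equals, here is a minimal sketch using three rows taken from the sample data. The names TransitivityDemo and sameWeightOrTerm are made up for illustration and do not appear in the original code.

```java
// Illustrative only: TransitivityDemo and sameWeightOrTerm are hypothetical names.
class TransitivityDemo {

    // The relation the question asks for: "equal" if EITHER field matches.
    static boolean sameWeightOrTerm(double w1, String t1, double w2, String t2) {
        return Double.compare(w1, w2) == 0 || t1.equals(t2);
    }

    public static void main(String[] args) {
        // x = (99999, "bummer"), y = (999990, "bummer"), z = (999990, "bummerr")
        System.out.println(sameWeightOrTerm(99999, "bummer", 999990, "bummer"));   // true  (same term)
        System.out.println(sameWeightOrTerm(999990, "bummer", 999990, "bummerr")); // true  (same weight)
        System.out.println(sameWeightOrTerm(99999, "bummer", 999990, "bummerr"));  // false (nothing in common)
    }
}
```

Here x matches y and y matches z, yet x does not match z, so the relation is not transitive and would break the equals contract.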

Note that it's also not what your equals method checks at the moment:

return Objects.equal(this.weight, other.weight)
    && Objects.equal(this.theTerm, other.theTerm);

That only returns true if both the weight and the term match, which is normal for an equality relation. That's why you're getting three entries in your set - because when viewed in that way, you do have three distinct entries.

Fundamentally, HashSet and all the other collections dealing with equality won't help you in a simple way. You'll need to have three separate collections:

A set of weights
A set of terms
A set (or list) of entries.

If the entry you're considering has a weight in the set of weights or a term in the set of terms, you should skip it - otherwise, you should add an entry to each of the three collections.
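The three-collection approach described above could be sketched roughly as follows. This is only one possible shape, not the asker's actual code: TermFilter, its nested Term stand-in, and the add/entries methods are hypothetical names introduced for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not from the original post: keeps the first entry seen
// for any weight or term and skips later entries that reuse either value.
class TermFilter {

    // Minimal stand-in for the question's Term class.
    static final class Term {
        final double weight;
        final String theTerm;

        Term(double weight, String theTerm) {
            this.weight = weight;
            this.theTerm = theTerm;
        }
    }

    private final Set<Double> seenWeights = new HashSet<>();
    private final Set<String> seenTerms = new HashSet<>();
    private final List<Term> entries = new ArrayList<>();

    // Returns true if the entry was kept, false if it was skipped.
    boolean add(double weight, String theTerm) {
        if (seenWeights.contains(weight) || seenTerms.contains(theTerm)) {
            return false; // weight or term already used by an earlier entry
        }
        seenWeights.add(weight);
        seenTerms.add(theTerm);
        entries.add(new Term(weight, theTerm));
        return true;
    }

    List<Term> entries() {
        return entries;
    }
}
```

Feeding the six sample rows through add in file order keeps only the first one, (999990, "bummer"). Note that the result depends on input order: because the relation is not transitive, a different ordering of the same rows can keep a different number of entries.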

edited Nov 02 '16 at 14:22

answered Nov 02 '16 at 14:17

Jon Skeet


The set of entries can then become a simple list. – Marko Topolnik Nov 02 '16 at 14:22
@MarkoTopolnik: Indeed. I wasn't sure whether to add that or not... will edit slightly. – Jon Skeet Nov 02 '16 at 14:22
I mean, since there is no custom equality defined on the `Term` anymore, each instance is in its own equality class. Therefore a set is nothing but overhead. – Marko Topolnik Nov 02 '16 at 14:23
@MarkoTopolnik: Well, the OP *may* want to keep the equality anyway, and they *may* want to keep it as a set to emphasize that ordering is irrelevant. – Jon Skeet Nov 02 '16 at 14:24
The static type can be `Collection` to show there are no ordering semantics implied. But... let's not descend into this. – Marko Topolnik Nov 02 '16 at 14:26

Mena · Answer 2 · 2016-11-02T14:17:46.443

6

Considering the implementation of hashCode (and equals) in the Term class, you should expect 3 entries, corresponding to the pairs involved:

999990  bummer
99999   bummer
999990  bummerr

Both hashCode and equals evaluate both properties of the pair, namely the weight double and the theTerm String.

The set decides whether two elements are duplicates by comparing their hash codes and then calling equals. Since equals returns false for every pair among the 3 elements listed above, all three are kept.
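A small self-contained check of this behaviour is sketched below. As an assumption made to keep the example free of dependencies, it uses java.util.Objects rather than Guava, and the class name PairTerm and the helper sampleSet are hypothetical stand-ins for the question's Term class and file-reading loop.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Illustrative stand-in for the question's Term class. Assumption: it uses
// java.util.Objects instead of Guava so the example has no dependencies.
class PairTerm {
    private final double weight;
    private final String theTerm;

    PairTerm(double weight, String theTerm) {
        this.weight = weight;
        this.theTerm = theTerm;
    }

    // Equal only when BOTH fields match, as in the question's equals method.
    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof PairTerm)) {
            return false;
        }
        PairTerm other = (PairTerm) obj;
        return Double.compare(weight, other.weight) == 0
                && Objects.equals(theTerm, other.theTerm);
    }

    // Hash code built from the same two fields that equals compares.
    @Override
    public int hashCode() {
        return Objects.hash(weight, theTerm);
    }

    // Builds a set from the six sample rows in the question.
    static Set<PairTerm> sampleSet() {
        Set<PairTerm> set = new HashSet<>();
        set.add(new PairTerm(999990, "bummer"));
        set.add(new PairTerm(999990, "bummer"));
        set.add(new PairTerm(999990, "bummer"));
        set.add(new PairTerm(999990, "bummer"));
        set.add(new PairTerm(99999, "bummer"));
        set.add(new PairTerm(999990, "bummerr"));
        return set;
    }
}
```

Calling PairTerm.sampleSet().size() gives 3: the four identical rows collapse into one, while the other two rows each differ in one field and so count as distinct.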

edited Nov 02 '16 at 14:17

answered Nov 02 '16 at 14:15

Mena


The question states that the OP still has duplicates. Your answer doesn't explain why that happens. – talex Nov 02 '16 at 14:18
It doesn't answer what the OP is trying to achieve though: "I would like any new entry with either the same weight or term as previously added entries to be considered a duplicate." – Jon Skeet Nov 02 '16 at 14:19
I am now aware that I should expect 3 entries. However, my aim is for 1 unique entry. Perhaps what @JonSkeet suggested below (using separate collections for each) will solve my issue? – Mathana Sreedaran Nov 02 '16 at 15:54
@MathanaSreedaran Jon Skeet's answers are certainly worth checking every time. If I have to add anything here, it is this: think of the criteria you need for the entries to be unique. For instance, the "difference" between `999990` and `99999` or `bummer` and `bummerr`. Since this can hardly be generalized, you might need to implement your own collection type that iterates over the previous entries and checks each term of the pair against the terms of the pair you are trying to add. This may become very inefficient in terms of performance though. – Mena Nov 02 '16 at 16:01
