I am attempting to use a HashSet to make sure data I read in from a .txt file are unique.

Below is the sample data:

999990  bummer
999990  bummer
999990  bummer
999990  bummer
99999   bummer
999990  bummerr

This is read in using java.io.File and java.util.Scanner, and each line is stored as a Term object, as follows.

Reading in terms:

while (rawTerms.hasNextLine()){
    String[] tokens = rawTerms.nextLine().trim().split(delimiter);
    if (tokens.length == 2) {               
        uniqueSet.add(new Term(Double.parseDouble(tokens[0]), tokens[1])); //add the term to set
    }
    else {
      rawTerms.close();
      throw new Exception("Invalid member length: "+ tokens.length);
    }           
}

allTerms = new ArrayList<>(uniqueSet); // Convert the set into an ArrayList

Term class using Guava:

public Term(double weight, String theTerm){
    this.weight = weight;
    this.theTerm = theTerm;
}


@Override
public boolean equals(final Object obj) {
    if (obj instanceof Term){
        final Term other = (Term) obj;
        return Objects.equal(this.weight, other.weight)
                && Objects.equal(this.theTerm, other.theTerm);
    }
    else {
        return false;
    }
}

@Override
public String toString(){
    return toStringHelper(this).addValue(weight)
            .addValue(theTerm).toString();

}

@Override  
public int hashCode() {  
    return Objects.hashCode(this.weight, this.theTerm);  
}

However, when I run a test to check the size of the list the entries are stored in, I get 3 entries instead of the 1 I am aiming for. I would like any new entry with either the same weight or the same term as a previously added entry to be considered a duplicate.

All help is appreciated!

Matt

Belphegor
  • What is `uniqueSet`? – talex Nov 02 '16 at 14:16
  • Your formatting is very ... erratic. Please use the autoformat in your IDE before posting, and ensure that your formatting is consistent. Also note that Egyptian brackets are preferred for Java. Finally, if you have a `return` you have no need for an `else`. – Boris the Spider Nov 02 '16 at 14:18
  • @BoristheSpider I've never been to Egypt, nor do I import their brackets, but I've been programming for ages now. Kidding aside, I think curly brackets would invoke the desired idea in others' minds before Egyptian ones. :) – Edwin Buck Nov 02 '16 at 14:21
  • Egyptian brackets _are_ curly brackets. It's about their placement. – Marko Topolnik Nov 02 '16 at 14:24
  • @EdwinBuck curly brackets are the type of brackets (`{}`). [Egyptian brackets](https://blog.codinghorror.com/new-programming-jargon/) (3 in the link) are a style of formatting code blocks, preferred in Java over [K&R C style brackets](https://en.wikipedia.org/wiki/Indent_style#K.26R_style). – Boris the Spider Nov 02 '16 at 14:24
  • @BoristheSpider Thank you for the explanation, but this is very new slang. There are also different definitions for the same item. Sounds like someone's trying to smith a word, and according to http://www.dodgycoder.net/2011/11/yoda-conditions-pokemon-exception.html Egyptian brackets are K&R style brackets. I'd say we already have K&R brackets to describe them. – Edwin Buck Nov 02 '16 at 14:33
  • @EdwinBuck K&R uses both Egyptian and braces on their own line. It's a mixed style. – Marko Topolnik Nov 02 '16 at 15:03
  • @BoristheSpider Thank you for your feedback on my formatting. Belphegor has been kind enough to help make it more readable. – Mathana Sreedaran Nov 02 '16 at 15:51
  • @talex uniqueSet is the name I have given to my HashSet – Mathana Sreedaran Nov 02 '16 at 15:52
  • Possible duplicate of [What issues should be considered when overriding equals and hashCode in Java?](http://stackoverflow.com/questions/27581/what-issues-should-be-considered-when-overriding-equals-and-hashcode-in-java) – Raedwald Nov 07 '16 at 07:57

2 Answers

> I would like any new entry with either the same weight or term as previously added entries to be considered a duplicate.

That's not how equality works. Equality has to be transitive - so if x.equals(y) returns true, and y.equals(z) returns true, then x.equals(z) has to return true.

That's not the case in your desired relation.
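As a small illustration of why the desired relation is not transitive, here is a sketch with three made-up entries. The class and method names (`EitherMatchDemo`, `eitherMatches`) are hypothetical and not part of the question's code.

```java
import java.util.Objects;

// Hypothetical demo class; not part of the question's code.
final class EitherMatchDemo {
    // The "same weight OR same term" relation the question asks for.
    static boolean eitherMatches(double w1, String t1, double w2, String t2) {
        return w1 == w2 || Objects.equals(t1, t2);
    }

    public static void main(String[] args) {
        // x = (1.0, "a"), y = (1.0, "b"), z = (2.0, "b")
        System.out.println(eitherMatches(1.0, "a", 1.0, "b")); // x vs y: same weight -> true
        System.out.println(eitherMatches(1.0, "b", 2.0, "b")); // y vs z: same term -> true
        System.out.println(eitherMatches(1.0, "a", 2.0, "b")); // x vs z: nothing shared -> false
    }
}
```

Here x matches y and y matches z, yet x does not match z, so this relation cannot serve as an `equals` implementation.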

Note that it's also not what your equals method checks at the moment:

return Objects.equal(this.weight, other.weight)
    && Objects.equal(this.theTerm, other.theTerm);

That only returns true if both the weight and the term match, which is normal for an equality relation. That's why you're getting three entries in your set - because when viewed in that way, you do have three distinct entries.

Fundamentally, HashSet and all the other collections dealing with equality won't help you in a simple way. You'll need to have three separate collections:

  • A set of weights
  • A set of terms
  • A set (or list) of entries.

If the entry you're considering has a weight in the set of weights or a term in the set of terms, you should skip it - otherwise, you should add an entry to each of the three collections.
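A minimal sketch of that three-collection approach might look like the following. The `TermFilter` class and its `offer` method are hypothetical names introduced for illustration, and each kept entry is stored as a plain `{weight, term}` pair rather than the question's `Term` class so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; the names here are illustrative, not from the question.
final class TermFilter {
    private final Set<Double> seenWeights = new HashSet<>();
    private final Set<String> seenTerms = new HashSet<>();
    // Each kept entry is stored as {weight, term}; a real program would store Term objects.
    private final List<Object[]> kept = new ArrayList<>();

    // Returns true if the entry was kept, false if it was skipped as a duplicate.
    boolean offer(double weight, String term) {
        if (seenWeights.contains(weight) || seenTerms.contains(term)) {
            return false; // weight or term already seen: skip the entry
        }
        seenWeights.add(weight);
        seenTerms.add(term);
        kept.add(new Object[] {weight, term});
        return true;
    }

    int size() {
        return kept.size();
    }

    public static void main(String[] args) {
        TermFilter filter = new TermFilter();
        // The six sample lines from the question.
        filter.offer(999990, "bummer");
        filter.offer(999990, "bummer");
        filter.offer(999990, "bummer");
        filter.offer(999990, "bummer");
        filter.offer(99999, "bummer");
        filter.offer(999990, "bummerr");
        System.out.println(filter.size()); // prints 1
    }
}
```

Note that the result depends on input order: a skipped entry does not register its weight or term, so only values from kept entries block later ones.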

Jon Skeet

Considering the implementation of hashCode (and equals) in the Term class, you should expect 3 entries, corresponding to the pairs involved:

999990  bummer
99999   bummer
999990  bummerr

Both hashCode and equals evaluate both properties of the pair, namely the weight double and the theTerm String.

The set uses the hash code to pick a bucket and then calls equals to confirm a match. The 3 elements listed above are not equal to each other under that equals method, so all 3 are kept.
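To see this behaviour in isolation, here is a sketch that adds the six sample lines to a HashSet. `SimpleTerm` is a hypothetical stand-in for the question's `Term` class, and it assumes `java.util.Objects` in place of Guava's `Objects` so that it runs without extra dependencies.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Stand-in for the question's Term, using java.util.Objects instead of Guava.
final class SimpleTerm {
    final double weight;
    final String theTerm;

    SimpleTerm(double weight, String theTerm) {
        this.weight = weight;
        this.theTerm = theTerm;
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof SimpleTerm)) {
            return false;
        }
        SimpleTerm other = (SimpleTerm) obj;
        // Equal only when BOTH fields match.
        return Double.compare(weight, other.weight) == 0
                && Objects.equals(theTerm, other.theTerm);
    }

    @Override
    public int hashCode() {
        return Objects.hash(weight, theTerm);
    }
}

final class SetSizeDemo {
    // Adds the question's six sample lines and returns how many the set keeps.
    static int distinctCount() {
        Set<SimpleTerm> set = new HashSet<>();
        set.add(new SimpleTerm(999990, "bummer"));
        set.add(new SimpleTerm(999990, "bummer"));
        set.add(new SimpleTerm(999990, "bummer"));
        set.add(new SimpleTerm(999990, "bummer"));
        set.add(new SimpleTerm(99999, "bummer"));
        set.add(new SimpleTerm(999990, "bummerr"));
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctCount()); // prints 3
    }
}
```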

Mena
  • The question states that the OP has duplicates. Your answer doesn't explain why that happens. – talex Nov 02 '16 at 14:18
  • It doesn't answer what the OP is trying to achieve though: "I would like any new entry with either the same weight or term as previously added entries to be considered a duplicate." – Jon Skeet Nov 02 '16 at 14:19
  • I am now aware that I should expect 3 entries. However, my aim is for 1 unique entry. Perhaps what @JonSkeet suggested below (using separate collections for each) will solve my issue? – Mathana Sreedaran Nov 02 '16 at 15:54
  • @MathanaSreedaran Jon Skeet's answers are certainly worth checking every time. If I had to add anything here, it would be: think of the criteria you need for the entries to be unique. For instance, the "difference" between `999990` and `99999` or `bummer` and `bummerr`. Since this can hardly be generalized, you might need to implement your own collection type that iterates over the previous entries and checks each term of the pair against the terms of the pair you are trying to add. This may become very inefficient in terms of performance though. – Mena Nov 02 '16 at 16:01