Is there a data structure that only stores hash codes and not the actual objects?

Question

My use-case is that I'm looking for a data structure in Java that will let me see if an object with the same hash code is inside (by calling contains()), but I will never need to iterate through the elements or retrieve the actual objects. A HashSet is close, but from my understanding, it still contains references to the actual objects, and that would be a waste of memory since I won't ever need the contents of the actual objects. The best option I can think of is a HashSet of type Integer storing only the hash codes, but I'm wondering if there is a built-in data structure that would accomplish the same thing (and only accept one type as opposed to HashSet of type Integer which will accept the hash code of any object).

Is your hash function perfect? Or can you have multiple objects with the same hash value? — arshajii, Mar 15 '19 at 21:17
The `HashSet` will contain a _reference_ to your object, not a _copy_, so don't worry about space. A `HashSet` would probably use up more space because it has references to integers. — Sweeper, Mar 15 '19 at 21:20
I agree with @Sweeper, unless you have a real need for super-duper optimization. Also, your second idea with storing hashcodes as integer wouln't be more efficient as it would store the hash+the hash of the hash. — Joel, Mar 15 '19 at 21:28
@Sweeper The HashSet uses internally a HashMap. The memory space is the same. — Octavian R., Mar 15 '19 at 21:28
I agree also with @Sweeper. A Set has to have a reference to something so it can hash it. I do see what you're getting at. I don't know of a structure that stores only hashes so you can just check for existence of a hash by itself, without associating an object reference with that hash. But it has to be possible. Maybe there's one in some library somewhere. - if memory is of critical importance, at the cost of some lookup speed, the most compact way to store what you're looking for would be a sorted array looked up by a binary search. This would be slow for inserts though. — CryptoFool, Mar 15 '19 at 21:34
@Steve also the reason to keep the associated object is that a hash may not be unique, collisions are possible, in which case the "equals" function is called iirc — Joel, Mar 15 '19 at 21:51
@Joel. Good point for the whole discussion. But I was thinking that he'd already reasoned away collisions, since he must do so if HashSet could work for him. — CryptoFool, Mar 15 '19 at 22:40
I've gone full circle on this very interesting question. I had misread it initially. So if I understand it right, the problem is of keeping a Set so you can ask "Is this object in the set?" without keeping every represented object in memory. The HashSet solution is imperfect if you can have hash collisions, because you are by definition reducing every object down to only its hash. So unless collisiosn are ok, the problem really comes down to how to as concisely as possible represent the objects uniquely with no collisions. That hasn't yet been addressed. That seems domain specific — CryptoFool, Mar 15 '19 at 22:55
Once you have a truly unique identifier/hash for each object, then we can all quibble over if we can save 4-16 bytes by not storing a reference (4 or 8 bytes) to an extra Integer (4 bytes) object for each entry in the Set. That's all I thought the question was about initially, and is still worth discussing. — CryptoFool, Mar 15 '19 at 23:07
Do you care of search performance? If not than just list of hash values in array/list would give you good memory usage if hashes to sparse for complete bit set (giving you potentially insanely slow O(n) `Contains` vs. using HashSet which has O(1) lookup) — Alexei Levenkov, Mar 16 '19 at 01:59

Andy Thomas · Answer 1 · 2019-03-17T23:25:38.030

12

A Bloom filter can tell whether an object might be a member, or is definitely not a member. You can control the likelihood of false positives. Each hash value maps to a single bit.

The Guava library provides an implementation in Java.

edited Mar 17 '19 at 23:25

answered Mar 15 '19 at 21:34

Andy Thomas

84,978
11
107
151

Nice. This seems like the solution for very low storage overhead. But you have to worry about the false negative case. If you can statistically eliminate that, this is great! – CryptoFool Mar 15 '19 at 21:45
1

False positives, but you can control their probability. Another disadvantage is that you can't remove elements. – Andy Thomas Mar 15 '19 at 21:52
The question was for a data structure that checks using only the predefined `hashCode()`, which can potentially have 2^31 values (counting only positives). A Bloom filter that uses one hash function with 2 ^ 31 possible values would be extraordinarily large, seeing as it is basically just a BitSet. I don't see how that counts as "very low storage overhead". – Leo Aso Mar 15 '19 at 22:02
@LeoAso - The relationship between the set of hash values and the set of bits is not necessary one-to-one. – Andy Thomas Mar 17 '19 at 14:24

score 2 · Answer 2 · answered Mar 15 '19 at 22:22

2

You could use a primitive collection implementation like IntSet to store values of hash codes. Obviously as others have mentioned this assumes collisions aren't a problem.

answered Mar 15 '19 at 22:22

Mark

28,783
8
63
92

LuCio · Answer 3 · 2019-03-15T22:29:00.607

1

If you want to track if a hash code is already present and to do it memory efficient a BitSet may suite your requirements.

Look at the following example:

  public static void main(String[] args) {
    BitSet hashCodes = new BitSet();
    hashCodes.set("1".hashCode());

    System.out.println(hashCodes.get("1".hashCode())); // true
    System.out.println(hashCodes.get("2".hashCode())); // false
  }

The BitSet "implements a vector of bits that grows as needed.". It's a JDK "built-in data structure" which doesn't contain "references to the actual objects". It stores only if "the same hash code is inside".

EDIT:
As @Steve mentioned in his comment the implementation of the BitSet isn't the most memory efficient one. But there are more memory efficient implementations of a bit set - though not built-in.

edited Mar 15 '19 at 22:29

answered Mar 15 '19 at 21:38

LuCio

5,055
2
18
34

I don't know how a BitSet stores individual bits. Since obviously this usage will spread bits across a very large input domain, are you sure those are stored efficiently? Just asking. The naive assumption is that the structure would be an array of bytes, where the array was just expanded to include any bit position and all positions before that, which would be monstrously inefficient. But I don't know how it actually represents bits spread way apart. – CryptoFool Mar 15 '19 at 22:04
It appears that your solution won't work. See https://github.com/brettwooldridge/SparseBitSet – CryptoFool Mar 15 '19 at 22:07
@Steve You're right. Found additionally this [post](https://stackoverflow.com/a/16036285/2838289). But the idea of an bit set is basically not bad. It's rather the implementation of the JDK `BitSet`. – LuCio Mar 15 '19 at 22:12
It would be pretty efficient when about half of the possible values used... (unlikely in practice but still...) – Alexei Levenkov Mar 16 '19 at 02:02

score -1 · Answer 4 · answered Mar 15 '19 at 21:40

There is no such built-in data structure, because such a data structure is rarely needed. It's easy to build one, though.

public class HashCodeSet<T> {

    private final HashSet<Integer> hashCodes;        

    public MyHashSet() {
        hashCodes = new HashSet<>();
    }         

    public MyHashSet(int initialCapacity) {
        hashCodes = new HashSet<>(initialCapacity);
    }         

    public HashCodeSet(HashCodeSet toCopy) {
        hashCodes = new HashSet<>(toCopy.hashCodes);
    } 

    public void add(T element) {
       hashCodes.add(element.hashCode());
    }

    public boolean containsHashCodeOf(T element) {
       return hashCodes.contains(element.hashCode());
    }        

    @Override
    public boolean equals(o: Object) {
        return o == this || o instanceof HashCodeSet && 
                ((HashCodeSet) o).hashCodes.equals(hashCodes);
    }        

    @Override
    public int hashCode() {
        return hashCodes.hashCode(); // hash-ception
    } 

    @Override
    public String toString() {
        return hashCodes.toString();
    }
}

I think the OP's question wasn't about API, but about memory usage. This doesn't help with that since the result still acts like a HashSet. — CryptoFool, Mar 15 '19 at 21:42
I realize that I was unfair completely on this. I've removed my objections. Feel free to delete your comments on my comments. I still think there's something more to the problem, but I was off base. Sorry — CryptoFool, Mar 15 '19 at 22:28

Is there a data structure that only stores hash codes and not the actual objects?

4 Answers4