1

I need to perform a check if the combination of a long value and an integer value were already seen before in a very performance-critical part of an application. Both values can become quite large, at least the long will use more than MAX_INT values in some cases.

Currently I have a very simple implementation using a Set<Pair<Integer, Long>>, however this will require too many allocations, because even when the object is already in the set, something like seen.add(Pair.of(i, l)) to add/check existence would allocate the Pair for each call.

Is there a better way in Java (without libraries like Guava, Trove or Apache Commons), to do this check with minimal allocations and in good O(?)?

Two ints would be easy because I could combine them into one long in the Set, but the long cannot be avoided here.

Any suggestions?

centic
  • 15,565
  • 9
  • 68
  • 125
  • *Suggestion:* Create a class with 2 fields, an `int` and a `long`. Implement `equals` and `hashCode`. – Andreas Sep 27 '18 at 20:10
  • "Better way" tends to be dependent on context, on your particular problem. I would look at making my own set that could store long-long values directly, with out the need to create a `Pair`. But that's just a guess without knowing anything about your performance characteristics. – markspace Sep 27 '18 at 20:12
  • @markspace or a way to generate unique number from two other numbers – Eugene Sep 27 '18 at 20:51
  • @Eugene but that's clearly impossible for a large enough set of input numbers. You can't generate a unique `int` for every possible pair of `long`s. So if we knew something about the size of the set needed (a few dozen entries? A few million?) we'd be in a better position to make a recommendation. – markspace Sep 27 '18 at 21:53
  • The longs can become larger than MAX_INT, so a simple combination would not work. – centic Sep 28 '18 at 05:13
  • To reduce number of allocations you can use the same `Pair` everytime using a single instance pattern (`Singleton`) or if you're on a multithreaded paradigm you can use a pool of `Pair`s. This will reduce allocations to their (mathematical) minimum. – simo-r Sep 28 '18 at 06:43

3 Answers3

1

How about creating a class that holds two primitives instead? You would drop at least 24 bytes just for the headers of Integer and Long in a 64 bit JVM.

Under this conditions you are looking for a Pairing Function, or generate an unique number from 2 numbers. That wikipeia page has a very good example (and simple) of one such possibility.

Eugene
  • 117,005
  • 15
  • 201
  • 306
  • Yes, it would reduce memory a bit, but would still require allocation just for the check for existence. – centic Sep 27 '18 at 20:13
  • 1
    @centic how about maintaining two `Set`(s) than? – Eugene Sep 27 '18 at 20:15
  • I need to make the combination of both unique, not each of them. – centic Sep 27 '18 at 20:16
  • @centic it seems you are looking for a [pairing function](https://en.wikipedia.org/wiki/Pairing_function) and one is provided there that is fairly trivial to apply to your example – Eugene Sep 27 '18 at 20:33
  • @Eugene that wouldn't work out so well because the largest primitive type holds only 64 bits, yet an `int` and a `long` together is 96 bits. Also, bit shifting is a much faster pairing function when all or all but one naturals are bounded and you know the maximum number of bits needed for each bounded natural number. – Chai T. Rex Sep 28 '18 at 06:37
  • @ChaiT.Rex I understand that, I was really hoping that those values would be bounded to some lesser value, too bad it's not – Eugene Sep 28 '18 at 08:29
1

Here are two possibilities.

One thing in both of the following suggestions is to store a bunch of pairs together as triple ints in an int[]. The first int would be the int and the next two ints would be the upper and lower half of the long.

If you didn't mind a 33% extra space disadvantage in exchange for an addressing speed advantage, you could use a long[] instead and store the int and long in separate indexes.

You'd never call an equals method. You'd just compare the three ints with three other ints, which would be very fast. You'd never call a compareTo method. You'd just do a custom lexicographic comparison of the three ints, which would be very fast.

B* tree

If memory usage is the ultimate concern, you can make a B* tree using an int[][] or an ArrayList<int[]>. B* trees are relatively quick and fairly compact.

There are also other types of B-trees that might be more appropriate to your particular use case.

Custom hash set

You can also implement a custom hash set with a custom, fast-calculated hash function (perhaps XOR the int and the upper and lower halves of the long together, which will be very fast) rather than relying on the hashCode method.

You'd have to figure out how to implement the int[] buckets to best suit the performance of your application. For example, how do you want to convert your custom hash code into a bucket number? Do you want to rebucket everything when the buckets start getting too many elements? And so on.

Chai T. Rex
  • 2,972
  • 1
  • 15
  • 33
  • in java an `int[]` is still an Object, you still pay for the headers (at least) in has; it also has an additional header that keeps the size – Eugene Sep 28 '18 at 08:28
  • Right. But you have to store multiple values anyway, and the only way to do that is with objects. Arrays are some of the most compact representations. – Chai T. Rex Sep 28 '18 at 08:30
  • that *is* the entire point, the OP does not (or can't) allocate those objects and I would not say that they are the *most* compact without measuring, which I will shortly do. – Eugene Sep 28 '18 at 08:31
  • Take care that, when measuring, you measure many, many `int`s stored, not just three (so, for example, `new int[300]` vs 100 objects). Storing them in bulk is the whole reason it's more space-efficient (and cache efficient as well, as they're laid out right next to each other). – Chai T. Rex Sep 28 '18 at 08:34
  • a `Holder` having an `int` and a `long` has `24 bytes` on a 64 bit JVM, an `int[]` holding 3 `int` has `32 bytes` – Eugene Sep 28 '18 at 08:36
  • you are right in general, I agree, but not in the OP case, where he actually needs a separate Object/array to get the unique part. – Eugene Sep 28 '18 at 08:37
  • You don't actually need that if you're making your own custom collection (which, unlike Java's collections, don't require objects if you don't make them require objects). You can add a method `boolean existsInCollection(int theInt, long theLong)`, and compare directly to the arrays. – Chai T. Rex Sep 28 '18 at 08:41
  • you probably have some implementation in your mind that I don't know about, neither can I read minds... even a scatch of that placed into the answer might be helpful... – Eugene Sep 28 '18 at 08:43
  • It's in the paragraph above that starts with "You'd never call an `equals` method." – Chai T. Rex Sep 28 '18 at 08:44
  • even if this is interesting, it's a bit lengthy. You don't need a custom collection, `Trove` already has collection of primitives, they might be even coming to java in the nearest future – Eugene Sep 28 '18 at 08:45
  • `Trove` is not a language, it's a library for java – Eugene Sep 28 '18 at 08:47
  • :) well if we got so far, I would suggest buying more memory; it would be cheaper. – Eugene Sep 28 '18 at 08:49
0

How about

class Pair {
    int v1;
    long v2;

    @Override
    public boolean equals(Object o) {
        return v1 == ((Pair) o).v1 && v2 == ((Pair) o).v2;
    }

    @Override
    public int hashCode() {
        return 31 * (31 + Integer.hashCode(v1)) + Long.hashCode(v2);
    }
}

class Store {
    // initial capacity should be tweaked
    private static final Set<Pair> store = new HashSet<>(100*1024);
    private static final ThreadLocal<Pair> threadPairUsedForContains = new ThreadLocal<>();

    void init() { // each thread has to call init() first
        threadPairUsedForContains.set(new Pair());
    }

    boolean contains(int v1, long v2) { // zero allocation contains()
        Pair pair = threadPairUsedForContains.get();
        pair.v1 = v1;
        pair.v2 = v2;
        return store.contains(pair);
    }

    void add(int v1, long v2) {
        Pair pair = new Pair();
        pair.v1 = v1;
        pair.v2 = v2;
        store.add(pair);
    }
}
Adam Siemion
  • 15,569
  • 7
  • 58
  • 92