What is an O(1)-search memory-efficient data structure to store pairs of integers?

Question

Consider this interface:

public interface CoordinateSet {
    boolean contains(int x, int y);
    default boolean contains(Coordinate coord) {
        return contains(coord.x, coord.y);
    }
}

It represents a set of 2-dimensional integer coordinates, and each possible coordinate may be either inside the set (contains returns true) or outside (contains returns false).

There are many ways to implement such an interface. The most computationally efficient one would be an implementation backed by an array:

public class ArrayCoordinateSet implements CoordinateSet {
    private final boolean[][] coords = new boolean[SIZE][SIZE];
    // ...
    @Override
    public boolean contains(int x, int y) {
        return coords[x][y];
    }
    public void add(int x, int y) {
        coords[x][y] = true;
    }
    // ...

}

However, if SIZE is large, say 1000, and only 4 coordinates belong to the set, one in each corner of a 1000×1000 rectangle, then the vast majority of the space is consumed by false values. For such a sparse CoordinateSet we'd be better off with a HashSet-based CoordinateSet:

public final class Coordinate {
    public final int x;
    public final int y;
    public Coordinate(int x, int y) {
        this.x = x;
        this.y = y;
    }
    // .equals() and hashCode()
}
public class HashBasedCoordinateSet implements CoordinateSet {
    private final Set<Coordinate> coords = new HashSet<>();
    @Override
    public boolean contains(int x, int y) {
        return coords.contains(new Coordinate(x, y));
    }
    @Override
    public boolean contains(Coordinate coord) {
        return coords.contains(coord);
    }
    public void add(Coordinate coord) {
        coords.add(coord);
    }
}

However, HashBasedCoordinateSet has the following issue:

for (int x=0; x<1000; x++) {
  for (int y=0; y<1000; y++) {
    hashBasedCoordinateSet.contains(x, y);
  }
}

When we have values x and y and want to check whether hashBasedCoordinateSet.contains(x, y), each method call has to create a new object (we always need an object to search a HashSet; having the object's data alone is not enough). That would be a real waste of CPU time: all those Coordinate objects have to be created and then garbage-collected, since seemingly no escape-analysis optimisation can be performed on this code.

So finally, my question is:

What would be the data structure to store a sparse set of coordinates that:

Has O(1) contains(int x, int y) operation;
Efficiently uses space (unlike the array-based implementation);
Does not have to create extra objects during contains(int x, int y)?

I doubt a native collection exists that satisfies your final constraint. You could avoid the intense object creation by passing in a mutable `Coordinate` object instead (although mutable objects have their own problems...) — Oliver Charlesworth, Oct 26 '14 at 12:55
@OliverCharlesworth I'm not asking for a JRE collection. A link to the description of the data structure to implement myself would be perfectly fine. — gvlasov, Oct 26 '14 at 12:58
You seem to be conflating abstract data structures with implementation details. The abstract data structure you want is a hash-map/set. — Oliver Charlesworth, Oct 26 '14 at 12:58
Have you measured and proven that a HashSet based solution was too slow? Creating short-lived objects is extremely fast. — JB Nizet, Oct 26 '14 at 13:01
Inherently, anything other than the array-based scheme will be O(log N) at best. And even the array-based scheme can be argued to be O(log N) when viewed as a circuit design problem, rather than assuming an existing computer with its built-in overhead due to a hard-wired max on N. – Hot Licks, Oct 26 '14 at 13:18
@HotLicks: Are you talking about the fact that you need logarithmic depth in the address demultiplexers? Whilst true, I'm not sure that's a particularly helpful way to view the problem from a software/algorithm point-of-view (as it basically means there's no such thing as an O(1) algorithm). — Oliver Charlesworth, Oct 26 '14 at 15:18
@OliverCharlesworth - You need logarithmic depth for the address demux, and you also have (at least) a logarithmic effect in the capacitance of line drivers and associated delays. — Hot Licks, Oct 26 '14 at 19:50

score 2 · Accepted Answer · answered Oct 26 '14 at 13:32

A long is twice the size of an integer in Java, so one can store two ints in one long. So how about this?

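import java.util.HashSet;

public class CoordinateSet {
    private final HashSet<Long> coordinates = new HashSet<>();

    // y occupies the high 32 bits, x the low 32 bits; the mask keeps a
    // negative x from sign-extending over the bits that hold y.
    private static long pack(int x, int y) {
        return ((long) y << 32) | (x & 0xFFFFFFFFL);
    }

    public void add(int x, int y) {
        coordinates.add(pack(x, y));
    }

    public boolean contains(int x, int y) {
        return coordinates.contains(pack(x, y));
    }
}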

I am pretty sure the long computed in the contains method lives on the stack, although HashSet<Long> still autoboxes it to a Long for the lookup.
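To show how the packing idea can meet all three constraints from the question, here is a minimal sketch that stores the packed longs in a hand-rolled open-addressing table instead of a HashSet<Long>, so contains allocates nothing. The class name PackedCoordinateSet, the EMPTY sentinel, and the multiplicative hash in slot are my own illustrative choices, not anything from the JRE or from a library such as Trove. Note that the low half has to be masked when packing: a negative x widened to long without a mask sign-extends and overwrites the bits that hold y.

```java
import java.util.Arrays;

// Hypothetical sketch: (x, y) pairs packed into primitive longs and kept in a
// linear-probing hash table. No objects are created by contains().
final class PackedCoordinateSet {
    // Marks a free slot. It is also the packed form of (0, Integer.MIN_VALUE),
    // so that single pair is tracked by the separate flag below.
    private static final long EMPTY = Long.MIN_VALUE;

    private long[] table = newTable(16); // capacity is always a power of two
    private int size;                    // number of keys stored in the table
    private boolean hasEmptyKey;

    // y goes in the high 32 bits, x in the low 32 bits.
    static long pack(int x, int y) {
        return ((long) y << 32) | (x & 0xFFFFFFFFL);
    }

    void add(int x, int y) {
        long key = pack(x, y);
        if (key == EMPTY) { hasEmptyKey = true; return; }
        if ((size + 1) * 2 > table.length) grow(); // keep the table at most half full
        if (insert(table, key)) size++;
    }

    boolean contains(int x, int y) {
        long key = pack(x, y);
        if (key == EMPTY) return hasEmptyKey;
        int mask = table.length - 1;
        // Probe consecutive slots until the key or a free slot is found.
        for (int i = slot(key, mask); table[i] != EMPTY; i = (i + 1) & mask) {
            if (table[i] == key) return true;
        }
        return false;
    }

    // Returns true if the key was newly inserted, false if already present.
    private static boolean insert(long[] t, long key) {
        int mask = t.length - 1;
        int i = slot(key, mask);
        while (t[i] != EMPTY) {
            if (t[i] == key) return false;
            i = (i + 1) & mask;
        }
        t[i] = key;
        return true;
    }

    private void grow() {
        long[] bigger = newTable(table.length * 2);
        for (long key : table) {
            if (key != EMPTY) insert(bigger, key);
        }
        table = bigger;
    }

    private static long[] newTable(int capacity) {
        long[] t = new long[capacity];
        Arrays.fill(t, EMPTY);
        return t;
    }

    // Multiplicative hash; the upper bits of the product are the best mixed.
    private static int slot(long key, int mask) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) (h >>> 32) & mask;
    }
}
```

Lookups are expected O(1) under the usual hashing assumptions, and memory grows with the number of stored pairs rather than with the coordinate range. Removal is left out; linear probing would need tombstones or re-insertion to support it.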

Lodewijk Bogaards


This seems to be the best answer so far. Especially when you use `TLongHashSet` from Trove instead of `HashSet`. – gvlasov Oct 26 '14 at 13:44
Yes, the combination is golden, since even unboxing can be prevented. – Lodewijk Bogaards Oct 26 '14 at 13:45
Be careful of sign extension. – Oliver Charlesworth Oct 26 '14 at 15:18
Yes it is: http://stackoverflow.com/questions/1055243/is-a-java-hashmap-really-o1 – Lodewijk Bogaards Oct 26 '14 at 16:34

Joey · Answer 2 · 2014-10-26T13:18:20.913

Optimizing without measuring is of course always dangerous. You probably should profile your app to see if that is really a bottleneck.

You also present two use cases:

Find a single coordinate in a set
Find all coordinates that are part of the set within given bounds

The second use case could be much more efficient if you walk the set's iterator and filter out the coordinates you don't want. This might return the data in arbitrary order, and the performance depends greatly on how large the dataset is.
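The iterator-walking idea above can be sketched roughly as follows. The names BoundsFilter, Point, and within are hypothetical; Point is only a stand-in for the question's Coordinate class, and the bounds are assumed to be half-open (minimum inclusive, maximum exclusive).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: visit each stored point once and keep those in bounds,
// instead of probing every (x, y) cell of the rectangle.
final class BoundsFilter {
    // Stand-in for the question's Coordinate class.
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Returns the stored points with minX <= x < maxX and minY <= y < maxY,
    // in whatever order the underlying collection iterates.
    static List<Point> within(Iterable<Point> stored,
                              int minX, int minY, int maxX, int maxY) {
        List<Point> hits = new ArrayList<>();
        for (Point p : stored) {
            if (p.x >= minX && p.x < maxX && p.y >= minY && p.y < maxY) {
                hits.add(p);
            }
        }
        return hits;
    }
}
```

The cost is proportional to the number of stored points rather than to the area being scanned, so for 4 points in a 1000×1000 region this does 4 checks instead of a million lookups.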

Maybe a simple table data structure, like the Table provided by Guava, could give you a much nicer interface, indexing the X and Y coordinates as ints, while at the same time giving you O(1) access.

Table<Integer, Integer, Coordinate> index = HashBasedTable.create();

Another suggestion is to look into locality-sensitive hashing. You basically create a new hash function that maps your X-Y coordinates into a common one-dimensional space that is easy to query. But this might be beyond the scope of the question.

Note: traversing with the iterator will give you the elements in an arbitrary order, which may not be particularly useful. — Oliver Charlesworth, Oct 26 '14 at 13:02
Boxed integers would definitely be something to optimize. Are there any `IntIntTable` implementations? — gvlasov, Oct 26 '14 at 13:07
An `IntIntBooleanTable` would be exactly what I want, I guess. — gvlasov, Oct 26 '14 at 13:14

score 1 · Answer 3 · answered Oct 26 '14 at 13:36

If you want to have an O(1) data structure, you need to have a lookup mechanism which is independent of the actual values you want to store in the data structure. The only way to do this is to enumerate your values and derive a formula to calculate the enumeration value of the pair you have, and then have an array of yes/no values, one for each enumeration value.

For instance, if you have that x is guaranteed to be between 0 and 79 and y is guaranteed to be between 0 and 24, you can use the enumeration formula y*80+x, which for the pair (10,10) would be 810. Then look up in the very large array of yes/no values if the value stored for 810 is a yes.

So, if you insist on having an O(1) algorithm, you need the space to hold the yes/no values.
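As a rough illustration of the enumeration approach, the sketch below uses java.util.BitSet for the yes/no values, so each cell costs about one bit. The class name DenseCoordinateSet and its width/height parameters are assumptions for the example; with width 80 the index is the same y*80+x formula as above.

```java
import java.util.BitSet;

// Hypothetical sketch: one bit per possible coordinate, addressed by y * width + x.
final class DenseCoordinateSet {
    private final int width;
    private final int height;
    private final BitSet bits;

    DenseCoordinateSet(int width, int height) {
        this.width = width;
        this.height = height;
        this.bits = new BitSet(width * height);
    }

    // Maps (x, y) to its enumeration value; rejects pairs outside the grid so
    // that two different pairs can never share an index.
    private int index(int x, int y) {
        if (x < 0 || x >= width || y < 0 || y >= height) {
            throw new IndexOutOfBoundsException("(" + x + ", " + y + ")");
        }
        return y * width + x;
    }

    void add(int x, int y) {
        bits.set(index(x, y));
    }

    boolean contains(int x, int y) {
        return bits.get(index(x, y));
    }
}
```

For an 80×25 grid that is 2000 bits. The trade-off is unchanged from the boolean array in the question: space is proportional to the coordinate range, not to the number of stored pairs.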

gatkin · Answer 4 · 2014-10-26T14:02:49.487

You could try a binary tree, using the bits that make up the values of x and y as the key. For example, if x and y are 32-bit integers, the total depth of the tree is 64. So you loop through the bits of x and y, making at most 64 decisions to arrive at a contains/not-contains answer.

Update in response to comments: Granted, trees aren't what you normally think of if you want O(1), but keep in mind the array-based approach in the original question is only O(1) up to an implementation limit on available memory. All I'm doing is assuming the bit length of an integer is a fixed implementation constraint, which is generally a safe assumption. Put another way, if you really want the contains() call to run in constant time, you could code it to always do 64 comparison operations and then return.

Admittedly, a CS professor probably wouldn't buy that argument. Ever since we got rid of the homework tag I've had trouble knowing whether someone wants a real-world answer or a theoretical CS answer.
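A minimal sketch of the bitwise tree described above, assuming x supplies the upper 32 key bits and y the lower 32. The names BitTrieCoordinateSet and Node are hypothetical. Each lookup follows at most 64 child links and allocates nothing.

```java
// Hypothetical sketch: a binary trie keyed by the 64 bits of (x, y).
final class BitTrieCoordinateSet {
    private static final class Node {
        Node zero; // child taken when the current key bit is 0
        Node one;  // child taken when the current key bit is 1
    }

    private final Node root = new Node();

    // x in the high 32 bits, y in the low 32 bits (masked against sign extension).
    private static long key(int x, int y) {
        return ((long) x << 32) | (y & 0xFFFFFFFFL);
    }

    void add(int x, int y) {
        long key = key(x, y);
        Node node = root;
        // Walk from the most significant bit down, creating missing nodes.
        for (int bit = 63; bit >= 0; bit--) {
            if (((key >>> bit) & 1L) == 0) {
                if (node.zero == null) node.zero = new Node();
                node = node.zero;
            } else {
                if (node.one == null) node.one = new Node();
                node = node.one;
            }
        }
    }

    boolean contains(int x, int y) {
        long key = key(x, y);
        Node node = root;
        // Stop early as soon as the path runs out.
        for (int bit = 63; bit >= 0 && node != null; bit--) {
            node = ((key >>> bit) & 1L) == 0 ? node.zero : node.one;
        }
        // add() always builds full-depth paths, so surviving all 64 steps
        // means the pair was stored.
        return node != null;
    }
}
```

An isolated coordinate can cost up to 64 nodes, so this plain version is not compact; sharing prefixes helps only when stored coordinates are close together in key order.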

binary tree is not O(1) – Lodewijk Bogaards Oct 26 '14 at 13:37
binary trees usually go in O(log(n)) (if properly balanced) – Thorbjørn Ravn Andersen Oct 26 '14 at 13:38
