3

I need a space efficient collection to store a large list of primitive int(s)(with around 800,000 ints), that allows for fast operations for contains() & allows for iteration in a defined order.

Faster contains() operations to check whether an int is there in the list or not, is main priority as that is done very frequently.


I'm open to using widely used & popular 3rd party libraries like Trove, Guava & such others.

I have looked at TIntSet from Trove but I believe that would not let me define the order of iteration anyhow.

Edit:

The size of collection would be around 800,000 ints. The range of values in the collection will be from 0 to Integer.Max_VALUE. The order of iteration should be actually based on the order in which I add the value to collection or may be I just provide an ordered int[] & it should iterate in the same order.

Rajat Gupta
  • 25,853
  • 63
  • 179
  • 294
  • 2
    How large is "large"? What is the expected range of values? How tightly / sparsely populated the range is? – Péter Török Apr 26 '12 at 11:31
  • @user - unless you state your requirements with specifity, you are unlikely to get useful answers. – Stephen C Apr 26 '12 at 11:36
  • Thank you all, I added some edits to the question. Hopefully this should clarify my needs. – Rajat Gupta Apr 26 '12 at 11:41
  • 4
    "fast" and "space efficient" are often two things you can't have both. – Kai Apr 26 '12 at 11:42
  • Do your integers have any meaning? For example, would you be able to use a bloom filter to tell if a candidate "might" be in the Collection before going to find out for sure? – Ray Apr 26 '12 at 12:01
  • @Ray: Actually I need to add an `int` to the collection only if it is not already present(ie, preventing duplicates & overwrites). I may be able to use bloom filters if they can provide an overall better performance for `contains()` operations – Rajat Gupta Apr 26 '12 at 12:05

5 Answers5

5

As data structure I would choose an array of longs (which I logically treat as two ints). The high-int part (bits 63 - 32) represent the int value you add to the collection. The low-int part (bits 31 - 0) represents the index of the successor when iterating. In case of your 800.000 unique integers you need to create a long array of size 800.000.

Now you organize the array as a binary balanced tree ordered by your values. To the left the smaller values and to the right the higher values. You need two more tracking values: one int to point to the first index to start iterating at and one int to point to the index of the value inserted last.

Whenever you add a new value, reorganize your binary balanced tree and update the pointer from the last value added pointing to the currently added value (as indexes).

Wrap this values (the array and both int values) as the collection of your choice.

With this data structure you get a search performance of O(log(n)) and a memory usage of two times the size of values.

Harmlezz
  • 7,972
  • 27
  • 35
  • I don't get it. How should they "organize the array as a binary balanced tree"? This requires either explicit or implicit pointers. With explicit ones you'd need much more memory. With implicit ones, adding a new value may get very complicated and inefficient. What am I missing? – maaartinus May 17 '12 at 12:46
3

As this reeks of database, but you require a more direct approach, use a memory mapped file of java.nio. Especially a self-defined ordering of 800_000 ints will not do otherwise. The contains could be realized with a BitSet in memory though, parallel to the ordering in the file.

Joop Eggen
  • 107,315
  • 7
  • 83
  • 138
  • If the iteration order is the natural ordering, a bit set alone could do the job in a space efficient manner. It would be very fast for contains. Iteration would not be quite as fast as, say, a LinkedList, but still pretty quick, and certainly an improvement over a file. – Ray Apr 26 '12 at 12:06
1

You can use 2 Sets one set is set based on hash (e.g. TIntSet) for fast contains operations. Another is set based on tree structure like TreeSet to iterate in speicific order.
And when you need to add int, you update both sets at the same time.

Mikita Belahlazau
  • 15,326
  • 2
  • 38
  • 43
0

It sounds like LinkedHashSet might be what you're looking for. Internally, it maintains two structures--a HashSet and a LinkedList, allowing for both fast 'contains()' (from the former) and defined iteration order (from the latter).

Ray
  • 4,829
  • 4
  • 28
  • 55
-1

Just use a ArrayList<Integer>.

Kai
  • 38,985
  • 14
  • 88
  • 103
  • 2
    The original question was for a space efficient solution. LinkedList adds at least two pointer references of overhead for each integer (might be three, need to check the source). In other words, it takes three times as much space as an "int[]". – Torben Apr 26 '12 at 11:33
  • 2
    @Torben Given these requirements you can't discard a solution merely because it has some extra bytes of overhead per stored value. If you don't accept any overhead, you basically only have two options. You can have int array organized by insertion order ==> fast iteration but slow (`O(n)`) contains operations. Or you can have sorted by value int array which gives faster (`O(log(n)`) contains operations, but insertion order iteration will be impossible. I'm not saying that LinkedList (or ArrayList) are good solutions. I'm saying is that you can't discard solutions because they have overhead. – Alderath Apr 26 '12 at 12:20
  • 2
    The question implies very strongly that the OP already knows about LinkedList and other classes in java.util package. Also, LinkedList can not guarantee anything but O(N) performance for contains(). Therefore one can safely assume that it can not be a useful solution. – Torben Apr 26 '12 at 12:37