3

I have a need for a data structure that will be able to give preceding and following neighbors for a given int that is part of the structure.

Some criteria I've set for myself:

  • write once, read many times
  • contain 100 to 1000 int
  • be efficient: order of magnitude O(1)
  • be memory efficient (size of the ints + some housekeeping bits ideally)
  • implemented in pure Java (no libraries for this, as I want to learn)
  • items are unique
  • no concurrency requirements
  • ints are ordered externally, that order will most likely not be a natural ordering, and that order must be preserved (ie. there is no contract whatsoever regarding the difference in value between two neighboring ints - any int may be greater or smaller than the int it preceeds in the order).

This is in Java, and is mostly theoretical, as I've started using the solution described below.

Things I've considered:

  • LinkedHashSet: very quick to find an item, order of O(1), and very quick to retrieve next neighbor. No apparent way to get previous neighbor without reverse sorting the set. Boxed Integer objects only.
  • int[]: very easy on memory because no boxing required, very quick to get previous and next neighbor, retrieval of an item is O(n) though because index is not known and array traversal is required, and that is not acceptable.

What I'm using now is a combination of int[] and HashMap:

  • HashMap for retrieving index of a specific int in the int[]
  • int[] for retrieving the neighbors of that int

What I like:

  • neighbor lookup is ideally O(2)
  • int[] does not do boxing
  • performance is theoretically very good

What I dislike:

  • HashMap does boxing twice (key and value)
  • the ints are stored twice (in both the map and the array)
  • theoretical memory use could be improved quite a bit

I'd be curious to hear of better solutions.

Marceau
  • 1,643
  • 4
  • 17
  • 27
  • *What I'm using now is a combination of int[] and HashMap*: can you show what have you tried alread? – Jordi Castilla May 20 '15 at 12:29
  • 2
    How is retrieval of an `int[]` item O(n)? Searching for an item is, but retrieving is a constant-time operation. – Anders R. Bystrup May 20 '15 at 12:29
  • 2
    Retrieval is O(n) for `int[]`? Don't think so – kaykay May 20 '15 at 12:30
  • Is there a range for the `int` values which you need to store? Can they by anything between MIN_INT and MAX_INT or less? – Aaron Digulla May 20 '15 at 12:33
  • Edited for clarification in question: when I talk about int[] and O(n), I do mean search: I have the item only, and not its index in the array, so I need a search first, which requires at most O(n) time. I believe this was clear from the rest of my question though. – Marceau May 20 '15 at 12:37
  • @AaronDigulla: I would be fine with positive integers only (0 < int <= MAX_INT) – Marceau May 20 '15 at 12:39
  • @Zaan If you must search the `int[]` it is only O(n) in *worst case*. Average case is O(log(n)) if you ensure sort order on insert. – MadConan May 20 '15 at 12:58
  • Are you concerned about memory? Keep in mind, that it is usually not an issue on any newer machine, unless you are on some limited architecture (phone, ARM, etc) *and* working with large data sets. – MadConan May 20 '15 at 13:01
  • @MadConan perhaps I'm missing something, but as my primary requirement is to find neighbors (which are the consequence of an external sort), sorting the ints on their value is not an option. – Marceau May 20 '15 at 13:02
  • @MadConan not concerned, I'd say I'm interested as to what may theoretically be possible – Marceau May 20 '15 at 13:03
  • @Zaan I don't understand the requirement about finding neighbors making sorting on value not an option. What do these integers represent? – MadConan May 20 '15 at 13:05
  • @MadConan I've tried to add a clarification to my question: the integers are database ids of items that are filtered and sorted on a variety of fields: the integers are not sorted naturally, but their order must be preserved. – Marceau May 20 '15 at 13:10

2 Answers2

1

One solution is to sort the array when you add elements. That way, the previous element is always i-1 and to locate a value, you can use a binary search which is O(log(N)).

The next obvious candidate is a balanced binary tree. For this structure, insert is somewhat expensive but lookup is again O(log(N)).

If the values aren't 32bit, then you can make the lookup faster by having a second array where each value is the index in the first and the index is the value you're looking for.

More options: You could look at bit sets but that again depends on the range which the values can have.

Commons Lang has a hash map which uses primitive int as keys: http://grepcode.com/file/repo1.maven.org/maven2/commons-lang/commons-lang/2.6/org/apache/commons/lang/IntHashMap.java but the type is internal, so you'd have to copy the code to use it.

That means you don't need to autobox anything (unboxing is cheap).

Related:

Community
  • 1
  • 1
Aaron Digulla
  • 321,842
  • 108
  • 597
  • 820
0

ints are ordered externally, that order will most likely not be a natural ordering, and that order must be preserved (ie. there is no contract whatsoever regarding the difference in value between two neighboring ints).

This says "Tree" to me. Like Aaron said, expensive insert but efficient lookup, which is what you want if you have write once, read many.

EDIT: Thinking about this a bit more, if a value can only ever have one child and one parent, and given all your other requirements, I think ArrayList will work just fine. It's simple and very fast, even though it's O(n). But if the data set grows, you'll probably be better off using a Map-List combo.

Keep in mind when working with these structures that the theoretical performance in terms of O() doesn't always correspond to real-word performance. You need to take into account your dataset size and overall environment. One example: ArrayList and HashMap. In theory, List is O(n) for unsorted lookup, while Map is O(1). However, there's a lot of overhead in creating and managing entries for a map, which actually gives worse performance on smaller sets than a List.

Since you say you don't have to worry about memory, I'd stay away from array. The complexity of managing the size isn't worth it on your specified data set size.

MadConan
  • 3,749
  • 1
  • 16
  • 27
  • "array. The complexity of managing the size isn't worth it on your specified data set size." Okay, but as this is write-once, I would think managing size is not actually an issue (at least not a defining one). Also, perhaps I'm mistaken (again), but doesn't a tree suggest natural ordering? – Marceau May 20 '15 at 13:17
  • It can, but it's not required. – MadConan May 20 '15 at 13:23
  • @Zaan When you say "write once" I'm see that as writing the value once. If there are 1000 values, that's 1000 writes. Without knowing all of the specifics, it's hard to say for sure what you need. But using raw arrays is generally not worth the trouble. – MadConan May 20 '15 at 13:28