4

If you were a highly skilled low-latency Java developer (I am not) and you were told to implement a set of ints (primitive or not), would it be possible for you to get an extra performance gain given the guaranteed pre-condition that every new entry is higher than any value previously stored in the set?

How significant could that gain be for the add, contains and remove operations in best/worst-case scenarios?

On the one hand, it seems natural that such a restriction would result in better performance. On the other hand, non-decreasing entries are a very common situation (e.g. when generating unique ids), and if the gain were worth fighting for, then a more or less well-known implementation would already have been developed.

andbi
  • 4,426
  • 5
  • 45
  • 70
  • Asking about such a delicate form of possible performance gains, and then casually throwing in *"(primitive or not)"* seems contradictory. If performance is *really* critical, then you should only use the *primitive* type. There are several primitive collection libraries. – Marco13 Feb 20 '17 at 11:32
  • if you add integer, the complexity will be the same as 0 is stored in two bytes and Integer.MAX is also two bytes. – nafas Feb 20 '17 at 11:32
  • @Marco13 Of course primitive sets will behave better and I will prefer them over other implementations, but the question is rather a theoretical one. – andbi Feb 20 '17 at 11:43
  • Consider using [BitSet](http://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html). This has the advantage of not using boxed integer types. The equivalents of `Set.add()` and `Set.contains()` are respectively `BitSet.set(int)` and `BitSet.get(int)`. (This doesn't answer your question of how to benefit from knowing that adds will be in increasing order.) – Klitos Kyriacou Feb 20 '17 at 12:01

1 Answer

3

When you check this question, you find that add and contains are already O(1). So there is not much to improve there.

And I think those two would be the only ones that could benefit from this constraint:

  • "adding" becomes easier because you could simply remember the last value that was added; so only need one check when a new value is coming in
  • similarly, when asking for "contained"; you have a first pre-check that tells you instantly when a given value can not be in the set

But that is about it.
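
To illustrate those two points, here is a minimal sketch of such a wrapper. It is not an existing library class; the class and field names (MonotonicIntSet, backing, last) are made up, and it assumes a plain HashSet<Integer> as the backing store:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical wrapper exploiting the "entries only ever increase" constraint.
class MonotonicIntSet {
    private final Set<Integer> backing = new HashSet<>();
    private int last = Integer.MIN_VALUE;
    private boolean empty = true;

    // add: a single comparison against the last value replaces any duplicate check
    boolean add(int value) {
        if (!empty && value <= last) {
            return false; // duplicate, or the pre-condition was violated
        }
        backing.add(value);
        last = value;
        empty = false;
        return true;
    }

    // contains: anything above the last added value can be rejected immediately
    boolean contains(int value) {
        if (empty || value > last) {
            return false;
        }
        return backing.contains(value);
    }
}
```

Whether that pre-check actually buys anything over a plain HashSet lookup is exactly the kind of question you would have to measure, not guess.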

And beyond that: when your constraint really is that each "new" entry is larger than the last one, then you don't need a Set in the first place, because the constraint already guarantees that all items will be unique. So in that sense, you could be looking into Lists, too ...
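
As an illustration of that list idea (again just a sketch with made-up names, not part of any library): a growing int[] where add is a plain append, since the pre-condition keeps the array sorted, and contains falls back to binary search, i.e. O(log n) rather than O(1):

```java
import java.util.Arrays;

// Sketch of a list-based alternative for strictly increasing input.
class IncreasingIntList {
    private int[] data = new int[16];
    private int size = 0;

    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2); // grow, roughly like ArrayList
        }
        data[size++] = value; // pre-condition keeps the array sorted
    }

    boolean contains(int value) {
        return Arrays.binarySearch(data, 0, size, value) >= 0; // O(log n)
    }
}
```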

Regarding the comment that the question is really about the possible delta between O(1) and O(1.5), my response:

The difference between O(1) and O(n) is of a theoretical nature; you answer that with pen and paper. For the difference between O(1.0) and O(1.005) ... there I would start with experiments and benchmarks.

Meaning: these "real" factors depend on various elements which are "close" to the underlying implementation. You would start by looking into how the Set that you are currently using is implemented for your platform; and how the JVM on your platform is doing its just-in-time-compiling. From there on, you could draw conclusions about things that could be improved by taking this constraint into consideration.

Finally, regarding whether the constraint could degrade existing implementations: I guess this could also happen; as said above, such details really depend on the specific implementation. And beyond that: you named three different operations, and the actual results can differ considerably depending on the operation type.

If I had to work on this problem, I would start by creating reasonably large files with "test data" (random numbers, increasing-only numbers, and variations of that). Then I would use a real profiler (or at least sophisticated benchmarking) and start measuring.
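
For example, a rough JMH skeleton for that kind of measurement could look like the following. This assumes the jmh-core and annotation-processor dependencies are set up; the class name, array sizes and data generation are arbitrary placeholders:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

// Compares adding increasing-only values against random values into a HashSet.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class SetAddBenchmark {

    int[] increasing;
    int[] random;

    @Setup
    public void prepare() {
        increasing = new int[100_000];
        random = new int[100_000];
        Random rnd = new Random(42);
        for (int i = 0; i < increasing.length; i++) {
            increasing[i] = i;          // increasing-only input
            random[i] = rnd.nextInt();  // baseline input
        }
    }

    @Benchmark
    public Set<Integer> addIncreasing() {
        Set<Integer> set = new HashSet<>();
        for (int v : increasing) {
            set.add(v);
        }
        return set; // returned so JMH does not dead-code-eliminate the work
    }

    @Benchmark
    public Set<Integer> addRandom() {
        Set<Integer> set = new HashSet<>();
        for (int v : random) {
            set.add(v);
        }
        return set;
    }
}
```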

GhostCat
  • 137,827
  • 25
  • 176
  • 248
  • 1
    I know they're all O(1), however there is a significant difference between, say, 1.5 and 1.0, that's the gain I'm after. – andbi Feb 20 '17 at 11:33
  • 1
    I think that "low-latency" implies that the *asymptotic* complexity is not the key point here. It may rather be about the number of nanoseconds of one `add` operation according to a JMH benchmark (and, likely, the possible garbage collection times that are implied by using boxed `Integer` objects!) – Marco13 Feb 20 '17 at 11:34
  • Missed the update to your answer, please disregard my first comment. – andbi Feb 20 '17 at 11:35
  • Thanks, I got your point about measuring, but is it safe to say that such pre-condition will not degrade the performance? – andbi Feb 20 '17 at 11:56