I have a table with sorted numbers like:
1 320102
2 5200100
3 92010023
4 112010202
5 332020201
6 332020411
:
5000000000 3833240522044511
5000000001 3833240522089999
5000000002 4000000000213312
Given the record number I need the value in O(log n) time. The record number is 64-bit long and there are no missing record numbers. The values are 64-bit long, they are sorted and value(n) < value(n+1).
The obvious solution is simply doing an array and use the records number as index. This will cost 64-bit per value.
But I would like a more space efficient way of doing that. Since we know the values are always increasing that should be doable, but I do not remember a data structure that lets me do that.
A solution would be to use deflate on the array, but that will not give me O(log n) for accessing an element - thus unacceptable.
Do you know of a data structure that will give me:
- O(log n) for access
- space requirement < 64-bit/value
= Edit =
Since we know all numbers in advance we could find the difference between each number. By taking the 99th percentile of these differences we will get a relatively modest number. Taking the log2 will give us the number of bits needed to represent modest number - let us call that modest-bits.
Then create this:
64-bit value of record 0
64-bit value of record 1024
64-bit value of record 2048
64-bit value of record 3072
64-bit value of record 4096
Then a delta table for all records:
modest-bits difference to record 0
modest-bits difference to previous record
1022 * modest-bits difference to previous record
modest-bits difference to record 1024
modest-bits difference to record k*1024 will always be 0, so we can use that for signaling. If it is non-zero, then the following 64-bit will be a pointer to a simple array for the next 1024 records as 64-bit values.
As the modest value is chosen as the 99th percentile number, that will at most happen 1% of the time, thus wasting at most 1% * n * modest-bits + 1% * n * 64-bit * 1024.
space: O(modest-bits * n + 64-bit * n / 1024 + 1% * n * modest-bits + 1% * n * 64-bit * 1024)
lookup: O(1 + 1024)
(99% and 1024 may have to be adjusted)
= Edit2 =
Based on the idea above, but wasting less space. Create this:
64-bit value of record 0
64-bit value of record 1024
64-bit value of record 2048
64-bit value of record 3072
64-bit value of record 4096
And for all value that cannot be represented by modest-bits create big-value table as a tree:
64-bit position, 64-bit value
64-bit position, 64-bit value
64-bit position, 64-bit value
Then a delta table for all records, that is reset for every 1024 records:
modest-bits difference to record 0
modest-bits difference to previous record
1022 * modest-bits difference to previous record
modest-bits difference to record 1024
but also reset for every value that is in the big-value table.
space: O(modest-bits * n + 64-bit * n / 1024 + 1% * n * 2 * 64-bit).
Lookup requires searching big-value table, then looking up the 1024'th value and finally summing up the modest-bits values.
lookup: O(log(big-value table) + 1 + 1024) = O(log n)
Can you improve this? Or do better in a different way?