2

I have to store 500M two-character unicode strings in memory (RAM).

The data structure I use should have:

Worst Case Space Complexity: O(n)
Worst Case Time Complexity: O(1) <-- insertion, read, update, deletion

I was thinking of choosing dict, which is Python's implementation of a hash table, but the problem is that it only guarantees O(1) time complexity for the required operations in the average case, not the worst case.

I have heard that if the number of entries is known in advance, O(1) time complexity can be achieved even in the worst case.

How do I do that?

In case that's not possible in Python, can I access memory addresses and the data stored at them directly from my Python code? If yes, then how?

codersofthedark
  • Is it possible to test with a dictionary? Time complexity might not really be a problem. – msvalkon Mar 03 '13 at 23:01
  • @msvalkon: do you mean that, practically, I won't hit the worst-case scenario at all? – codersofthedark Mar 03 '13 at 23:05
  • The likelihood for the `O(n)` case is directly related to the number of hash collisions and therefore, the size of the `dict`. About how many objects are you putting in the `dict`? – Joel Cornett Mar 03 '13 at 23:14
  • What kind of characters are these? UTF-16 or UTF-8? And is it likely that the vast majority of the two-character combinations will be of a small set of characters (alphanumeric characters, for instance?) – David Robinson Mar 04 '13 at 01:05
  • @JoelCornett: 500M, and even a single failure is not tolerable (hard time-critical problem) – codersofthedark Mar 04 '13 at 20:38
  • @DavidRobinson: UTF-16; I didn't get the other part of your question – codersofthedark Mar 04 '13 at 20:40
  • Are all two-character pairs (of which there are 2^32) roughly equally likely? Or do some pairs of characters occur more frequently? If it is from any kind of text at all, some pairs (those of alphanumeric characters) will be far, far more likely than pairs that consist of two arbitrary, obscure characters – David Robinson Mar 04 '13 at 23:13
  • 3
    @UtxD: So if `n == 2`, that's unacceptable? I find that hard to believe. Let's assume an even distribution for your objects and assume that the size of the hashtable is 2/3 the size of the # of elements. So, `n = 500M` and `k = 333.33M`. The probability of any one element being assigned to a given hash is `1/k`. The probability of all 500 million elements being assigned to that hash slot is `(1/k)^500M`. Which is a *very, very small number*. I may be wrong, but it is more likely that an asteroid impact will destroy your hard drive before this happens. – Joel Cornett Mar 05 '13 at 05:13
  • @JoelCornett: Sounds fine. :) – codersofthedark Mar 05 '13 at 08:36
  • @DavidRobinson: All pairs are unique – codersofthedark Mar 05 '13 at 08:36
  • @UtxD: 500 million unique pairs of unicode characters? yikes. – David Robinson Mar 05 '13 at 08:39
  • @DavidRobinson: Unicode can provide 110,000*110,000 unique pairs, I guess. We are using them to make unique keys here. – codersofthedark Mar 05 '13 at 08:42
  • @UtxD: Sure, I know it's possible. Not sure why you'd use pairs of characters instead of integers, but I suppose it's the same. Anyway, the other answers are correct. – David Robinson Mar 05 '13 at 08:49
  • @DavidRobinson: Because using integers would consume more space for representing 500M records in memory. Since our db is on RAM we are trying to save on space. If there is a better approach on saving space for representation for the same, would love to know. – codersofthedark Mar 05 '13 at 10:08
  • @UtxD: I understand- I had been thinking of a 32 bit system, on which an int would take up the same amount of space as two unicode characters, but on a 64 bit system... – David Robinson Mar 05 '13 at 10:17

3 Answers

4

Mostly, the performance hits (usually taken on a collision) are amortized over all calls. So for most realistic use, you won't get O(n) for every call. In fact, the only case where you would incur an O(n) hit on every call is the pathological case where every key's hash collides with an existing key's hash value (i.e. the worst possible, or most unfortunate, usage of a hashtable).

If, for example, you know your set of keys beforehand, and you know that they will not have hash collisions (i.e. all their hashes are unique), then you will not suffer collision cases. The other major O(n) operation is hashtable resizing, but the frequency of this depends on the implementation (expansion factor/hash function/collision resolution scheme, etc.) and it will also vary run-to-run depending on the input set.
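
If the keys are available up front, a quick way to check for hash() collisions is to compare the number of distinct hash values against the number of keys (a minimal sketch; the sample keys are hypothetical). Note that a dict still reduces these hashes modulo its internal table size, so unique hash() values reduce, but do not strictly eliminate, slot collisions:

# Count distinct hash() values over the key set.
keys = ["ab", "cd", "ef", "gh"]          # hypothetical sample of two-character keys
distinct = len({hash(k) for k in keys})
if distinct == len(keys):
    print("no hash() collisions among these keys")
else:
    print(len(keys) - distinct, "hash() collisions detected")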

In either case, you can avoid sudden runtime slowdowns if you can pre-populate the dict with all keys. The values can just be set to None and populated with their real values later on. This should cause the only noticeable performance hit to occur when "priming" the dict with the keys initially, and future value insertion should be constant time.
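
A minimal sketch of that priming step, assuming the key set can be iterated once up front (the sample keys and the registration_id value are hypothetical placeholders):

# Prime the dict with every key once, so later writes never trigger a resize.
all_keys = ["ab", "cd", "ef"]        # hypothetical iterable of the two-character keys
seats = dict.fromkeys(all_keys)      # each key maps to None for now

# Later operations only touch existing keys, so the table never has to grow.
seats["ab"] = 123456                 # e.g. assign a passenger registration_id
value = seats["cd"]                  # still None until it is assigned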

A completely different question is how you intend to read/query the structure. Do you need to attach separate values and access them via a key? Should it be ordered? Perhaps a set might be more appropriate than a dict, since you do not really require a key:value mapping.

Update:

Based on your description in the comments, this is starting to sound more like a job for a database to do, even if you are working with a temporary set. You could use an in-memory relational database (e.g. with SQLite). Additionally, you could use an ORM like SQLAlchemy to interact with the database more pythonically and without having to write SQL.

It even sounds like you might be reading the data from a database to start with, so maybe you can leverage that further?

Storing/querying/updating a massive number of typed records that are keyed uniquely is exactly what RDBMSs have been specialised for, with decades of development and research behind them. Using an in-memory version of a pre-existing relational database (such as SQLite's) will probably be a more pragmatic and sustainable choice.

Try using Python's built-in sqlite3 module, and try the in-memory version by providing ":memory:" as the db file path on construction:

import sqlite3
con = sqlite3.connect(":memory:")
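
Expanding that one-liner into a fuller sketch (the seats table and its columns are hypothetical, modeled on the seat/registration_id example from the comments):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE seats (seat_key TEXT PRIMARY KEY, registration_id INTEGER)")

# insert a seat key (a two-character unicode string) with no passenger assigned yet
con.execute("INSERT INTO seats (seat_key) VALUES (?)", ("ab",))

# update the value for that key later on
con.execute("UPDATE seats SET registration_id = ? WHERE seat_key = ?", (123456, "ab"))

# read it back by key
row = con.execute("SELECT registration_id FROM seats WHERE seat_key = ?", ("ab",)).fetchone()
print(row[0])  # 123456
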
Preet Kukreti
  • Hash table resize cost is fully amortizable; it costs n operations, but you need n inserts first before you *need* to resize. – Martijn Pieters Mar 04 '13 at 00:02
  • @PreetKukreti: pre-populating the dict with all keys and a None value is possible. But I don't want to hardcode the initialization of a dict with 500M keys. I was thinking of adding the keys using a for loop, but then I don't know if a unique hash would be created. How to ensure that? – codersofthedark Mar 04 '13 at 20:46
  • Each key represents a unique seat on a unique flight. The value would be updated to the registration_id of the passenger. Update/Delete/Insert calls would always be made by the key. I don't know if we can use sets here. – codersofthedark Mar 04 '13 at 21:00
  • @PreetKukreti: I am searching for a NoSQL in-memory solution. Can you advise one? Also, what is the difference between caching and having an in-memory db? – codersofthedark Mar 05 '13 at 10:09
2

The dictionary technically has a worst case of O(n), but it's highly unlikely to occur and likely won't in your case. I'd try to use the dictionary and only switch to a different implementation if that isn't sufficient for what you want to do.

Here is a useful thread on the subject
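
One way to sanity-check this before committing (a minimal sketch; the key format and operation mix are hypothetical) is to time dict inserts and lookups on a sample of your data with the standard timeit module:

import timeit

# Build a sample of unique two-character keys (hypothetical; use a slice of your real data).
sample = [chr(0x4E00 + i) + chr(0x4E00 + i // 1000) for i in range(1_000_000)]
d = {}

insert_time = timeit.timeit(lambda: d.update((k, None) for k in sample), number=1)
lookup_time = timeit.timeit(lambda: [d[k] for k in sample], number=1)
print(f"insert: {insert_time:.2f}s, lookup: {lookup_time:.2f}s for {len(sample)} keys")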

Snukus
  • 50M operations have to be done on that dict daily, in server memory. The situation is hard time-critical and failure cannot be entertained. That being said, please elaborate on "likely won't in your case." Thanks – codersofthedark Mar 04 '13 at 20:51
2

Is there a reason you care about the worst-case performance instead of the average performance? Any reasonable hashtable will give you O(1) average performance per operation (i.e. O(N) total for N operations).

If you really want worst-case performance of O(1), here are two possible approaches:

  1. Have a vector of max(charCode)-min(charCode) entries and directly look up the value you want from the unicode character code. This will work well if your keys fall in a compact-enough range that you can fit it in RAM.

  2. Use a brute-force approach to choose hash functions or dictionary sizes (using a custom implementation of a dictionary that lets you control this), and keep trying new functions and/or sizes till you get one with no collisions; a sketch follows after this list. Expect this to take a very long time. I do not recommend this.
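
A minimal sketch of that brute-force search, assuming the full key set is known in advance (the salted-hash scheme and the search limits here are illustrative choices, not part of any real dict implementation):

def find_collision_free_params(keys, max_salt=1000):
    """Search for a (salt, table_size) pair whose salted hashes are collision-free."""
    n = len(keys)
    size = n
    while True:                              # grow the table until a perfect fit is found
        for salt in range(max_salt):
            slots = {hash((salt, k)) % size for k in keys}
            if len(slots) == n:              # every key landed in its own slot
                return salt, size
        size += 1                            # no luck at this size; try a larger table

keys = ["ab", "cd", "ef", "gh"]              # hypothetical sample of two-character keys
salt, size = find_collision_free_params(keys)
print(salt, size)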

EDIT:

Suppose that you know that the minimum character code you'll see is 1234 and the maximum you'll see is 98765. Further suppose that you have enough RAM to hold 98765-1234 elements. I'll also assume you're willing to use the numpy library or some other efficient array implementation. In that case, you can store the values in the vector like this:

import numpy

# configuration info
max_value = 98765 # replace with your number
min_value = 1234  # replace with your number
spread = (max_value - min_value)
dtype = object # replace with a primitive type if you want to store something simpler

# create the big vector
my_data = numpy.empty((spread,), dtype=dtype)

# insert elements
my_char_code              = ...
my_value_for_my_char_code = ...

assert min_value <= my_char_code < max_value
my_data[my_char_code - min_value] = my_value_for_my_char_code

# extract elements
my_char_code              = ...
assert min_value <= my_char_code < max_value
my_value_for_my_char_code = my_data[my_char_code - min_value]

This is O(1) because the lookup is implemented using pointer arithmetic and there's no dependence on the number of elements stored in the array.

This approach can be extremely wasteful of RAM if the number of elements you actually want to store is much smaller than spread. For example, if spread is 4 billion (all of UTF-32) then my_data alone will consume at least 4 billion * 8 bytes / pointer = 32 GB of RAM (and probably a lot more; I don't know how big Python references are). On the other hand, if min_value is 3 billion and max_value = min_value + 100, then the memory usage will be tiny.

Mr Fooz