
I use a dictionary to store unique keys mapped to unique values.

At first the keys were str and the values int. I used sys.getsizeof(diction) to get the size, and I also dumped the dict to a file:

import csv

with open("debug1.txt", 'w') as out:
    debugggg = csv.writer(out)
    for key, val in diction.items():
        debugggg.writerow([key, val])

The file came out to 296 MB, and sys.getsizeof() returned 805306648.

Then I stored the same values, but this time I hashed the keys before inserting them:

diction[hash(mykey_1)] = value_1

I was expecting this to be a bit more compact than the previous approach.

I ran the same measurements. This time the file was 362 MB(!), and sys.getsizeof() returned the same result as before (805306648).

Processing time was the same, as expected (O(1) lookups). But I am a bit confused about the sizes.

  • Well, according to http://stackoverflow.com/questions/114830/is-a-python-dictionary-an-example-of-a-hash-table Python dictionaries are already hashtables, so it seems probable that a better hash function for the problem is being used in the first case. This doesn't surprise me. – Tripp Kinetics May 28 '14 at 14:51

2 Answers


sys.getsizeof(some_dict) accounts for the size of the internal hash table, which is roughly proportional to the number of keys. But it does not account for the size of the keys and values, partly because that would be tricky to do correctly, and partly because there can be many other references to those objects so it's a bit misleading to include them (their size might be amortized over many different dicts, for instance).
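A quick way to see this (a minimal sketch; the exact byte counts vary by Python version and platform, but the comparison should hold):

import sys

# The dict's reported size is the same whether its value is a tiny int or a
# ~10 MB string, because getsizeof only measures the dict's own hash table,
# not the objects it references.
small = {"key": 1}
big = {"key": "x" * 10**7}

print(sys.getsizeof(small))       # dict overhead only
print(sys.getsizeof(big))         # same number as above
print(sys.getsizeof(big["key"]))  # ~10 MB, counted separately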

As for the file size: aside from the fact that this does include the sizes of keys and values, the values (integers) are encoded differently, possibly less efficiently. On the other hand, this might be balanced by the fact that an in-memory int object carries considerable metadata and rounds the actual data up to 4 or 8 bytes. Other factors: a CSV file includes the commas and line breaks, and a hash is often a large number, larger than many short strings (hash("a") is -392375501 on my machine).
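To make that concrete, a rough sketch (hash values differ by platform, and string hashes are randomized per process on Python 3):

key = "a"
# As CSV text, the string key costs one character, while its hash is the
# decimal rendering of a machine-word integer, often 10-20 digits.
print(len(key))             # 1
print(len(str(hash(key))))  # typically 10-20, e.g. len("-392375501") == 10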

Side note: it's likely that the dict you built using diction[hash(mykey_1)] = ... is wrong. You're doing the hashing outside of the dictionary's view, so it can't protect you from hash collisions: you might lose some values because their keys hash to the same integer. And since the internal hash table always has a power-of-two size and is only resized at certain thresholds, having a few entries fewer doesn't necessarily show up in sys.getsizeof. A demonstration follows.
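Here is one collision you can reproduce on CPython, where hash(-1) == hash(-2) == -2 (because -1 is reserved as the C-level error code):

pre_hashed = {}
pre_hashed[hash(-1)] = "first"
pre_hashed[hash(-2)] = "second"  # same key as above: both hash to -2
print(len(pre_hashed))           # 1 -- "first" was silently overwritten

normal = {}
normal[-1] = "first"
normal[-2] = "second"            # the dict resolves the collision internally
print(len(normal))               # 2 -- both entries survive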


First, O(1) is the average time for lookups in a dict, but it can be as bad as O(n). Here's the source for that information.
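You can provoke that worst case yourself with a deliberately bad hash function (a sketch, not something you'd want in real code):

class BadHash:
    """Every instance hashes to the same bucket, forcing linear probing."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42                # constant hash: all keys collide
    def __eq__(self, other):
        return self.n == other.n

# Inserting n colliding keys already costs O(n^2); each lookup is O(n).
d = {BadHash(i): i for i in range(2000)}
print(d[BadHash(1999)])          # 1999, after probing a long collision chain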

Second, the hash function returns an integer. sys.getsizeof(hash("Some value")) returns 12 on my platform. But if you read up on how Python dictionaries are implemented, you'll see that the size of the keys doesn't really have much to do with how many bytes the entire dictionary takes up. That has more to do with how many total items you're storing, not how you access them.
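For example (a small sketch; the exact byte count depends on your Python build, but the equality should hold):

import sys

# Same number of entries means the same internal table size, even though the
# second dict's keys are 100x longer; the key objects live outside the table.
short_keys = {str(i): i for i in range(1000)}
long_keys = {str(i) * 100: i for i in range(1000)}

print(sys.getsizeof(short_keys) == sys.getsizeof(long_keys))  # True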

FrobberOfBits