3

I have been using a dict to store key/value pairs where the key and value are both sha256 hash digests. I need to be able to find out whether a key exists in the dict, and also be able to retrieve the value for that key.

At the moment I estimate I will need about 10GB of memory to store 8,000,000 hashes, based on some of my tests, whereas the actual data being stored is only about 512MB (32 bytes per hash, so 64 bytes per record).

Does anyone have any suggestions?

UPDATE: based on some of the comments I thought I should update. I am storing the hashes as bytes, not as hex strings. I am using an sqlite database for permanent storage of the data, but inserting that many records with an index gets too slow after about 1,000,000 records, and without the index, checking the existence of a key gets slower and slower as well. That's why I want to use an in-memory structure to do the lookup.

UPDATE 2

Could this work? atbr hashtable

My Solution: (Should I put this as an answer?) What I ended up doing was taking lots of advice from @abarnert and creating a new class that implements 1024 lists of [count, bytearray(8000 * 32), bytearray(8000 * 32)].

I use the first 10 bits of the hash as an index to which list I should store the hashes in. Then I just append the key to the next 32-byte slot, and the value to the same slot in the other byte array.

I can generate 16,000,000 hashes (one for the key and one for the value) and insert the 8,000,000 key value pairs into the structure in about 30 seconds.

Searching is just the reverse: I use the first 10 bits to find the list, and then I just do a linear search for the hash until I find it.

Searching for 200,000 hashes picked at random from the 8,000,000 takes 30 seconds, so 40 times slower than writing, but it should be fast enough for my needs.

The bonus of it all is that it now only consumes 519MB of RAM for all 8,000,000 hashes.
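
Roughly, the structure looks like this (a simplified sketch rather than my exact class, assuming Python 3 bytes keys/values; it ignores bucket overflow and duplicate keys):

class HashStore:
    SLOT = 32

    def __init__(self, buckets=1024, capacity=8000):
        # each bucket is [count, keys, values]; keys/values hold fixed 32-byte slots
        self.buckets = [[0, bytearray(capacity * self.SLOT), bytearray(capacity * self.SLOT)]
                        for _ in range(buckets)]

    def _bucket(self, key):
        # the first 10 bits of the key pick the bucket (0..1023)
        return self.buckets[((key[0] << 8) | key[1]) >> 6]

    def put(self, key, value):
        bucket = self._bucket(key)
        count, keys, values = bucket
        off = count * self.SLOT
        keys[off:off + self.SLOT] = key
        values[off:off + self.SLOT] = value
        bucket[0] = count + 1

    def get(self, key):
        count, keys, values = self._bucket(key)
        for i in range(count):  # linear scan within the bucket
            off = i * self.SLOT
            if keys[off:off + self.SLOT] == key:
                return bytes(values[off:off + self.SLOT])
        return None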

Thank you everyone for your help.

Deano123
    One straightforward way to reduce memory consumption is to write your hash table in C. You can still use it from Python. – Dietrich Epp Jan 10 '14 at 19:23
  • Use a database. A `sqlite` library comes with Python. – Martijn Pieters Jan 10 '14 at 19:24
  • I am afraid I am not that clever, that's why I am writing my code in python, it would take me too long to learn how to do that. But if I get really desperate I might have to learn. – Deano123 Jan 10 '14 at 19:24
  • I am using an sqlite database, but as I need to keep inserting values I can't have an index; it takes longer and longer to insert records the more values are in there. Without the index, checking if an item exists takes too long. – Deano123 Jan 10 '14 at 19:25
  • Are you storing the hash digest as bytes, or as hex strings? If the latter… that's the first thing you want to change. – abarnert Jan 10 '14 at 19:25
  • When using sqlite, have you done the "usual trick" of using transactions? They speed up access by a lot if you wrap multiple updates in a single transaction. – Lasse V. Karlsen Jan 10 '14 at 21:03
  • "Should I put this as an answer?" - yes please – milahu Aug 30 '23 at 19:10
  • "I use the first 10 bits of the hash as in index to which list I should store the hashes in" - the more generic problem is called "search tree". usually you would use the first N bytes to partition your data into buckets, or into a tree of buckets, for example 256 x 256 buckets from the first 2 bytes – milahu Aug 30 '23 at 19:16
  • "[count, bytearray(8000 * 32), bytearray(8000 *32)]" it should be faster to store key and value together for each entry, so you would have one bytearray(2 * 8000 * 32) array, the key index is 2*I and the value index is 2*i+1 – milahu Aug 30 '23 at 19:18

6 Answers

3

First, let's look at why this is so big.

Each hash is 32 bytes. That means it takes about 32 bytes to store in binary form in, e.g., the storage of a bytes or bytearray object. So far, so good.

But all Python objects have headers, typically on the order of 24-64 bytes. From a quick check, it looks like bytes objects take up an extra 36 bytes on 32 bits (possibly plus alignment padding), and 48 on 64 bits, at least on the two versions of CPython I checked.

So, how can you get rid of that 150% extra storage? Pack the bytes into one giant array, like a bytes or bytearray. Then you've got 48 bytes total plus 32 per hash, instead of 48+32 per hash. When you need to access a hash, if you've got the index, it's just the slice [index*32:(index+1)*32].
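
For example, a rough sketch of the packing (hashlib is only used here to generate some digests to pack):

import hashlib

# pack many 32-byte digests end-to-end in one bytearray
packed = bytearray()
for i in range(1000):
    packed += hashlib.sha256(str(i).encode()).digest()

# hash number `index` is just a 32-byte slice
index = 42
digest = bytes(packed[index * 32:(index + 1) * 32])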

Also, depending on how you created the bytes, there can be some overflow slop. You can check that—if sys.getsizeof(s) - sys.getsizeof(b'') > len(s), you need to slice all of your objects to create new copies with no extra padding.

Anyway, now you have 8M extra indexes. If those are transient, that's fine, but if you're storing them as ints in dict value slots, then each one of them also has a header. From a quick test, on top of the 4 bytes for actual storage (for an int under 1<<31), there's a 24-byte header in both 32- and 64-bit (although very small ints can apparently be crammed into the header). So, all this does is cut your 48 bytes of waste to 28 bytes, which isn't great.

You could use some form of packed storage, like the array module. An array with typecode 'I' only uses 4 bytes per integer. But then you need indices into the array, which… is the same problem you just solved.

But you really don't even need the indices—if you store the keys themselves in an array, the index of any key is already the index of the hash in the byte string (divided by 32), right?

This only works if you can store the keys in some sort of compact array. If they're all around the same size, you can do this by using the same "giant bytestring" trick again. And in your case, they are—the keys are also 32-byte hashes. So you just have to keep both giant byte strings sorted by the key values (see the bisect module so you don't have to write that code yourself).
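
Here's a sketch of that idea (my names, not a drop-in implementation): wrap the giant bytearray in a tiny class that presents it as a sequence of 32-byte records so bisect can search it, and keep a parallel bytearray of values at the same record indices.

import bisect

class Records:
    # read-only view of a bytearray as a sequence of fixed-size records
    def __init__(self, data, size=32):
        self.data, self.size = data, size
    def __len__(self):
        return len(self.data) // self.size
    def __getitem__(self, i):
        return bytes(self.data[i * self.size:(i + 1) * self.size])

keys = bytearray()    # all 32-byte keys, kept sorted
values = bytearray()  # value i lives at the same record index as key i

def insert(key, value):
    i = bisect.bisect_left(Records(keys), key)
    if i < len(keys) // 32 and keys[i * 32:(i + 1) * 32] == key:
        return  # already present
    keys[i * 32:i * 32] = key       # O(N) shift done in C, but still linear
    values[i * 32:i * 32] = value

def lookup(key):
    i = bisect.bisect_left(Records(keys), key)
    if i < len(keys) // 32 and keys[i * 32:(i + 1) * 32] == key:
        return bytes(values[i * 32:(i + 1) * 32])
    return None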

Of course using binary search algorithms instead of hashing means that you're making the lookups and inserts logarithmic instead of constant. And, while log(8M) is only around 16, which is a lot better than 8M, it's still 16x as bad as 1. But this is effectively what you get from an ideally-tuned relational database, except that you don't need to do any tuning, and it's all in memory, and there's no extra overhead, so it has to be an improvement over what you've tried so far.

You could of course build a custom hash table in Python, using the two giant byte arrays as your storage, and two array('I') as your indices. But that's a lot more work, so I'd try it the easy way first.

abarnert
  • @abarnert From a memory aspect byte arrays will be perfect, but searching them is quite hard. I have looked at the bisect module and I can't work out how to use it with byte arrays. It seems simple enough with a normal list, but with a byte array you effectively have a list of single bytes. How could you search in blocks of 32? Also, is it possible to insert blocks of bytes into a byte array, or just single bytes? Thanks – Deano123 Jan 13 '14 at 07:36
  • @Deano123: OK, first you can implement a `ArrayOfByteArray32s` class by using a `bytearray` for storage, and implementing `__len__`/`__getitem__`/`__setitem__` by multiplying the index or slice members by 32 and then delegating to the underlying `bytearray`. Then it looks just like a sequence of `bytearrays`, so it can be used with `bisect`. If that doesn't make sense, I can slap together an implementation. – abarnert Jan 13 '14 at 08:42
  • @Deano123: However… what does your use profile look like? If you do as many inserts as lookups, this implementation won't help enough (it makes searches logarithmic, but inserts are linear), so it's only appropriate if you're doing lots of lookups per insert. – abarnert Jan 13 '14 at 08:45
  • @abarnert Worst case I will be doing as many inserts as lookups: I take a hash, see if it is in there already, and if not then add it to the list. Would it help to allocate the entire array first and then populate it with the data? I think I understand the `__len__`/`__getitem__`/`__setitem__` idea, but don't want to do that if it won't help. I was looking at memcache, but don't really want to have to run an extra server as well as my app; I would rather do it all "in house". If inserting into an indexed table wasn't so slow in sqlite that would be perfect. Any other ideas? Thanks for your help BTW – Deano123 Jan 13 '14 at 13:20
  • 1
    @Deano123: Pre-allocating won't really help; you save the O(1) amortized reallocation time, but not the O(N) time for shifting half the array down every time you insert something into its sorted position. You can reduce that to O(log N) average/O(N) worst by strategically leaving gaps (which increases search time to O(log**2 N), but that's not a big deal), or to O(log N) worst by using a skip list instead of a simple list… but the whole virtue of the flat array was its simplicity; if you're going to write a complex custom data structure, you might as well write a compact hash table or tree… – abarnert Jan 13 '14 at 20:01
  • @Deano123: Meanwhile, have you looked at the `dbm` solution from my other answer yet? Because that basically _is_ a compact (well, more compact than `dict`, not as compact as a space-optimized custom implementation) hash table, with automatic backing to disk as needed (in a way that's generally much better than relying on VM paging), which seems likely to be good enough for your use case. – abarnert Jan 13 '14 at 20:03
  • I was looking for a cross-platform solution, and according to all the docs I read it was Unix-only. But my initial development is on Linux so I will do some testing today with it and report back. – Deano123 Jan 14 '14 at 08:35
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/45180/discussion-between-deano123-and-abarnert) – Deano123 Jan 14 '14 at 10:32
2

I wrote a more complicated answer separately, because I'm not sure if this will work for you, but:

A dbm is a key-value database that works almost exactly like a dict (whose keys and values are bytes strings), except that it's backed to disk and pages values in and out as needed.

It's much simpler than a SQL database, which has three huge advantages.

  • You don't need to change your code from hash = hashes[thingy] to hash = db.execute('SELECT * FROM Hashes WHERE Thingy = ?', (thingy,)).fetchone()[0], you just continue to use hashes[thingy].

  • There are no indices to get right—the key hash is the one and only index for the one table in the database—and no other optimizations to do.

  • A good chunk of the database will be cached in memory, making it fast.


Despite being called a "Unix database", at least one of the various dbm family of modules exists on every platform. See dbm: not just for Unix for details.
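
Usage is essentially the dict code you already have. A minimal sketch, assuming Python 3's dbm module and raw 32-byte digests as both keys and values:

import dbm, hashlib

db = dbm.open('hashes.db', 'c')   # 'c' creates the file if it doesn't exist

key = hashlib.sha256(b'some chunk').digest()
value = hashlib.sha256(b'another chunk').digest()

db[key] = value                   # insert, just like a dict
if key in db:                     # existence check
    stored = db[key]              # lookup

db.close()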

abarnert
2

If you don't want to or can't use an external database, you can create an in-memory database that is much closer to the information-theoretic minimum in memory use while being extremely fast. You need to use lower-level tools than Python objects, though.

You could use an array.array or bytearray to store the keys and values without any overhead, which means 8M entries fit into 488 MiB. Then you can write a hash table on top of that. This is rather inconvenient, though, so you may want to use an external library like cffi for working with tightly packed C structures and arrays.

A simple open-addressing hash table with linear probing could work very well for your data (take the lowest N bits of the key as the hash) and is not too hard to implement, even easier if you don't need deletion. Just keep the load factor reasonable, between one half and two thirds. If you want to save space (each empty entry wastes 64 bytes: a 32-byte key slot plus a 32-byte value slot), pack the key-value pairs tightly into a separate array and only store a pointer/index in the hash table.
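
A sketch of such a table in pure Python (my names; this is the simple, non-packed variant, so empty slots cost the full 64 bytes, there is no deletion or resizing, and it will loop forever if you let it fill up completely):

class OpenTable:
    SLOT = 32
    EMPTY = bytes(32)  # assumes the all-zero digest never occurs as a key

    def __init__(self, capacity=1 << 24):   # capacity must be a power of two
        self.mask = capacity - 1
        self.keys = bytearray(capacity * self.SLOT)
        self.values = bytearray(capacity * self.SLOT)

    def _find(self, key):
        # the lowest bits of the key pick the starting slot; probe linearly from there
        i = int.from_bytes(key[:4], 'little') & self.mask
        while True:
            off = i * self.SLOT
            slot = self.keys[off:off + self.SLOT]
            if slot == key or slot == self.EMPTY:
                return off, slot == key
            i = (i + 1) & self.mask

    def put(self, key, value):
        off, _ = self._find(key)
        self.keys[off:off + self.SLOT] = key
        self.values[off:off + self.SLOT] = value

    def get(self, key):
        off, found = self._find(key)
        return bytes(self.values[off:off + self.SLOT]) if found else None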

  • He _does_ need dynamic insertion. See his comments on Martijn's answer. – abarnert Jan 10 '14 at 19:53
  • @abarnert Noted and adjusted answer. –  Jan 10 '14 at 20:14
  • 1
    Just for fun, [here](https://github.com/abarnert/fixedhash) is a sample implementation. It's probably got some bugs, and it's anywhere from 3x-278x slower than `dict` in my tests, but it should be easy to understand. – abarnert Jan 14 '14 at 21:33
1

Use the sqlite3 library to store your hashes in a database. The sqlite embedded database will take care of memory management for you using memory buffering and on-disk storage as best it can to satisfy your queries.

A very simple table will suffice:

import sqlite3

connection = sqlite3.connect('/tmp/hashes.db')
connection.execute('CREATE TABLE hashes (key UNIQUE, value)')

then use:

with connection:
    cursor = connection.cursor()
    sql = 'INSERT INTO hashes VALUES (?, ?)'
    cursor.executemany(sql, ((key1, hash1), (key2, hash2), ....))

and you can query the database with:

with connection:
    cursor = connection.cursor()
    sql = 'SELECT value FROM hashes WHERE key=?'
    cursor.execute(sql, (key,))
    hash = cursor.fetchone()
    if hash is not None:
        hash = hash[0]
Martijn Pieters
  • Sorry I have updated the question, I am already using sqlite in my application for permanent storage. – Deano123 Jan 10 '14 at 19:31
  • @Deano123: Have you considered temporarily disabling the index then recreating it *after* your insertions? – Martijn Pieters Jan 10 '14 at 19:32
  • @Deano123: also see [Sqlite3: Disabling primary key index while inserting?](http://stackoverflow.com/q/788568) – Martijn Pieters Jan 10 '14 at 19:34
  • Yes, I can't do that because I have to constantly insert new values; there is no way I can batch-add them. Once my initial process is complete and all the values are loaded into the database, to get the data back out again I actually create an index so I can read them back faster. – Deano123 Jan 10 '14 at 19:35
  • Then it's the way you are inserting your hashes that is inefficient; is there no way you can collect batches of hashes for grouped insertions? Or do you need access to the already-produced hashes as you generate more? – Martijn Pieters Jan 10 '14 at 19:42
  • Hi Martijn, the software is for backing up files; the files are cut into chunks and then each chunk is hashed. I need to know if that hash already exists, so I can't see any way of producing the hashes in batches large enough to make it work. The database is constantly being added to. – Deano123 Jan 10 '14 at 19:44
0

Since it's a hash, you could implement a dictionary in a file using seek()/tell().... You would need to pre-create the file to the given size though (10GB or whatever)...

Then it's just like any other hash table. You have to deal with collisions yourself, and it will obviously be slower because it's in a file...

offset = dict_hash(in_key)
dict_file.seek(offset)
hashval = dict_file.read(hash_len)

Something like that (I used to do things like this in C; I haven't done it in Python, but the underlying file support is the same...).
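
A slightly fuller sketch of that (my names; fixed-size slots, Python 3, and collision handling is left out entirely):

SLOT = 64                  # 32-byte key + 32-byte value per slot
NSLOTS = 1 << 24           # number of slots the file is pre-created with

def dict_hash(key):
    # the low bits of the digest itself pick the slot
    return (int.from_bytes(key[:4], 'little') % NSLOTS) * SLOT

def create(path):
    with open(path, 'wb') as f:
        f.truncate(NSLOTS * SLOT)   # pre-create the file at its full size
    return open(path, 'r+b')

def store(dict_file, key, value):
    dict_file.seek(dict_hash(key))
    dict_file.write(key + value)

def load(dict_file, key):
    dict_file.seek(dict_hash(key))
    record = dict_file.read(SLOT)
    return record[32:] if record[:32] == key else None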

Corley Brigman
  • Why build it yourself when there are modules in the stdlib that do the same thing, with nicer wrappers, with decades of testing and optimization behind them? – abarnert Jan 10 '14 at 19:36
  • Good point - I have looked at dbm once or twice, but got scared off by the doc page's Unix orientation. That's a better idea. – Corley Brigman Jan 10 '14 at 19:44
0

You don't actually need to store the hash data at all; all you need to do is properly index the hash.

Here are my indexing functions for the hash.

First, we have this function for fast indexing of the hash:

ZequebaHashB[bvc_, yvc_, avc_] :=
 {Drop[Flatten[Reap[
  Module[{a34 = bvc, a35 = yvc, rt2 = avc, z75, zler},
    z75 = 1;
    zler = Total[BCC[Re[Floor[Px[a34, a35^2]]], a35^3]];
    Label[Start5629];
    Sow[Denominator[zler/FiveItt[a35, z75]], yvc];
    z75 = z75 + 1;
    If[z75 >= rt2 + 1, Goto[end5629]];
    Goto[Start5629];
    Label[end5629];
    ];]], 1], Total[BCC[Floor[Re[Px[bvc, yvc^2]]], yvc^3]], bvc};

Second, we have this function for getting a rainbow index of the hash:

RainbowHashXG[zf_, xz_, zd_, fd_] := 
Column[{Table[
 Flatten[Drop[ZequebaHashB[zf + xf, xz, zd], -2]], {xf, 0, 
  fd - 1}], zf}];

Now, when you try these functions:

Table[ZequebaHashB[Hash[xu, "SHA512"], 2, 5], {xu, 1, 10}]

{{{1, 2, 3, 4, 5}, 427, 

12579926171497332473039920596952835386489858401292624452730263741969\ 1347390182282976402981790496477460666208142347425205936701161323553455\ 43156774710409041}, {{1, 1, 1, 1, 5}, 396, 37854471215291391986149267401049113295567628473597440675968265868739\ 3920246834469920751231286910611366704757913119360843344094113813460828\ 6029275267369625}, {{1, 1, 1, 2, 5}, 378, 71668700870008575285238318023246235316098096074289026150051114683524\ 8893999285271969471146596174190457020264703584540790263678736452792747\ 5984118971455163}, {{1, 2, 3, 4, 5}, 377, 33095966240281217830184164668404219514626500609945265788213543056523\ 6612792119604718913684957565086394439681603253709963629672412822522528\ 4694992131191098}, {{1, 2, 1, 4, 5}, 363, 86087420302049294430262146818103852368792727362988712093781053088200\ 5531339261473092981846995901587757487311471069416835834626804973821926\ 684090578667825}, {{1, 1, 3, 2, 5}, 374, 18586086601485268646467765285794047467027639305129763019055665664163\ 2819380637531124748570695025942793945139516664108034654512831533948189\ 743738184270224}, {{1, 1, 3, 1, 1}, 380, 72109882448403363840259529414390721196358024901859951350044294221621\ 3409708767088486766304397692430037767785681544787701437132358156239382\ 5256452011168475}, {{1, 2, 3, 4, 5}, 397, 22760214977694020069971224118591466739483553732805530503408373418535\ 1711847169063849360187954434350675389187296376543635586233555068331343\ 3001046271103001}, {{1, 2, 1, 4, 5}, 369, 11906459655144790308170064541982556680120578173098014909650827827844\ 2313493552143468785692756291539132782149145837942478466345517803751070\ 21641806135272354}, {{1, 1, 3, 2, 5}, 382, 88155955858214177781767282869972903505820511583564376117417944351446\ 8458315518532665921338085983977628624644833036888032312932654944528755\

5410805140620789}}

    Table[RainbowHashXG[Hash[xu, "SHA512"], 2, 5, 5], {xu, 1, 10}]
    

    {{{{1, 1, 1, 1, 5}, {1, 2, 3, 4, 5}, {1, 1, 3, 2, 5}, {1, 2, 1, 4, 1}, {1, 1, 3, 1, 5}},
    12579926171497332473039920596952835386489858401292624452730263741969\ 1347390182282976402981790496477460666208142347425205936701161323553455\ 43156774710409041}, {{{1, 2, 1, 4, 5}, {1, 1, 3, 2, 5}, {1, 2, 3, 4, 1}, {1, 1, 1, 1, 5}, {1, 2, 3, 4, 5}},
    37854471215291391986149267401049113295567628473597440675968265868739\ 3920246834469920751231286910611366704757913119360843344094113813460828\ 6029275267369625}, {{{1, 2, 3, 4, 5}, {1, 1, 1, 1, 5}, {1, 2, 3, 4, 5}, {1, 1, 3, 2, 5}, {1, 2, 1, 4, 1}},
    71668700870008575285238318023246235316098096074289026150051114683524\ 8893999285271969471146596174190457020264703584540790263678736452792747\ 5984118971455163}, {{{1, 2, 3, 4, 5}, {1, 1, 1, 1, 5}, {1, 2, 3, 4, 5}, {1, 1, 3, 2, 1}, {1, 2, 1, 4, 5}},
    33095966240281217830184164668404219514626500609945265788213543056523\ 6612792119604718913684957565086394439681603253709963629672412822522528\ 4694992131191098}, {{{1, 2, 3, 4, 1}, {1, 1, 3, 1, 5}, {1, 2, 1, 4, 5}, {1, 1, 3, 2, 5}, {1, 1, 3, 1, 5}},
    86087420302049294430262146818103852368792727362988712093781053088200\ 5531339261473092981846995901587757487311471069416835834626804973821926\ 684090578667825}, {{{1, 1, 1, 1, 5}, {1, 2, 3, 4, 5}, {1, 1, 3, 2, 5}, {1, 2, 1, 4, 1}, {1, 1, 3, 1, 5}},
    18586086601485268646467765285794047467027639305129763019055665664163\ 2819380637531124748570695025942793945139516664108034654512831533948189\ 743738184270224}, {{{1, 2, 3, 4, 1}, {1, 1, 1, 2, 5}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 5}, {1, 1, 3, 2, 5}},
    72109882448403363840259529414390721196358024901859951350044294221621\ 3409708767088486766304397692430037767785681544787701437132358156239382\ 5256452011168475}, {{{1, 1, 3, 1, 5}, {1, 2, 3, 4, 1}, {1, 1, 1, 2, 5}, {1, 2, 3, 4, 5}, {1, 1, 3, 1, 5}},
    22760214977694020069971224118591466739483553732805530503408373418535\ 1711847169063849360187954434350675389187296376543635586233555068331343\ 3001046271103001}, {{{1, 1, 1, 2, 5}, {1, 2, 3, 4, 5}, {1, 1, 3, 1, 1}, {1, 2, 1, 4, 5}, {1, 2, 1, 4, 1}},
    11906459655144790308170064541982556680120578173098014909650827827844\ 2313493552143468785692756291539132782149145837942478466345517803751070\ 21641806135272354}, {{{1, 2, 1, 4, 5}, {1, 1, 3, 1, 1}, {1, 2, 3, 4, 5}, {1, 1, 1, 2, 5}, {1, 2, 3, 4, 5}},
    88155955858214177781767282869972903505820511583564376117417944351446\ 8458315518532665921338085983977628624644833036888032312932654944528755\ 5410805140620789}}

    FiveItt[x98_, cc5_] :=
    DifferenceRoot[
    Function[{\[FormalY], \[FormalN]}, {-cc5 -
    cc5 \[FormalY][\[FormalN]] + \[FormalY][1 + \[FormalN]] == 0, \[FormalY][1] == 1, \[FormalY][2] == cc5}]][x98];
    

    BCC[x55_, g77_] := Drop[Flatten[Reap[ Module[ {x45 = x55, z7 = 0, z8 = 0, z9, g7 = g77, bell},

    z7 = If[x45/FiveItt[Length[IntegerDigits[x45, g7]], g7] <= 1, If[x45 == 1, 1, Length[IntegerDigits[x45, g7]] - 1], Length[IntegerDigits[x45, g7]]]; bell = FiveItt[z7 - 1, g7]; z9 = g7^(z7 - 1);

    Label[SPo]; z8 = If[IntegerQ[x45/g7] && x45 > g7, Quotient[x45 - bell - (1/(2*g7)), z9], If[x45 <= g7, x45, Quotient[x45 - bell, z9]]]; Sow[z8]; x45 = x45 - (z8*(z9)); z7 = z7 - 1; z9 = z9/g7; bell = bell - z9;

    If[z7 < 1, Goto[EnD], Goto[SPo]];

    Label[EnD];

    ] ]], 1];

    Px = Compile[{{x1d, _Complex}, {si1d, _Real}}, Module[{x1c = x1d, si1c = si1d}, x1c + 1/2 (Floor[Re[(-4 + si1c + Sqrt[(-4 + si1c)^2 + 8 (-2 + si1c) (-1 + x1d)])/(2 (-2 + si1c))]] + Floor[Im[(-4 + si1c + Sqrt[(-4 + si1c)^2 + 8 (-2 + si1c) (-1 + x1d)])/(2 (-2 + si1c))]] I) (-4 + si1c - (-2 + si1c) (Floor[Re[(-4 + si1c + Sqrt[(-4 + si1c)^2 + 8 (-2 + si1c) (-1 + x1c)])/(2 (-2 + si1c))]] + Floor[Im[(-4 + si1c + Sqrt[(-4 + si1c)^2 + 8 (-2 + si1c) (-1 + x1c)])/(2 (-2 + si1c))]] I))], CompilationTarget -> "C", "RuntimeOptions" -> "Speed"];