21

I've been having a hard time using a large dictionary (~86GB, 1.75 billion keys) to process a big dataset (2TB) using multiprocessing in Python.

Context: a dictionary mapping strings to strings is loaded from pickled files into memory. Once loaded, worker processes (ideally >32) are created that must look up values in the dictionary but not modify its contents, in order to process the ~2TB dataset. The dataset needs to be processed in parallel; otherwise the task would take over a month.

Here are the nine approaches (all failing) that I have tried:

  1. Store the dictionary as a global variable in the Python program and then fork the ~32 worker processes (a minimal sketch of this approach appears after this list). Theoretically this method might work: since the dictionary is not being modified, the COW mechanism of fork on Linux should mean that the data structure is shared, not copied, among processes. However, when I attempt this, my program crashes on os.fork() inside multiprocessing.Pool.map with OSError: [Errno 12] Cannot allocate memory. I'm convinced that this is because the kernel is configured to never overcommit memory (/proc/sys/vm/overcommit_memory is set to 2, and I can't change this setting on the machine since I don't have root access).

  2. Load the dictionary into a shared dictionary created with multiprocessing.Manager().dict(). With this approach I was able to fork the 32 worker processes without crashing, but the subsequent data processing is orders of magnitude slower than another version of the task that required no dictionary (the only difference being the absence of dictionary lookups). I theorize that this is because of the inter-process communication between the manager process holding the dictionary and each worker process, which is required for every single dictionary lookup. Although the dictionary is not being modified, it is being accessed many, many times, often simultaneously by many processes.

  3. Copy the dictionary into a C++ std::map and rely on Linux's COW mechanism to prevent it from being copied (like approach #1, except with the dictionary in C++). With this approach, it took a long time to load the dictionary into std::map, and the program subsequently crashed with ENOMEM on os.fork() just as before.

  4. Copy the dictionary into pyshmht. It takes far too long to copy the dictionary in.

  5. Try using SNAP's HashTable. The underlying C++ implementation allows it to be created and used in shared memory. Unfortunately, the Python API does not expose this functionality.

  6. Use PyPy. Crash still happened as in #1.

  7. Implement my own shared-memory hash table in Python on top of multiprocessing.Array. This approach still resulted in the out-of-memory error that occurred in #1.

  8. Dump the dictionary into dbm. After trying to dump the dictionary into a dbm database for four days and seeing an ETA of "33 days", I gave up on this approach.

  9. Dump the dictionary into Redis. When I try to dump the dictionaries (the 86GB dict is loaded from 1024 smaller dicts) into Redis using redis.mset I get a connection reset by peer error. When I try to dump the key-value pairs using a loop, it takes an extremely long time.
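For concreteness, here is a minimal sketch of approach #1 (simplified; load_pickled_dicts, process_chunk, and the chunk file names are placeholders standing in for my real loading and processing code):

import multiprocessing

# The ~86GB dict lives as a module-level global so that forked workers
# should, in theory, share it via copy-on-write.
lookup_table = {}

def load_pickled_dicts():
    # placeholder: load the 1024 pickled dicts and merge them into lookup_table
    global lookup_table

def process_chunk(chunk_path):
    # workers only ever read from lookup_table, never write to it
    with open(chunk_path) as f:
        return [lookup_table.get(line.strip()) for line in f]

if __name__ == '__main__':
    load_pickled_dicts()
    with multiprocessing.Pool(32) as pool:   # fork happens here -> OSError: [Errno 12]
        results = pool.map(process_chunk, ['chunk_%04d' % i for i in range(1024)])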

How can I process this dataset in parallel efficiently, without requiring inter-process communication to look up values in this dictionary? I would welcome any suggestions for solving this problem!

I'm using Python 3.6.3 from Anaconda on Ubuntu on a machine with 1TB RAM.


Edit: What finally worked:

I was able to get this to work using Redis. To get around the issue in #9, I had to chunk the large key-value insertion and lookup queries into "bite-sized" pieces, so that everything was still processed in batches but no single query was large enough to time out. Doing this allowed the insertion of the 86GB dictionary to be performed in 45 minutes (with 128 threads and some load balancing), and the subsequent processing was not hampered in performance by the Redis lookup queries (it finished in 2 days).
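In case it is useful to anyone, here is roughly what the chunked insertion and lookup looked like (a simplified sketch rather than my exact code; the connection settings, chunk size, and helper names are placeholders):

import itertools
import redis

r = redis.Redis(host='localhost', port=6379)   # placeholder connection settings

def chunked(pairs, size):
    # yield successive dicts of at most `size` key-value pairs
    it = iter(pairs)
    while True:
        batch = dict(itertools.islice(it, size))
        if not batch:
            return
        yield batch

def insert_dict(sub_dict, chunk_size=10000):
    # break each of the 1024 sub-dictionaries into bite-sized MSETs instead of
    # one huge redis.mset call (which gave "connection reset by peer")
    for batch in chunked(sub_dict.items(), chunk_size):
        r.mset(batch)

def lookup_many(keys, chunk_size=10000):
    # batch the lookups into bite-sized MGETs as well
    values = []
    for i in range(0, len(keys), chunk_size):
        values.extend(r.mget(keys[i:i + chunk_size]))
    return values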

Thank you all for your help and suggestions.

marc_s
Jon Deaton
  • 3
    CPython refcounting means you write to an object if you so much as *look* at it, or even if you don't look at it, but a reference to it passes through your hands. This doesn't play well with copy-on-write. – user2357112 Mar 22 '18 at 21:43
  • @user2357112 Indeed, I thought that might be the reason why approach #1 failed. – Jon Deaton Mar 22 '18 at 21:45
  • What are the key and value types? And do you care about worst-case time, or just usually-fast? If you're mapping 8-bit strings to 8-bit strings, using a local key-value store, maybe even one as simple as `dbhash` or `gdbm`, may be reasonably efficient almost all the time as long as you leave enough memory free for caching. – abarnert Mar 22 '18 at 21:46
  • Meanwhile, an alternative to the C++ idea is to use a giant `multiprocessing.Array` for sharing, and build a simple (immutable) hash table on top of that. Or of course you could use the C++ idea and explicitly share the memory instead of forking and crossing your fingers and hoping for COW. – abarnert Mar 22 '18 at 21:47
  • @abarnert strings to strings. C-strings would get the job done since there aren't any special characters. – Jon Deaton Mar 22 '18 at 21:47
  • See if PyPy might perform better. It doesn't use refcounting, but you still might have problems with the GC writing to objects (I don't know how PyPy GC works). – user2357112 Mar 22 '18 at 21:47
  • 1
    @user2357112 The refcounting isn't a performance issue, it's a correctness issue—he gets an `ENOMEM` error while trying to fork and copy. – abarnert Mar 22 '18 at 21:48
  • @abarnert: I'm not sure what you're trying to say there. I'm saying that PyPy might trigger less writes (and thus less copy-on-write copies) because no refcounting means looking at an object isn't a write, so it might not run out of memory. – user2357112 Mar 22 '18 at 21:50
  • @JonDeaton OK, then maybe try gdbm and dbhash to see if they're good enough out of the box. (Although make sure your gdbm is built with high enough limits… IIRC, the default is 64K keys, even if it's rarely compiled with the default.) If not, you may want to look at different options before looking at the more modern local key-value stores, but using stdlib modules is simple enough to check in a few minutes. – abarnert Mar 22 '18 at 21:50
  • @user2357112 He hasn't even gotten to any writes (that is, reads that make a new reference) at the time he's hitting the error, so how could reducing the number/frequency of writes solve that? – abarnert Mar 22 '18 at 21:52
  • @user2357112 I'm open to trying PyPy if it potentially avoids copying the dictionary on forking. Seems like it would be a good first thing to try. – Jon Deaton Mar 22 '18 at 21:53
  • One thing PyPy could _definitely_ do is speed up a pure-Python immutable hash table written on top of a `multiprocessing.Array` or hunk of raw shared memory. – abarnert Mar 22 '18 at 22:01
  • @abarnert I was wondering if you could clarify a little bit about how to use a `multiprocessing.Array` as an immutable hash table. I'm interested in this idea but not sure how to go about it. – Jon Deaton Mar 22 '18 at 23:51
  • @JonDeaton How much do you know about implementing a hash table? Each slot in the hash table is just a key and value (either embedded strings if they're short and with little variance, or offsets into a separate string table otherwise). You define that as a `Struct`, and then create a `multiprocessing.Array` of that `Struct` type, and you've got a shared-memory hash table. Since nobody's mutating it after creation, you don't need to worry about synchronization. And you can wrap it up in an object that looks like a `collections.abc.Mapping` pretty easily. – abarnert Mar 23 '18 at 00:32
  • @abarnert I have written hash tables in C before. What about hash collisions though? Shouldn't I make a `multiprocessing.Array` of buckets, each of which contains a list of key-value pairs to solve the hash collision problem? – Jon Deaton Mar 23 '18 at 00:35
  • You can do it that way. But I think it makes more sense to use probing than link-chaining here. That way you don't have to go chasing pointers in Python code. If `table.slots[keyhash % len(table.slots)].key != key` you just try again with `keyhash+1` and so on until you find it. (Since this is an immutable table that's going to be read much more often than written, you may want to waste time rebuilding it N times with random probing offsets to see which way ends up with the fewest reprobes, or other such optimizations, but probably not necessary.) – abarnert Mar 23 '18 at 00:37
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/167376/discussion-between-jon-deaton-and-abarnert). – Jon Deaton Mar 23 '18 at 00:53
  • 2
    Why not use a DB or something like Redis if you want everything in memory for speed? – juanpa.arrivillaga Mar 23 '18 at 01:22
  • @juanpa.arrivillaga I'm open to that idea, however since this is a one-time data transformation and that would involve re-writing the majority of the pipeline that I've already written, I would like to avoid that. However, if I get really desperate I might give that a try. – Jon Deaton Mar 23 '18 at 01:25
  • 2
    @JonDeaton Redis would be pretty painless, I don't think you'd have to re-write anything. You could probably wrap the Redis client in some class that implements `__getitem__` and `__setitem__` and it would be a drop-in replacement for your `dict`. I'm just saying, Redis *solves this problem already*. Almost certainly, it would require less effort than implementing a hash-map over `multiprocessing.Array` – juanpa.arrivillaga Mar 23 '18 at 02:21
  • @JonDeaton i.e. https://redislabs.com/lp/python-redis/ – juanpa.arrivillaga Mar 23 '18 at 02:23
  • I would go for Redis as well for this. I was about to post the same and then saw that a comment already exists. See this [thread](https://redis.io/topics/virtual-memory) on Redis about using Redis in VM mode. – Tarun Lalwani Mar 25 '18 at 18:23
  • It seems that nobody mentioned that you could have hit the limits set by Linux ulimit and cgroups. Here is a small introduction: https://unix.stackexchange.com/questions/302938/about-ulimit-setrlimit-and-cgroup – VPfB Mar 26 '18 at 08:51
  • 2
    You really should avoid building a dict as large as this in memory. Use a database instead. Redis, SQLite, a heavier database, and use a wrapper that implements the mapping interface if you don’t want to retool all your code. – Martijn Pieters Mar 27 '18 at 05:58
  • @juanpa.arrivillaga Thank you juanpa for your suggestion to use Redis. Using Redis was actually what ended up solving this problem! – Jon Deaton Apr 06 '18 at 00:24

10 Answers

10

You should probably use a system that's meant for sharing large amounts of data with many different processes -- like a Database.

Take your giant dataset and create a schema for it and dump it into a database. You could even put it on a separate machine.

Then launch as many processes as you want, across as many hosts as you want, to process the data in parallel. Pretty much any modern database will be more than capable of handling the load.
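As one concrete illustration of the pattern (the answer doesn't prescribe a particular engine), here is a minimal sketch of the key-to-value part using the stdlib sqlite3 module; big_dict stands for the already-loaded mapping, and the file/table names are made up:

import sqlite3

# One-time load: dump the mapping into an indexed table.
conn = sqlite3.connect('lookup.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')
conn.executemany('INSERT OR REPLACE INTO kv VALUES (?, ?)', big_dict.items())
conn.commit()

# In each worker process: open its own read-only connection and query as needed.
worker_conn = sqlite3.connect('file:lookup.db?mode=ro', uri=True)

def lookup(key):
    row = worker_conn.execute('SELECT value FROM kv WHERE key = ?', (key,)).fetchone()
    return row[0] if row is not None else None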

Brendan Abel
  • Would you be able to suggest a specific database that would work well for this? I tried using `dbm` and `redis` and both took an extremely long time to load the data into it. – Jon Deaton Mar 31 '18 at 18:00
  • `redis` is designed to store everything in memory, which isn't really possible with a 2TB dataset. I'm a fan of `postgresql` with `sqlalchemy` as the python ORM. Unfortunately, it will likely require a significant refactor of your code, or at least require an abstraction layer to turn database queries into dictionaries that your code can process. – Brendan Abel Mar 31 '18 at 22:39
  • Oh, I'm not storing the whole 2TB dataset in memory, just the 86GB key-value mapping. The dataset itself is processed incrementally and is never all in memory at once. – Jon Deaton Mar 31 '18 at 22:53
  • @JonDeaton Same deal with the 86 GB. You're probably going to have to query just that part of the dictionary you need to process and not use the whole dictionary at once. – Brendan Abel Apr 01 '18 at 00:53
6

Instead of using a dictionary, use a data structure that compresses data, but still has fast lookups.

e.g:

  • keyvi: https://github.com/cliqz-oss/keyvi. keyvi is an FSA-based key-value data structure optimized for space and lookup speed. Multiple processes reading from keyvi will reuse the memory, because a keyvi structure is memory-mapped and uses shared memory. Since your worker processes don't need to modify the data structure, I think this would be your best bet.

  • marisa trie: https://github.com/pytries/marisa-trie static trie structure for Python, based on the marisa-trie C++ library. Like keyvi, marisa-trie also uses memory-mapping. Multiple processes using the same trie will use the same memory.

EDIT:

To use keyvi for this task, you can first install it with pip install pykeyvi. Then use it like this:

from pykeyvi import StringDictionaryCompiler, Dictionary

# Create the dictionary
compiler = StringDictionaryCompiler()
compiler.Add('foo', 'bar')
compiler.Add('key', 'value')
compiler.Compile()
compiler.WriteToFile('test.keyvi')

# Use the dictionary
dct = Dictionary('test.keyvi')
dct['foo'].GetValue()
> 'bar'
dct['key'].GetValue()
> 'value'

marisa-trie is just a trie, so it wouldn't work as a mapping out of the box, but you can, for example, use a delimiter character to separate keys from values.
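Alternatively, marisa-trie's BytesTrie variant stores a bytes payload per key, which avoids the delimiter trick. A small sketch along the lines of the marisa-trie docs (not tested at this scale):

import marisa_trie

# Build once: BytesTrie maps unicode keys to bytes values.
trie = marisa_trie.BytesTrie([('foo', b'bar'), ('key', b'value')])
trie.save('test.marisa')

# In each worker: memory-map the same file; the OS shares the pages between processes.
shared = marisa_trie.BytesTrie()
shared.mmap('test.marisa')
shared['foo']    # [b'bar'] -- a list, since a key may map to several values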

tomas
  • 2
    In agreement here. There are some easy to use hash maps that are memory mapped and for the most used keys (or pages) will perform very fast. Take a look at lmdb: https://lmdb.readthedocs.io/en/release/ . No server is needed and they support multiple processes as explained here: http://www.lmdb.tech/doc/ . – Oren Mar 29 '18 at 19:38
5

If you can successfully load that data into a single process in point 1, you can most likely work around the problem of fork doing copies by using gc.freeze introduced in https://bugs.python.org/issue31558

You have to use Python 3.7+ and call that function before you fork (or before you do the map over the process pool).

Since this requires a virtual copy of the whole memory for the CoW to work, you need to make sure your overcommit settings allow you to do that.
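A rough sketch of what that could look like, assuming the overcommit settings allow the fork in the first place (load_dictionary, chunks, and the worker body are placeholders):

import gc
import multiprocessing

lookup_table = load_dictionary()   # placeholder: load the ~86GB dict once in the parent

# Python 3.7+: move everything allocated so far into a permanent generation so
# that garbage collection passes skip those objects and don't dirty their pages
# after the fork (note that reference counting can still dirty pages).
gc.freeze()

def worker(chunk):
    # read-only lookups against the inherited, copy-on-write-shared dict
    return [lookup_table.get(key) for key in chunk]

with multiprocessing.Pool(32) as pool:
    results = pool.map(worker, chunks)   # chunks: placeholder iterable of work items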

viraptor
  • I'm open to this idea but not entirely convinced that it would work. The problem with the fork is that it's failing because the system detects that there would not be enough memory *if* the entire process's memory needed to be copied. – Jon Deaton Mar 25 '18 at 02:45
  • @JonDeaton I don't think that's why the error is raised. It's not Python's job to try predicting memory usage. If Python says it can't allocate something, that's normally at the point when it actually tries to do that. It means that either it tries to do a large copy early (gc.freeze would help), or system prevents overcommit (adjust the overcommit settings https://www.kernel.org/doc/Documentation/vm/overcommit-accounting ) – viraptor Mar 25 '18 at 02:52
  • Extending overcommit should be safe, because even though the CoW mapping will be huge (especially with 32 processes), you know you're never going to actually use that memory. – viraptor Mar 25 '18 at 02:53
  • You are right it's not Python's job but it is the OS's job. `/proc/sys/vm/overcommit_memory` is `2` on my machine, meaning that Linux will not overcommit memory, and I'm pretty sure that this is why `fork` is failing – Jon Deaton Mar 25 '18 at 02:54
  • If you know you're doing a lot of CoW, it's ok to enable unlimited overcommit (or calculate exactly what ratio you need if you want it in production). – viraptor Mar 25 '18 at 02:57
  • 1
    I don't have root access on this machine and can't enable unlimited overcommit. – Jon Deaton Mar 25 '18 at 02:57
  • 1
    I think that's a major issue to solve in that case. If you can't share the pages in python, you won't be able to share them using other means either. – viraptor Mar 25 '18 at 03:01
  • Yes, it is an issue. I can share pages; I just can't share so many that the OS wouldn't have the resources to copy them all if need be, even though that need will never arise. – Jon Deaton Mar 25 '18 at 03:02
5

As most people here already mentioned:
Don't use that big a dictionary; dump it into a database instead!

After dumping your data into a database, using indexes will help reduce data retrieval times.
A good explanation of indexing for PostgreSQL databases can be found here.
You can optimize your database even further (I give PostgreSQL as an example because that is what I mostly use, but the concepts apply to almost every database).


Assuming you did the above (or if you want to use the dictionary either way...), you can implement a parallel and asynchronous processing routine using Python's asyncio (needs Python version >= 3.4).

The base idea is to create a mapping method to assign (map) an asynchronous task to each item of an iterable and register each task to asyncio's event_loop.

Finally, we will collect all those promises with asyncio.gather and we will wait to receive all the results.

A skeleton code example of this idea:

import asyncio

async def my_processing(value):
    processed_value = value  # do stuff with the value...
    return processed_value

def my_async_map(my_coroutine, my_iterable):
    my_loop = asyncio.get_event_loop()
    my_future = asyncio.gather(
        *(my_coroutine(val) for val in my_iterable)
    )
    return my_loop.run_until_complete(my_future)

my_async_map(my_processing, my_ginormous_iterable)


You can use gevent instead of asyncio, but keep in mind that asyncio is part of the standard library.

Gevent implementation:

import gevent
from gevent.pool import Group

def my_processing(value):
    processed_value = value  # do stuff with the value...
    return processed_value

def my_async_map(my_coroutine, my_iterable):
    my_group = Group()
    return my_group.map(my_coroutine, my_iterable)

my_async_map(my_processing, my_ginormous_iterable)
John Moutafis
4

The already mentioned keyvi (http://keyvi.org) sounds like the best option to me, because "python shared memory dictionary" describes exactly what it is. I am the author of keyvi, call me biased, but give me the chance to explain:

Shared memory makes it scalable, especially for Python, where the GIL problem forces you to use multiprocessing rather than threading. That's why a heap-based, in-process solution wouldn't scale. Also, shared memory can be bigger than main memory; parts can be swapped in and out.

External-process, network-based solutions require an extra network hop, which you can avoid by using keyvi; this makes a big performance difference even on the local machine. The question is also whether the external process is single-threaded and therefore introduces a bottleneck again.

As for your dictionary size (86GB): there is a good chance that keyvi compresses that nicely, but it is hard to say without knowing the data.

As for processing: Note that keyvi works nicely in pySpark/Hadoop.

Your use case, by the way, is exactly what keyvi is used for in production, even at a higher scale.

The Redis solution sounds good, at least better than a generic database solution. To saturate the cores you should use several instances and divide the key space using consistent hashing. Still, I am sure that keyvi would scale way better; you should try it if you have to repeat the task and/or need to process more data.

Last but not least, you will find nice material on the website explaining the above in more detail.

3

Maybe you should try doing it in a database, or try using Dask to solve your problem, letting Dask take care of the low-level multiprocessing. You can then focus on the main question you want to answer with that large dataset. Here is the link you may want to look at: Dask
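A hedged sketch of what that could look like with dask.bag (the paths and process_line are placeholders, and the 86GB lookup table itself would still need one of the sharing strategies discussed elsewhere on this page):

import dask.bag as db

# Lazily read the ~2TB dataset as a bag of lines, split into partitions.
lines = db.read_text('/data/chunks/*.txt')

def process_line(line):
    # placeholder for the per-line transformation / dictionary lookup
    return line.strip()

# Map over every line and write the results back out; Dask schedules the
# partitions across worker processes for you.
lines.map(process_line).to_textfiles('/data/output/*.txt')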

ye jiawei
2

Well, I do believe that Redis or a database would be the easiest and quickest fix.

But from what I understand, why not reduce the problem, building on your second solution? That is, first load a portion of the billion keys into memory (say 50 million). Then, using multiprocessing, create a pool to work on the 2TB file. If a line's key exists in the loaded table, push the data to a list of processed lines; if it doesn't exist, push it to a list of unprocessed lines. Once you complete reading the dataset, pickle the unprocessed list and flush the keys you have stored from memory. Then load the next 50 million keys and repeat the process, this time reading from the pickled list. Once it is finished completely, read all your pickled objects.
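A rough sketch of that multi-pass idea, assuming the dictionary can be loaded in slices and the unmatched lines fit on disk between passes (process, the key extraction, and the file names are all placeholders):

import pickle

def one_pass(input_lines, partial_dict, matched_out, leftover_path):
    # process whatever we can with the currently-loaded slice of keys
    leftovers = []
    for line in input_lines:
        key = line.split('\t', 1)[0]              # placeholder key extraction
        if key in partial_dict:
            matched_out.write(process(line, partial_dict[key]))
        else:
            leftovers.append(line)
    with open(leftover_path, 'wb') as f:
        pickle.dump(leftovers, f)                 # re-read these on the next pass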

This should handle the speed issue that you were facing. Of course, I have very little knowledge of your dataset and do not know if this is even feasible. You might be left with lines that never got a proper dictionary key match, but at that point your data size would be significantly reduced.

Don't know if that is of any help.

Haris Nadeem
  • I tried using Redis for this. It seems to be taking an extremely long time to load the data into the database. I can't use `redis.mset` since I get a connection reset by peer, and inserting the 2 billion key-value pairs takes forever, even if multiprocessing is used. – Jon Deaton Mar 31 '18 at 18:03
  • I see. Are you aggregating the results in memory or are you writing each line to file that has been processed to disk? – Haris Nadeem Mar 31 '18 at 18:30
  • I believe that I am aggregating the results in memory. I take each of the 1024 dictionaries and dump them one by one into the Redis database through a local network connection. It takes a long time to loop through the key-value pairs, and I cannot use `redis.mset` because each of those dictionaries is too large and I get `redis.exceptions.ConnectionError: connection reset by peer` – Jon Deaton Mar 31 '18 at 18:46
  • I couldn't agree more. Unfortunately I didn't foresee this issue when I set out on the project. If I do anything like this in the future I will make sure to use a more scalable language. – Jon Deaton Mar 31 '18 at 19:31
  • After re-reading everything, I realized the bottleneck arises from using Python itself. Because Python doesn't multithread efficiently (because of the GIL), it uses multiprocessing, with independent processes that require multiple copies of the map. And Redis doesn't make sense seeing as you have a lot of data to load up front rather than incremental insertions. Of course, blaming the language does not mitigate the problem, but it does bring the problem more into focus. Especially since you have 1TB of RAM. – Haris Nadeem Mar 31 '18 at 19:37
  • I am sure switching to a language like Java would help, but it is too time consuming; would you be interested in looking into frameworks such as Apache Beam? I could describe the use case for you – Haris Nadeem Mar 31 '18 at 19:37
  • Sorry, it wouldn't let me edit the comment so I ended up deleting the previous one. So Apache Beam has a Python framework that utilizes a multi-threading and multi-instance approach and scales to the cloud. I am not a big fan of pushing frameworks, but Apache Beam would be a faster switch (it operates mostly on functional programming paradigms with a little object-oriented programming). Of course, it might be a learning curve, but let me know if that's a possibility? If not, I can try to think of another way. And I totally understand, I've made the same mistake several times before – Haris Nadeem Mar 31 '18 at 19:40
  • I ended up being able to get around the up-front insertion bottleneck by splitting the database insertions into "bite-sized" chunks of dictionaries that Redis could handle. It took only about an hour to insert the 86GB dictionary (definitely good enough). – Jon Deaton Apr 01 '18 at 18:31
  • That's awesome! Glad to hear that you were able to get Redis to work for you. Your idea is pretty ingenious; I hadn't thought of that. That should hopefully resolve your memory issue. Since the dictionary isn't being modified, you don't have to implement locks or worry about race conditions. Does that resolve all the issues you were facing? – Haris Nadeem Apr 01 '18 at 19:14
  • I have yet to test out the part of the program that actually does the lookup of values in the Redis database, but from the looks of it, it will be plenty fast enough. I'll have a definite answer on whether Redis solved the entire problem by the end of the day. – Jon Deaton Apr 01 '18 at 19:16
0

Another solution could be to use some existing database driver which can allocate / retire pages as necessary and deal with the index lookup quickly.

dbm has a nice dictionary interface available and with automatic caching of pages may be fast enough for your needs. If nothing is modified, you should be able to effectively cache the whole file at VFS level.

Just remember to disable locking, open in non-synchronized mode, and open for 'r' (read) only, so nothing impacts caching/concurrent access.
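With the GNU variant that could look like the following sketch (the 'u' flag for unlocked access is gdbm-specific, per the dbm.gnu docs; the file name is made up):

import dbm.gnu

# Each worker opens its own read-only, unlocked handle to the same file;
# the OS page cache does the sharing between processes.
db = dbm.gnu.open('lookup.gdbm', 'ru')

def lookup(key):
    try:
        return db[key.encode()].decode()   # gdbm keys and values are bytes
    except KeyError:
        return None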

viraptor
  • Do you know how to configure `dbm` so that it can be written/read to from multiple processes? I'm having a hard time getting that info – Jon Deaton Mar 26 '18 at 02:15
  • 1
    If you force the gnu variant, you can use `rfu` for mode: https://docs.python.org/3/library/dbm.html#dbm.gnu.open and open in each process. That's if you want to read though. I don't know if you can do concurrent writes. – viraptor Mar 26 '18 at 03:13
  • Oh alright, if it can't be written to by multiple processes then that's OK. I was just hoping it could be, so that I could load it in faster, but as long as it can be read in parallel, it should solve my problem. – Jon Deaton Mar 26 '18 at 04:03
  • I'm giving `dbm` a try, but the way I'm doing it, it looks like it would take about a week to copy the 86GB dictionary into `dbm`. I'm just looping through all the keys and inserting them into the database, and have opened it with `cfu`. Should it be taking this long, and if not, how can I make this workable? – Jon Deaton Mar 26 '18 at 18:19
  • Sorry, I don't know about the insert speed. It was a suggestion for the reading side mostly :( – viraptor Mar 26 '18 at 22:40
0

Since you're only looking to create a read-only dictionary, it is possible that you can get better speed than some off-the-shelf databases by rolling your own simple version. Perhaps you could try something like:

import os.path
import functools
db_dir = '/path/to/my/dbdir'

def write(key, value):
    path = os.path.join(db_dir, key)
    with open(path, 'w') as f:
        f.write(value)

@functools.lru_cache(maxsize=None)
def read(key):
    path = os.path.join(db_dir, key)
    with open(path) as f:
        return f.read()

This will create a folder full of text files. The name of each file is the dictionary key and the contents are the value. Timing this myself I get about 300us per write (using a local SSD). Using those numbers theoretically the time taken to write your 1.75 billion keys would be about a week but this is easily parallelisable so you might be able to get it done a lot faster.

For reading I get about 150us per read with warm cache and 5ms cold cache (I mean the OS file cache here). If your access pattern is repetitive you could memoize your read function in process with lru_cache as above.

You may find that storing this many files in one directory is not possible with your filesystem or that it is inefficient for the OS. In that case you can do like the .git/objects folder: Store the key abcd in a file called ab/cd (i.e. in a file cd in folder ab).
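The sharded variant only changes how the path is computed, e.g. (a sketch reusing db_dir from the code above):

import os

def key_path(key):
    # .git/objects-style sharding: key 'abcd...' lives at db_dir/ab/cd...
    return os.path.join(db_dir, key[:2], key[2:])

def write(key, value):
    path = key_path(key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'w') as f:
        f.write(value)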

The above would take something like 15TB on disk based on a 4KB block size. You could make it more efficient on disk and for OS caching by trying to group together keys by the first n letters so that each file is closer to the 4KB block size. The way this would work is that you have a file called abc which stores key value pairs for all keys that begin with abc. You could create this more efficiently if you first output each of your smaller dictionaries into a sorted key/value file and then mergesort as you write them into the database so that you write each file one at a time (rather than repeatedly opening and appending).

Oscar Benjamin
0

While the majority suggestion of "use a database" here is wise and proven, it sounds like you may want to avoid using a database for some reason (and you are finding the load into the db to be prohibitive), so essentially it seems you are IO-bound and/or processor-bound. You mention that you are loading the 86GB index from 1024 smaller indexes. If your keys are reasonably regular and evenly distributed, is it possible for you to go back to your 1024 smaller indexes and partition your dictionary instead?

In other words, if, for example, your keys are all 20 characters long and comprised of the letters a-z, create 26 smaller dictionaries: one for all keys beginning with 'a', one for keys beginning with 'b', and so on. You could extend this concept to a larger number of smaller dictionaries dedicated to the first 2 characters or more. For example, you could load one dictionary for the keys beginning 'aa', one for keys beginning 'ab', and so on, giving you 676 individual dictionaries. The same logic applies to a partition over the first 3 characters, using 17,576 smaller dictionaries.

Essentially, I guess what I'm saying here is "don't load your 86GB dictionary in the first place". Instead, use a strategy that naturally distributes your data and/or load.
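A minimal sketch of that regrouping, using two-character prefixes (the paths and names are placeholders; in practice you might stream the regrouping to avoid holding every partition in memory at once):

import pickle
from collections import defaultdict

def repartition(small_dict_paths, out_dir):
    # Route every key from the 1024 original pickles into a per-prefix dict.
    partitions = defaultdict(dict)
    for path in small_dict_paths:
        with open(path, 'rb') as f:
            for key, value in pickle.load(f).items():
                partitions[key[:2]][key] = value      # 676 partitions for a-z prefixes
    for prefix, part in partitions.items():
        with open('%s/%s.pkl' % (out_dir, prefix), 'wb') as f:
            pickle.dump(part, f)

def load_partition(out_dir, prefix):
    # A worker responsible for keys starting with `prefix` loads only this slice,
    # not the full 86GB dictionary.
    with open('%s/%s.pkl' % (out_dir, prefix), 'rb') as f:
        return pickle.load(f)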

Darren