
I've been working on a project that involves loading a relatively large dictionary into memory from a file. The dictionary has just under 2 million entries, and each entry (key and value combined) is under 20 bytes. The file is 38 MB on disk.

My problem is that when I try to load the dictionary, my program's memory usage immediately jumps to over 2.5 GB.

Here is the code I use to read the dictionary in from disk:

f = open('someFile.txt', 'r')
rT = eval(f.read())
f.close()
dckrooney

2 Answers


I think the memory is being used to build the AST while parsing the dictionary literal.

For this kind of use it's much better to go with the cPickle module instead of repr/eval.

import cPickle

# Build a sample dictionary with a million small entries.
x = {}
for i in xrange(1000000):
    x["k%i" % i] = "v%i" % i

# Dump with the highest pickle protocol (-1): a compact binary format.
with open("data", "wb") as f:
    cPickle.dump(x, f, -1)

# Load it back; this parses the binary format instead of Python source.
with open("data", "rb") as f:
    x = cPickle.load(f)

Passing -1 when dumping means using the latest pickle protocol, which is more efficient but possibly not backward compatible with older Python versions. Whether this is a good idea depends on why you need to dump/load.
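As a rough illustration (a sketch assuming Python 2, where the highest pickle protocol is 2, and hypothetical output file names), any negative protocol number selects cPickle.HIGHEST_PROTOCOL, and the binary protocol typically produces a smaller, faster-to-parse file than the ASCII protocol 0:

import os
import cPickle

# A small sample dictionary, just for comparing the two protocols.
data = dict(("k%i" % i, "v%i" % i) for i in xrange(1000))

with open("data_p0", "wb") as f:
    cPickle.dump(data, f, 0)    # protocol 0: ASCII, readable by very old Pythons

with open("data_p2", "wb") as f:
    cPickle.dump(data, f, -1)   # same as cPickle.HIGHEST_PROTOCOL (2 here): compact binary

print os.path.getsize("data_p0"), os.path.getsize("data_p2")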

6502
  • You might also want to use the json module (see the sketch after these comments). – Winston Ewert May 07 '11 at 21:50
  • Shelve is a good alternative too. It's designed for huge dictionaries which may be partially stored on disk. – Nathan May 08 '11 at 00:20
  • Thanks! I haven't had a chance to implement this yet, but I read up on pickle a little bit; it seems like that should fix the problem. – dckrooney May 08 '11 at 17:53
  • I ended up using cPickle, which worked perfectly... Memory footprint is down to a more reasonable level, and the dictionary loads MUCH faster. Thanks! – dckrooney May 09 '11 at 02:54
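Following up on the json and shelve suggestions in the comments above, here is a minimal sketch (Python 2 assumed, file names hypothetical): json keeps the data as human-readable text but still loads everything into memory at once, while shelve keeps the dictionary on disk and fetches entries on demand.

import json
import shelve

d = dict(("k%i" % i, "v%i" % i) for i in xrange(1000))

# json: plain-text file, parsed back into an ordinary in-memory dict on load.
with open("someFile.json", "w") as f:
    json.dump(d, f)
with open("someFile.json") as f:
    d2 = json.load(f)

# shelve: disk-backed dictionary; values are read from disk as they are accessed.
db = shelve.open("someFile.db")
db.update(d)      # keys must be strings
print db["k42"]   # fetched from the file, not from an in-memory dict
db.close()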

This may be a bit off-topic, but using generator expressions can also help tremendously when working with big files or streams of data.

This discussion explains it very well, and this presentation changed the way I wrote my programs.
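For example, here is a minimal sketch (with a hypothetical big.txt file): a generator expression feeds lines to sum() one at a time, so the whole file never has to be held in memory at once.

# Sum the line lengths of a large file without building an intermediate list.
with open("big.txt") as f:
    total = sum(len(line) for line in f)  # one line in memory at a time
print total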

Morten Jensen