
Is it possible to iterate over a list stored in an mmap'd file? The list is too big (over 3,000,000 items). I need fast access to it as soon as the program starts, so I can't load it into memory at startup because that takes several seconds.

import mmap

with open('list','rb') as f:
    mmapList = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) # As far as I understand, the list is now mapped into virtual memory.

Now, I want to iterate over this list.

for a in mmapList does not work.

EDIT: The only way I know is to save the list items as lines in a text file and then use readline, but I'm curious whether there is a better and faster way.

Milano
  • How was the list saved into that file? – pbkhrv Nov 12 '14 at 20:31
  • @pbkhrv It is saved using cPickle but I can change it. – Milano Nov 12 '14 at 20:39
  • An mmap object acts like a string (https://docs.python.org/2/library/mmap.html), so using cPickle to deserialize your list isn't really an option without loading the whole thing into memory first. However, it's possible to write a simple generator so that "for a in mmapList" would still work by progressively walking through the mmap using readline(). Would that solve your problem? – pbkhrv Nov 12 '14 at 20:49
  • Actually, there's a way to do this and still use pickle. See my answer. – pbkhrv Nov 12 '14 at 21:23
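
The readline()-based generator suggested in the comments might look roughly like the sketch below. It is written for Python 3 and assumes that each item is saved on its own line and that no item contains a newline. The function name `iter_mmap_lines`, the file name `list.txt` and the UTF-8 encoding are illustrative assumptions, not part of the original question.

```python
import mmap
import os
import tempfile

def iter_mmap_lines(path, encoding='utf-8'):
    """Yield one decoded line at a time from a memory-mapped text file.

    Hypothetical helper: assumes one item per line, no embedded newlines.
    """
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; an empty file cannot be mapped.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # readline() returns b'' once the end of the map is reached.
            for line in iter(mm.readline, b''):
                yield line.rstrip(b'\r\n').decode(encoding)
        finally:
            mm.close()

# Usage: write a few items, one per line, then iterate over them lazily.
path = os.path.join(tempfile.mkdtemp(), 'list.txt')
with open(path, 'w', encoding='utf-8') as out:
    out.write('first\nsecond\nthird\n')

items = list(iter_mmap_lines(path))
```

Because the generator only touches the pages it reads, the loop can start right away instead of waiting for the whole file to be loaded.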

1 Answer


You don't need mmap to iterate through the cPickled list. Instead of pickling the whole list at once, pickle and dump each element separately, then read the elements back from the file one by one (a generator works well for that).

Code:

import pickle

def unpickle_iter(f):
    # Yield objects one at a time until the file is exhausted.
    while True:
        try:
            obj = pickle.load(f)
        except EOFError:
            break
        yield obj

def save_list(items, path):
    # Pickle each element separately; pickle files should be opened in binary mode.
    with open(path, 'wb') as f:
        for i in items:
            pickle.dump(i, f)

def load_list(path):
    with open(path, 'rb') as f:
        # here is your nice "for a in mmaplist" equivalent:
        for obj in unpickle_iter(f):
            print('Loaded object: %s' % (obj,))

save_list([1,2,'hello world!', dict()], 'test-pickle.dat')
load_list('test-pickle.dat')

Output:

Loaded object: 1
Loaded object: 2
Loaded object: hello world!
Loaded object: {}
pbkhrv
  • Thank you for your advice. I tried this code and it works, but I have to go back to the text-file solution because this one takes too much time. The text-file mmap solution takes 1.9 seconds in my case and this solution takes 7.8 seconds. – Milano Nov 13 '14 at 09:09
  • Interesting, sounds like unpickling adds a bit of overhead. If performance is an issue, here is an interesting discussion of various serialization methods: http://stackoverflow.com/questions/9662757/python-performance-comparison-of-using-pickle-or-marshal-and-using-re – pbkhrv Nov 14 '14 at 04:03
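
To check the difference reported in the comments on your own data, a small timing harness along these lines could be used. It is a Python 3 sketch under stated assumptions: the items are short strings without newlines, and the helper names (`time_it`, `read_text_mmap`, `read_pickles`), the file names and the item count are made up for illustration. Timings depend on the machine and the data, so no results are shown.

```python
import mmap
import os
import pickle
import tempfile
import time

def time_it(label, func):
    """Consume the generator returned by func() and report the elapsed time."""
    start = time.perf_counter()
    count = sum(1 for _ in func())
    elapsed = time.perf_counter() - start
    print('%s: %d items in %.3f s' % (label, count, elapsed))
    return count

# Hypothetical sample data; replace with the real list.
items = ['item %d' % i for i in range(100000)]
tmp = tempfile.mkdtemp()
text_path = os.path.join(tmp, 'items.txt')
pickle_path = os.path.join(tmp, 'items.pkl')

# Save the same items in both formats.
with open(text_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(items) + '\n')
with open(pickle_path, 'wb') as f:
    for item in items:
        pickle.dump(item, f)

def read_text_mmap():
    # One item per line, read through a read-only memory map.
    with open(text_path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for line in iter(mm.readline, b''):
                yield line.rstrip(b'\r\n').decode('utf-8')
        finally:
            mm.close()

def read_pickles():
    # One pickle per item, loaded until the file is exhausted.
    with open(pickle_path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

text_count = time_it('text lines via mmap', read_text_mmap)
pickle_count = time_it('one pickle per item', read_pickles)
```

Both readers are generators, so the comparison measures the per-item cost of decoding a line versus unpickling an object rather than the cost of building a list.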