
If I have a list (or array, dictionary, ...) in Python that could exceed the available memory address space (32-bit Python), what are the options and their relative speeds (other than not making a list that large)? The list could exceed memory, but I have no way of knowing beforehand. Once it exceeds 75% of available memory, I would like to no longer keep the list in memory (or at least the new items). Is there a way to convert to a file-based approach mid-stream?

What are the best (speed in and out) file storage options?

Just need to store a simple list of numbers. No need for random Nth-element access, just append/pop-type operations.

Zahra
Vincent

9 Answers


If your "numbers" are simple-enough ones (signed or unsigned integers of up to 4 bytes each, or floats of 4 or 8 bytes each), I recommend the standard library array module as the best way to keep a few million of them in memory (the "tip" of your "virtual array") with a binary file (open for binary R/W) backing the rest of the structure on disk. array.array has very fast fromfile and tofile methods to facilitate moving the data back and forth.

I.e., basically, assuming for example unsigned-long numbers, something like:

import array
import os

# keep no more than 100 million items in memory at a time
MAXINMEM = int(1e8)

class bigarray(object):
  def __init__(self):
    # binary mode is required for array.tofile/fromfile
    self.f = open('afile.dat', 'w+b')
    self.a = array.array('L')
  def append(self, n):
    self.a.append(n)
    if len(self.a) >= MAXINMEM:
      self.a.tofile(self.f)   # spill the in-memory tip to disk
      del self.a[:]
  def pop(self):
    if not len(self.a):
      # refill the in-memory tip from the tail of the file
      ondisk = self.f.seek(0, os.SEEK_END) // self.a.itemsize
      if not ondisk:
        return self.a.pop()   # empty: raise the normal IndexError &c
      grab = min(ondisk, MAXINMEM)
      self.f.seek(-grab * self.a.itemsize, os.SEEK_END)
      self.a.fromfile(self.f, grab)
      self.f.seek(-grab * self.a.itemsize, os.SEEK_END)
      self.f.truncate()
    return self.a.pop()

Of course you can add other methods as necessary (e.g. keep track of the overall length, add extend, whatever), but if pop and append are indeed all you need this should serve.
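The tofile/fromfile bulk I/O this class relies on can be sketched in isolation; the file path and item count below are arbitrary:

```python
import array
import os
import tempfile

# Round-trip a million unsigned longs through a binary file using
# array.array's bulk I/O (the mechanism the bigarray class relies on).
nums = array.array('L', range(1_000_000))

path = os.path.join(tempfile.mkdtemp(), 'chunk.dat')
with open(path, 'wb') as f:
    nums.tofile(f)               # single bulk write

back = array.array('L')
with open(path, 'rb') as f:
    back.fromfile(f, len(nums))  # single bulk read

assert back == nums
```

Because the data moves as one contiguous block of machine values, with no per-item Python-object overhead, this is about as fast as file-backed storage gets in pure Python.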

Alex Martelli

There are probably dozens of ways to store your list data in a file instead of in memory. How you choose to do it will depend entirely on what sort of operations you need to perform on the data. Do you need random access to the Nth element? Do you need to iterate over all elements? Will you be searching for elements that match certain criteria? What form do the list elements take? Will you only be inserting at the end of the list, or also in the middle? Is there metadata you can keep in memory with the bulk of the items on disk? And so on and so on.

One possibility is to structure your data relationally, and store it in a SQLite database.
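A minimal sketch of the SQLite route for append/pop on a list of numbers, using the standard library's sqlite3 (the table name and schema are just illustrative; use a filename instead of ':memory:' to actually spill to disk):

```python
import sqlite3

# Disk-backed stack of integers in SQLite; the implicit rowid
# gives us insertion order for free.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE stack (n INTEGER)')

def push(n):
    con.execute('INSERT INTO stack (n) VALUES (?)', (n,))

def pop():
    row = con.execute(
        'SELECT rowid, n FROM stack ORDER BY rowid DESC LIMIT 1').fetchone()
    if row is None:
        raise IndexError('pop from empty stack')
    con.execute('DELETE FROM stack WHERE rowid = ?', (row[0],))
    return row[1]

for i in (10, 20, 30):
    push(i)
print(pop(), pop())   # -> 30 20
```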

Ned Batchelder
  • Just need to store simple list of numbers. no need to random Nth element access, just append and pop type operations. – Vincent Jan 01 '10 at 18:47

The answer is very much "it depends".

What are you storing in the lists? Strings? Integers? Objects?

How often is the list written to compared with being read? Are items only appended on the end, or can entries be modified or inserted in the middle?

If you are only appending to the end then writing to a flat file may be the simplest thing that could possibly work.

If you are storing objects of variable size such as strings then maybe keep an in-memory index of the start of each string, so you can read it quickly.

If you want dictionary behaviour then look at the db modules - dbm, gdbm, bsddb, etc.

If you want random access writing then maybe a SQL database may be better.

Whatever you do, going to disk is going to be orders of magnitude slower than in-memory, but without knowing how the data is going to be used it is impossible to be more specific.

Edit: From your updated requirements I would go with a flat file and keep an in-memory buffer of the last N elements.
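A rough sketch of that flat-file-plus-buffer idea, assuming fixed-width binary records via struct (the buffer size and file name here are arbitrary, and the buffer is tiny only for the demo):

```python
import os
import struct
from collections import deque

BUF = 3          # tiny buffer for the demo; use thousands in practice
FMT = '<q'       # one signed 64-bit integer per record
SIZE = struct.calcsize(FMT)

f = open('numbers.bin', 'w+b')
buf = deque()

def append(n):
    buf.append(n)
    if len(buf) > BUF:
        f.write(struct.pack(FMT, buf.popleft()))   # spill oldest to disk

def pop():
    if buf:
        return buf.pop()
    end = f.seek(0, os.SEEK_END)
    if end == 0:
        raise IndexError('pop from empty list')
    f.seek(end - SIZE)
    (n,) = struct.unpack(FMT, f.read(SIZE))
    f.truncate(end - SIZE)
    f.seek(end - SIZE)    # leave the position at the new end of file
    return n

for i in range(6):
    append(i)                        # 0, 1, 2 spill to disk
print([pop() for _ in range(6)])     # -> [5, 4, 3, 2, 1, 0]
```

The newest items stay in the deque; the oldest sit at the front of the file, so reading records back from the tail of the file preserves the stack order.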

Dave Kirby

Well, if you are looking for speed and your data is numerical in nature, you could consider using numpy and PyTables or h5py. From what I remember, the interface is not as nice as simple lists, but the scalability is fantastic!
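One minimal reading of "numpy plus a file" is numpy.memmap, which maps the array's storage onto disk directly (PyTables and h5py layer richer, chunked storage on top of this idea); the file name and sizes below are arbitrary:

```python
import numpy as np

# A disk-backed array via numpy.memmap: element access reads/writes
# pages of the underlying file instead of process memory.
mm = np.memmap('numbers.dat', dtype=np.int64, mode='w+', shape=(1000,))
mm[:] = np.arange(1000)
mm.flush()                      # push dirty pages to disk

ro = np.memmap('numbers.dat', dtype=np.int64, mode='r', shape=(1000,))
print(int(ro[999]))             # -> 999
```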

Craig McQueen
Shane Holloway

Did you check the shelve Python module, which is based on pickle?

http://docs.python.org/library/shelve.html

Luka Rahne

You can try blist: https://pypi.python.org/pypi/blist/

The blist is a drop-in replacement for the Python list that provides better performance when modifying large lists.

northtree

Modern operating systems will handle this for you without you having to worry about it. It's called virtual memory.

Bjorn
  • This is true, but it's just really slow. Plus even virtual memory is limited (if he had an incredibly big data set). – TM. Jan 01 '10 at 18:46
  • list(combinations(1140, 17)) fills available memory address space (python error), not sure what is happening in the VM. As far as I can tell it never uses it. I have more than 4GB of internal memory and it does not fill it, as 32-bit has a 4GB limit. Not sure what happens when the limit is reached other than I get an error. – Vincent Jan 01 '10 at 18:50
  • 2
    One of the things you lose (if you ever had it in the first place) with garbage-collected languages is the ability to let the virtual memory system "just handle this for you". 1/ The locality of the cons cells in your structure is not guaranteed. 2/ The GC will access them even if you don't. Virtual memory never worked great for implementing sparsely accessed data structures unless you were extremely lucky, but for an arbitrary data structure in Python, even luck will not suffice. – Pascal Cuoq Jan 01 '10 at 18:54
  • 2
    Oh, and I forgot, the question clearly mentions exhaustion of the **address space**. Virtual memory only helps with exhaustion of **physical memory** (assuming that the address space is larger than the physical memory). – Pascal Cuoq Jan 01 '10 at 18:56
  • Two comments. (0) Can you use 64-bit Python to get a truly large address space, and rely on virtual memory? (It's hard to believe you could beat virtual memory by manually paging your data to files.) (1) Do you *have* to call `list()` on `combinations()`? If you can use lazy evaluation of your iterators, can you avoid filling RAM? – steveha Jan 01 '10 at 22:08
  • @steveha I am looking into 64-bit Python. I am running on OS X 10.6, and I am not sure how to get 64-bit Python 3.1 installed. I guess I would need to build it on my system and stumble through that. – Vincent Jan 02 '10 at 01:40
  • Thanks for clearing that up! I am glad I made an incorrect comment, I learned something from it. – Bjorn Jan 02 '10 at 17:11

You might want to consider a different kind of structure: not a list, but figuring out how to do (your task) with a generator or a custom iterator.
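For the combinations example from the comments, that means consuming the iterator lazily instead of calling list() on it; only one tuple exists in memory at a time (a small range is used here so the demo terminates quickly):

```python
from itertools import combinations

# Instead of list(combinations(range(1140), 17)) -- which tries to
# materialize an astronomically large list -- consume lazily:
best = None
for combo in combinations(range(5), 3):
    s = sum(combo)
    if best is None or s > best:
        best = s
print(best)   # -> 9, from the combination (2, 3, 4)
```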

RyanWilcox

What about a document-oriented database?
There are several alternatives; I think the best-known one currently is CouchDB, but you can also go for Tokyo Cabinet or MongoDB. The last one has the advantage of Python bindings directly from the main project, without requiring any additional module.

rob