5

Setup

  • Python 2.6
  • Ubuntu x64

I have a set of unique integers with values between 1 and 50 million. New integers are added at random, e.g. `numberset.add(random.randint(1, 50000000))`. I need to be able to quickly add new integers and quickly check whether an integer is already present.
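
For reference, the current in-memory approach looks roughly like this (a minimal sketch of what is described above):

import random

numberset = set()

n = random.randint(1, 50000000)
numberset.add(n)       # quick add
print n in numberset   # quick membership test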

Problem

After a while, the set grows too large for my low memory system and I experience MemoryErrors.

Question

How can I achieve this while using less memory? What's the fastest way to do this using the disk, without reconfiguring the system (e.g. with swap files)? Should I use a database file like SQLite? Is there a library that will compress the integers in memory?

John Machin
  • 81,303
  • 11
  • 141
  • 189
2371
  • 957
  • 1
  • 6
  • 19
  • 32/64 bit? How many in the set when it errors? – leppie Nov 09 '10 at 09:40
  • Ubuntu x86_64 GNU/Linux. ~2 million. It's lack of system memory, not a bug. – 2371 Nov 09 '10 at 09:46
  • You seem to have a **set** (`numberset.add(..)`) but you refer to a **list** twice -- *please edit your question to resolve the ambiguity*. What Python version? – John Machin Nov 09 '10 at 09:55
  • @user237165: Which part of **twice** don't you understand? – John Machin Nov 09 '10 at 10:08
  • @user237165: You have mentioned in your question that you need to check quickly whether the random number is already present. But you don't need to do that when you are using a set, because a set can only contain one copy of each value. – Emil Nov 09 '10 at 10:13
  • 5
    No need to be rude John. A "list" doesn't have to mean a datatype. Emil: The checking is a separate task. – 2371 Nov 09 '10 at 10:48

6 Answers

6

Use a bit array. This will greatly reduce the memory requirement.
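
For illustration, here is a minimal sketch using the third-party bitarray module (the sizes and variable names are assumptions for this example, not part of the original answer):

import random
from bitarray import bitarray

present = bitarray(50000001)  # one flag per value in 1..50,000,000
present.setall(False)

n = random.randint(1, 50000000)
if not present[n]:     # O(1) membership test, one bit per integer
    present[n] = True  # record the integer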

Related SO question:

Emil
  • 13,577
  • 18
  • 69
  • 108
  • This should be the accepted answer. I highly recommend going with the bitarray module instead of cooking your own. It solves OP's problem perfectly, is tried and tested, and is very memory efficient. Creating a bitarray for 1B integers is almost free until you start writing to it, and even then it only uses the bare minimum. In a context very similar to OP's, it reduced my program's memory usage by 90MB (compared to a standard `set()`). – Zilk Jul 07 '23 at 22:37
5

You can avoid dependencies on 3rd-party bit-array modules by writing your own -- the functionality required is rather minimal:

import array

# Number of bits in one array item ('I' = unsigned int, typically 32 bits).
BITS_PER_ITEM = array.array('I').itemsize * 8

def make_bit_array(num_bits, initially=0):
    # Round up so the array holds at least num_bits bits.
    num_items = (num_bits + BITS_PER_ITEM - 1) // BITS_PER_ITEM
    return array.array('I', [initially]) * num_items

def set_bit(bit_array, offset):
    item_index = offset // BITS_PER_ITEM
    bit_index = offset % BITS_PER_ITEM
    bit_array[item_index] |= 1 << bit_index

def clear_bit(bit_array, offset):
    item_index = offset // BITS_PER_ITEM
    bit_index = offset % BITS_PER_ITEM
    bit_array[item_index] &= ~(1 << bit_index)

def get_bit(bit_array, offset):
    item_index = offset // BITS_PER_ITEM
    bit_index = offset % BITS_PER_ITEM
    return (bit_array[item_index] >> bit_index) & 1
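
Usage might look like this (a hypothetical snippet, not part of the original answer):

bits = make_bit_array(50000000 + 1)  # covers offsets 0..50,000,000
set_bit(bits, 12345)
print get_bit(bits, 12345)  # 1
print get_bit(bits, 12346)  # 0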
John Machin
  • 81,303
  • 11
  • 141
  • 189
  • Thanks, I wanted to see a solution that used the disk so the maximum number of items is much greater and I can probably extend this to achieve that. – 2371 Nov 09 '10 at 12:13
  • 1
    You could also avoid importing `array` by using the built-in `bytearray` (Python 2.6+), which is a more natural fit to the problem; see the sketch after these comments. Also might I suggest `item_index, bit_index = divmod(offset, BITSPERITEM)` – Scott Griffiths Nov 09 '10 at 12:39
  • Nitpick: `BITS_PER_ITEM` would be better. – FogleBird Nov 09 '10 at 14:20
  • @Scott Griffiths: Thanks for the mention of `bytearray`; I support a module that will run on 2.1 to 2.7 and tend to go for solutions that people stuck on antique Pythons can use. My solution as it stands will work on 2.2+. Using `divmod` would remove the dependency on the `//` operator and extend its usability back to 1.5 at least. However as the OP noted, `divmod` is slower (involves a function call) and I was never brave enough to suggest that Python would benefit from `/%` and `//%` operators :-) – John Machin Nov 09 '10 at 20:01
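
A minimal sketch of the `bytearray` variant suggested in the comments above (the helper names mirror the answer's; one bit per value, `divmod` for the index split):

num_bits = 50000000 + 1
bits = bytearray((num_bits + 7) // 8)  # built-in type, no imports needed

def set_bit(bits, offset):
    byte_index, bit_index = divmod(offset, 8)
    bits[byte_index] |= 1 << bit_index

def get_bit(bits, offset):
    byte_index, bit_index = divmod(offset, 8)
    return (bits[byte_index] >> bit_index) & 1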
2

Use an array of bits as flags for each integer - the memory needed will be only 50 million bits (about 6 MB). There are a few modules that can help. This example uses bitstring; another option is bitarray:

import random
from bitstring import BitArray

i = BitArray(50000000)  # initialise 50 million zero bits
for x in xrange(100):
    v = random.randint(1, 50000000)
    if not i[v]:     # test whether it's already present
        i.set(1, v)  # set a single bit

Setting and checking bits is very fast and it uses very little memory.

Scott Griffiths
  • 21,438
  • 8
  • 55
  • 85
1

Try the `array` module.
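
For instance, one way to use it (a sketch assuming a sorted array plus bisect; inserts are O(n), but storage drops to roughly 4 bytes per integer):

import array
import bisect
import random

numbers = array.array('I')  # unsigned 32-bit integers, tightly packed

def contains(a, n):
    i = bisect.bisect_left(a, n)  # binary search: O(log n)
    return i < len(a) and a[i] == n

def add(a, n):
    if not contains(a, n):   # keep values unique
        bisect.insort(a, n)  # keep the array sorted

add(numbers, random.randint(1, 50000000))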

Abyx
  • 12,345
  • 5
  • 44
  • 76
  • Thanks, with this type I can have 10x as many items but it's not a perfect solution. – 2371 Nov 09 '10 at 09:58
1

Depending on your requirements, you might also consider a Bloom filter. It is a memory-efficient data structure for testing whether an element is in a set. The catch is that it can give false positives, though it will never give false negatives.
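
For illustration, a minimal Bloom filter sketch (the class name, sizes and salted-hash scheme are assumptions for this example; in practice a tested library is preferable):

import random

class BloomFilter(object):
    def __init__(self, num_bits=8 * 1000 * 1000, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting Python's built-in hash.
        for salt in xrange(self.num_hashes):
            yield hash((salt, item)) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report a false positive, never a false negative.
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add(random.randint(1, 50000000))
print 42 in bf  # almost certainly False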

Alec Thomas
  • 19,639
  • 4
  • 30
  • 24
0

If the integers are unique then use bits. Example: binary 01011111 means that 1, 3, 4, 5, 6 and 7 are present (reading bit positions left to right, starting from index 0). This way every bit is used to check whether its integer index is present (value 1) or not (value 0).
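
To demonstrate (a small snippet written for this example, reading bit positions from the left, starting at index 0):

byte = 0b01011111
present = [i for i in xrange(8) if byte & (1 << (7 - i))]
print present  # [1, 3, 4, 5, 6, 7]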

It was described in one chapter of "Programming Pearls" by Jon Bentley (look for "The file contains at most ten million records; each record is a seven-digit integer.")

It seems that the bitarray module mentioned by Emil works this way.

Michał Niklas
  • 53,067
  • 18
  • 70
  • 114