
I need to store a large list of numbers in memory and will then need to check for membership. Arrays are more memory efficient than lists; sets are faster than lists for membership checking. I need both! So my questions are:

1) How much more memory efficient are arrays than sets? (For the converse, see my results below.)

2) Is there a data structure that strikes a better balance between sets and arrays? Something like a set with a signed integer type? Or some numpy construct?

I checked out the membership timing difference with the script below. (I know `timeit` is better, but the variance is low enough for `time` to be fine):

import array
import time


class TimerContext:
    """Print the wall-clock time taken by the `with` block."""

    def __enter__(self):
        self.t0 = time.time()

    def __exit__(self, *args, **kwargs):
        print(time.time() - self.t0)


SIZE = 1000000

l = list(range(SIZE))
a = array.array('I', l)  # 'I' = unsigned int, typically 4 bytes per item
s = set(l)

print(type(l))
print(type(a))
print(type(s))

with TimerContext():
    x = 99999 in l
with TimerContext():
    x = 99999 in a
with TimerContext():
    x = 99999 in s

Results:

<class 'list'>
<class 'array.array'>
<class 'set'>
0.0012176036834716797
0.0024595260620117188
1.430511474609375e-06

So sets are a LOT faster for membership checking (note the scientific notation). If their memory footprint isn't that different from the array's, I'll prefer to use a set. But I don't know how to check the memory footprint.

I should also add that there are a lot of questions comparing sets and lists, but I didn't see any good answers comparing arrays and sets.
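Edit: as suggested in the comments, `timeit` is the more robust way to time this, since it handles looping and repetition itself. A minimal sketch of the equivalent measurement (the numbers are illustrative):

import timeit

# timeit runs the statement `number` times in the namespace built by `setup`
# and returns the total seconds, so divide by `number` for the per-check cost
total = timeit.timeit("99999 in s", setup="s = set(range(1000000))", number=1000)
print(total / 1000)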

Neil
  • There may be many other, better solutions if we knew your real problem. – Daniel Apr 19 '19 at 18:40
  • `sys.getsizeof` showed that for your example the set is ~8 times larger than the array. https://docs.python.org/3/library/sys.html#sys.getsizeof – CristiFati Apr 19 '19 at 18:43
  • list and array membership tests are both O(n), sets are O(1)... there are also approximate algorithms for this that might help (e.g. Bloom filters; see the sketch after these comments). if you have any other constraints that you can put on the problem you might have more options – Sam Mason Apr 19 '19 at 18:58
  • google suggests that https://pypi.org/project/intset/ might help... – Sam Mason Apr 19 '19 at 19:00
  • also Python comes with a `timeit` module that does similar things to your `TimerContext`, Jupyter/IPython wraps this with a nice [`%timeit` magic](https://stackoverflow.com/q/29280470/1358308) – Sam Mason Apr 19 '19 at 19:03
  • @SamMason, Thank you for the intset suggestion. It actually looks like exactly what I want. I'm going to try it out now. – Neil Apr 19 '19 at 19:12
  • @CristiFati Thanks also for getsizeof. Much easier than I thought. – Neil Apr 19 '19 at 19:12
  • @Neil: you're welcome! :) – CristiFati Apr 19 '19 at 19:13
  • Re questions on more context, don't think it'll make much difference, but I'm essentially checking how far out of sync two databases are based on a symmetric difference of collections of unique identifiers. It's in a low memory environment, and the size of the dbs will grow indefinitely. – Neil Apr 19 '19 at 19:13
  • When you say that you need to store a large list of numbers in memory, does it have to be a Python `list`? If not, what about building a binary tree, so membership checks are much faster than iterating over the list? Besides, a `set` lookup is `O(1)` while a `list` lookup is `O(n)`. – Chiheb Nexus Apr 19 '19 at 19:16
  • Are you sure this isn't a use case for sqldiff from SQLite, or code that looks like it? – West Apr 21 '19 at 05:43
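To make the Bloom filter suggestion from the comments concrete, here is a minimal pure-Python sketch. The bit-array size, hash count, and SHA-256-based indexing are illustrative assumptions, not a tuned design:

import hashlib


class BloomFilter:
    """Probabilistic membership: no false negatives, small false-positive rate."""

    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)  # 1 MB of bits for the defaults

    def _positions(self, item):
        # Derive `num_hashes` bit positions from one SHA-256 digest (illustrative scheme)
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = BloomFilter()
for i in range(1000000):  # pure Python, so filling it is slow
    bf.add(i)
print(99999 in bf)    # True
print(2000000 in bf)  # False, barring a small false-positive chance

The trade-off: a fixed memory budget (1 MB here for 1M items, versus the ~32 MB the set takes below) and O(1) lookups, but "probably present" answers can occasionally be wrong.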

1 Answer


If keeping your data sorted is an option in your case (`bisect` requires a sorted sequence), `bisect` performance comes close to `set` for membership checks, with both `list` and `array`. See the results below.

import array
from bisect import bisect
import sys
import time


class TimerContext:
    """Print the wall-clock time taken by the `with` block."""

    def __enter__(self):
        self.t0 = time.time()

    def __exit__(self, *args, **kwargs):
        print(time.time() - self.t0)


def get_size_in_megabytes(iterable):
    # getsizeof counts only the container itself, not the objects it refers to
    return round(sys.getsizeof(iterable) / (1024 ** 2), 2)


SIZE = 1000000

l = list(range(SIZE))
a = array.array("I", l)
s = set(l)

print(type(l), get_size_in_megabytes(l))
print(type(a), get_size_in_megabytes(a))
print(type(s), get_size_in_megabytes(s))

with TimerContext():
    x = 99999 in l
with TimerContext():
    x = 99999 in a
with TimerContext():
    x = 99999 in s

print("list bisect")
with TimerContext():
    bisect(l, 99999)  # returns an insertion index; timing only, not a membership test

print("array bisect")
with TimerContext():
    bisect(a, 99999)

Results:

<class 'list'> 8.58
<class 'array.array'> 3.81
<class 'set'> 32.0
0.0024390220642089844
0.0053005218505859375
3.814697265625e-06
list bisect
9.298324584960938e-06
array bisect
6.198883056640625e-06

Credit for the `sys.getsizeof` usage goes to @CristiFati.
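One caveat: `bisect` by itself only returns an insertion index, so the timings above measure the search but not a complete membership test. A minimal sketch of an actual membership check with `bisect_left`, continuing from the script above (`sorted_contains` is a hypothetical helper name; the sequence must be sorted):

from bisect import bisect_left


def sorted_contains(seq, x):
    i = bisect_left(seq, x)  # leftmost index where x could be inserted
    return i < len(seq) and seq[i] == x


print(sorted_contains(l, 99999))    # True
print(sorted_contains(l, SIZE + 1)) # False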

Dušan Maďar
  • `bisect` is O(log n)... also, `getsizeof` doesn't count the size of any referred elements, just the object itself... there are recipes that show how to recurse into objects. an `int` is 28 bytes, so a list of 1M elements should be ~35MB – Sam Mason Apr 19 '19 at 19:29