4

I have a situation in one of my projects that I can either use lists or dictionaries and I am having hard time picking which one to use.

I am analyzing large number of items (>400k). And I will have (>400k) list or dictionaries which I will use very frequently. (Get/Set/Update)

In my particular situation, using a dictionary feels like more convenient than list if I wouldn't think about performance at all. However, I know I can manage writing the same thing using lists.

Should I go for readibility and use dictionaries or going with dictionary may add too much of a overhead that will dramatically decrease my performance from the perspective of both memory and time.

I know this question is a bit too-broad. But I wanted to ask it before I start building all my logic after having this decision done.

My situation in a nutshell:

I have values for keys 0,1,...,n. For now, keys will be always integers from 0 to n which I can keep in a list.

However, I can think of some situations that might arise in future that I will need to keep some items for keys which are not integers. Or integers which are not consecutive.

So, the question is if using dictionaries instead of lists in the first place wouldn't add much of a memory/time cost, I will go with dictionaries in the first place. However, I am not sure having >400k dictionaries vs. having >400k lists make big of a difference in terms of performance.

Sait
  • 19,045
  • 18
  • 72
  • 99
  • 1
    Can you describe what you are trying to do? Usually you would choose between lists and dictionaries based on how you are going to *use* the object. – BrenBarn Jul 01 '15 at 05:18
  • The list if you need order and are mostly going to iterate through the items. The dictionary if you need random access and are mostly going to get/update items. – Noufal Ibrahim Jul 01 '15 at 05:18
  • 1
    You can also use an OrderedDict, which has the same performance as a dictionary, so the "ordering" element isn't exactly true as of Python 2.7 or 3.x. In short, it's more important to use a dict if you need key/value pairs, while lists are great if the value is the actual entry. – Alex Huszagh Jul 01 '15 at 05:19
  • 2
    Doesn't really matter. If a dict makes it more readable, you should do that - especially since you know you are going to need other keys. 400k dict is not a problem on a computer with a reasonable amount of RAM. If you make the wrong choice it's really not usually too much work to switch data structures. – John La Rooy Jul 01 '15 at 05:22
  • 1
    It all depends on what you do with these data. Only you know that. – David Heffernan Jul 01 '15 at 05:27

4 Answers4

12

In direct answer to your question: dictionaries have significantly more overhead than lists:

  1. Each item consumes memory for both key and value, in contrast to only values for lists.
  2. Adding or removing an item requires consulting a hash table.

Despite the fact that Python dictionaries are extremely well-designed and surprisingly fast, if you have an algorithm that can use direct index, you will save space and time.

However, from the sound of your question and subsequent discusion, it sounds like your needs may change over time and you have some uncertainty ("However, I can think of some situations that might arise in future that I will need to keep some items for keys which are not integers")

If this is the case, I suggest creating a hybrid data structure of your own so that as your needs evolve you can address the efficiency of storage in an isolated place while allowing your application to use simple, readable code to store and retrieve objects.

For example, here is a Python3 class called maybelist that is derived from a list, but detects the presence of non-numeric keys, storing exceptions in a dictionary while providing mappings for some common list operations:

class maybelist(list):

    def __init__(self, *args):
        super().__init__(*args)
        self._extras = dict()

    def __setitem__(self, index, val):
        try:
            super().__setitem__(index, val)
            return
        except TypeError:
            # Index is not an integer, store in dict
            self._extras[index] = val
            return
        except IndexError:
            pass
        distance = index - len(self)
        if distance > 0:
            # Put 'None' in empty slots if need be
            self.extend((None,) * distance)
        self.append(val)

    def __getitem__(self, index):
        try:
            return super().__getitem__(index)
        except TypeError:
            return self._extras[index]

    def __str__(self):
        return str([item for item in self])

    def __len__(self):
        return super().__len__() + len(self._extras)

    def __iter__(self):
        for item in itertools.chain(super().__iter__(), self._extras):
            yield item

So, you could treat it like an array, and have it auto expand:

>>> x = maybelist()
>>> x[0] = 'first'
>>> x[1] = 'second'
>>> x[10] = 'eleventh'
>>> print(x)
['first', 'second', None, None, None, None, None, None, None, None, 'eleventh']
>>> print(x[10])
eleventh

Or you could add items with non-numeric keys if they were present:

>>> x['unexpected'] = 'something else'
>>> print(x['unexpected'])
something else

And yet have the object appear to behave properly if you access it using iterators or other methods of your choosing:

>>> print(x)
['first', 'second', None, None, None, None, None, None, None, None, 'eleventh', 'unexpected']
>>> print(len(x))
12

This is just an example, and you would need to tailor such a class to meet the needs of your application. For example, the resulting object does not strictly behave like a list (x[len(x)-1] is not the last item, for example). However, your application may not need such strict adherence, and if you are careful and plan properly, you can create an object which both provides highly optimized storage while leaving room for evolving data structure needs in the future.

Gary Wisniewski
  • 1,080
  • 10
  • 9
3

dict uses a lot more memory that a list. Probably not enough to be a concern if the computer isn't very busy. There are exceptions of course - if it's a web server with 100 connections per second, you may want to consider saving memory at the expense of readability

>>> L = range(400000)
>>> sys.getsizeof(L)
3200072   # ~3 Megabytes
>>> D = dict(zip(range(400000), range(400000)))
>>> sys.getsizeof(D)
25166104  # ~25 Megabytes
John La Rooy
  • 295,403
  • 53
  • 369
  • 502
  • You aren't accounting for the fact that the asker might want to perform lookup and read from these data – David Heffernan Jul 01 '15 at 05:32
  • 1
    Thanks. I wasn't aware of `sys.getsizeof()` method before. This answer gives me *some* imagination in terms of memory. – Sait Jul 01 '15 at 05:33
  • @DavidHeffernan, both have `O(1)` lookup – John La Rooy Jul 01 '15 at 05:35
  • Yes. You didn't mention that. Nor did you mention that list lookup is a lot faster. – David Heffernan Jul 01 '15 at 05:36
  • @DavidHeffernan, not so much once you have to add special cases for the non integer "keys" – John La Rooy Jul 01 '15 at 05:38
  • A list doesn't have non-integer keys. It has integer keys that are zero based and contiguous. – David Heffernan Jul 01 '15 at 05:39
  • @DavidHeffernan, OP says there may be situations where non-integer keys are required which means it's a bit of a hack to add that ability if you've chosen to go with a list for the rest – John La Rooy Jul 01 '15 at 05:43
  • If we can't use lists then why are you reporting on their memory usage? You seem to want to have it both ways. You seem to have neglected computational efficiency entirely and only considered memory usage. – David Heffernan Jul 01 '15 at 05:45
  • As @DavidHeffernan says, the OP also say that he would need to do operations very frequently. A dictionary will hash its keys so everything is likely to be O(1). Lists not so much. And if the OP hacks list for non integer or non contiguous keys, it will most likely be sub optimal? For example, an optimal incest on a list is O(n). – xyz Jul 01 '15 at 08:54
1

Not exactly the spot on answer for your not so clear question, but here are my thoughts:

You said

I am analyzing large number of items (>400k)

In that case, I'd advise you to use generators and/or process your date in chunks.

Better option would be to put your data, which are key-value pairs, in Redis and take out chunks of it at a time. Redis can handle your volume of data very easily.

You could write a script that processes one chunk at a time, and using the asyncio module, you could parallelize the chunk processing.

Something like this:

    from concurrent import futures

    def chunk_processor(data):
        """
        Process your list data here
        """
        pass

    def parallelizer(map_func, your_data_list, n_workers=3):
        with futures.ThreadPoolExecutor(max_workers=n_workers) as executor:
            for result in executor.map(map_func, your_data_list):
                  # Do whatever with your result

    # Do the take out chunks of your data from Redis here
    chunk_of_list = get_next_chunk_from_redis()

    # Your processing starts here
    parallelizer(chunk_processor, your_data_list)

Again, something better could be done, but I'm presenting you one of the ways to go about it.

bad_keypoints
  • 1,382
  • 2
  • 23
  • 45
1

Lists are what they seem - a list of values, but in a dictionary, you have an 'index' of words, and for each of them a definition.

Dictionaries are the same, but the properties of a dict are different than lists because they work with mapping keys to values. That means you use a dictionary when:

  • You have to retrieve things based on some identifier, like names, addresses, or anything that can be a key.
  • You don't need things to be in order. Dictionaries do not normally have any notion of order, so you have to use a list for that.
  • You are going to be adding and removing elements and their keys.

Efficiency constrains are discussed at Stack posts Link1 & Link2.

Go for a dictionary as you have doubts regarding future values also there is no memory constrains to bother

Reference

Community
  • 1
  • 1
Tharif
  • 13,794
  • 9
  • 55
  • 77