So I frequently write code following a pattern like this:
_list = list(range(10))  # or whatever
_list = [some_function(x) for x in _list]
_list = [some_other_function(x) for x in _list]
# etc.
I recently saw a comment on a different question explaining that this approach creates a new list each time, and that it is better to mutate the existing list, like so:
_list[:] = [some_function(x) for x in _list]
It's the first time I've seen this explicit recommendation and I'm wondering what the implications are:
Does the mutation save memory? Presumably the reference count of the "old" list drops to zero after re-assignment and the "old" list is discarded, but is there a delay before that happens, during which I am potentially using more memory than I need to when I re-assign instead of mutating the list?
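To make sure I understand the reference side of this, here is a tiny sketch of what I mean (the names are purely illustrative):

import sys

_list = list(range(10))
alias = _list                      # a second reference to the same list object
print(sys.getrefcount(_list))      # typically 3: _list, alias, and getrefcount's own argument

_list = [x * 10 for x in _list]    # re-binds the name; the old list now survives only via `alias`
del alias                          # once the last reference is gone, CPython's reference counting
                                   # frees the old list straight away (no cycle, so no GC delay)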
Is there a computational cost to using mutation? I suspect changing something in place is more expensive than creating a new list and simply dropping the old one.
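A rough way one could measure this with timeit (just a sketch, with x + 1 standing in for some_function):

import timeit

setup = "_list = list(range(10_000))"

# Re-assignment: the name is re-bound to a brand-new list each run.
reassign = "_list = [x + 1 for x in _list]"

# Mutation: a new list is still built on the right, then copied into
# the existing list object via slice assignment.
mutate = "_list[:] = [x + 1 for x in _list]"

print("reassign:", timeit.timeit(reassign, setup=setup, number=1_000))
print("mutate:  ", timeit.timeit(mutate, setup=setup, number=1_000))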
In terms of safety, I wrote a script to test this:
def some_function(number: int):
    return number * 10


def main():
    _list1 = list(range(10))
    _list2 = list(range(10))

    a = _list1
    b = _list2

    _list1 = [some_function(x) for x in _list1]
    _list2[:] = [some_function(x) for x in _list2]

    print(f"list a: {a}")
    print(f"list b: {b}")


if __name__ == "__main__":
    main()
Which outputs:
list a: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
list b: [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
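As far as I can tell, the reason is that after re-assignment a and _list1 refer to different objects, while b and _list2 still share one object. A minimal way to check that (a sketch):

original = list(range(10))
alias = original
original = [x * 10 for x in original]   # re-assignment binds `original` to a new list
print(alias is original)                # False: `alias` still refers to the old object

original2 = list(range(10))
alias2 = original2
original2[:] = [x * 10 for x in original2]   # slice assignment mutates in place
print(alias2 is original2)                   # True: both names still share one object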
So mutation does seem to have the drawback of being more likely to cause side effects, although those might sometimes be desirable. Are there any PEPs that discuss this safety aspect, or other best-practice guides?
Thank you.
EDIT: Conflicting answers, and more tests on memory. I have received two conflicting answers so far. In the comments, jasonharper has written that the right-hand side of an assignment does not know about the left-hand side, and therefore memory usage cannot possibly be affected by what appears on the left. However, in the answers, Masoud has written that "when [reassignment] is used, two new and old _lists with two different identities and values are created. Afterward, old _list is garbage collected. But when a container is mutated, every single value is retrieved, changed in CPU and updated one-by-one. So the list is not duplicated." This seems to indicate that there is a big memory cost to doing reassignment.
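One thing both claims have to account for, as far as I can tell, is that the list comprehension on the right-hand side is evaluated to a complete new list before anything is assigned, regardless of whether the target is a name or a slice. A quick, illustrative way to see that is to disassemble both statements:

import dis

# In both cases the comprehension builds a full new list first; only the
# final instruction differs (a plain name store vs. a slice/subscript
# store, depending on the Python version).
dis.dis("_list = [x * 10 for x in _list]")
dis.dis("_list[:] = [x * 10 for x in _list]")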
I decided to try using memory-profiler to dig deeper. Here is the test script:
from memory_profiler import profile


def normalise_number(number: int):
    return number % 1000


def change_to_string(number: int):
    return "Number as a string: " + str(number) + "something" * number


def average_word_length(string: str):
    return len(string) / len(string.split())


@profile(precision=8)
def mutate_list(_list):
    _list[:] = [normalise_number(x) for x in _list]
    _list[:] = [change_to_string(x) for x in _list]
    _list[:] = [average_word_length(x) for x in _list]


@profile(precision=8)
def replace_list(_list):
    _list = [normalise_number(x) for x in _list]
    _list = [change_to_string(x) for x in _list]
    _list = [average_word_length(x) for x in _list]
    return _list


def main():
    _list1 = list(range(1000))
    mutate_list(_list1)

    _list2 = list(range(1000))
    _list2 = replace_list(_list2)


if __name__ == "__main__":
    main()
Please note that I am aware that, e.g., the average word length function isn't particularly well written; it is just there for testing's sake.
Here are the results:
Line # Mem usage Increment Line Contents
================================================
16 32.17968750 MiB 32.17968750 MiB @profile(precision=8)
17 def mutate_list(_list):
18 32.17968750 MiB 0.00000000 MiB _list[:] = [normalise_number(x) for x in _list]
19 39.01953125 MiB 0.25781250 MiB _list[:] = [change_to_string(x) for x in _list]
20 39.01953125 MiB 0.00000000 MiB _list[:] = [average_word_length(x) for x in _list]
Filename: temp2.py
Line # Mem usage Increment Line Contents
================================================
23 32.42187500 MiB 32.42187500 MiB @profile(precision=8)
24 def replace_list(_list):
25 32.42187500 MiB 0.00000000 MiB _list = [normalise_number(x) for x in _list]
26 39.11328125 MiB 0.25781250 MiB _list = [change_to_string(x) for x in _list]
27 39.11328125 MiB 0.00000000 MiB _list = [average_word_length(x) for x in _list]
28 32.46484375 MiB 0.00000000 MiB return _list
What I found is that even if I increase the list size to 100,000, reassignment consistently uses more memory, but only by around 1%. This makes me think that the additional memory cost is probably just an extra pointer somewhere, not the cost of an entire list.
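That would be consistent with the fact that a list object itself only holds pointers to its elements, so the list structure is cheap compared with the element objects. A rough way to see the relative sizes (a sketch reusing normalise_number and change_to_string from the script above):

import sys

strings = [change_to_string(normalise_number(x)) for x in range(1000)]

# Size of the list object itself: roughly 1000 pointers plus a small header.
print(sys.getsizeof(strings))

# Combined size of the string objects the list points to -- far larger.
print(sum(sys.getsizeof(s) for s in strings))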
To further test the hypothesis, I performed time-based profiling at intervals of 0.00001 seconds and graphed the results. I wanted to see whether there was perhaps a momentary spike in memory usage that disappeared instantly due to garbage collection (reference counting), but I have not found such a spike.
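Since a sampling profiler can miss a spike that only lasts a fraction of a millisecond, a follow-up I could try is tracemalloc, which records the true peak of Python-level allocations. A sketch, reusing mutate_list and replace_list from the script above:

import tracemalloc

def peak_allocation(func, size=100_000):
    data = list(range(size))
    tracemalloc.start()
    func(data)
    _, peak = tracemalloc.get_traced_memory()   # returns (current, peak) in bytes
    tracemalloc.stop()
    return peak

print("mutate_list peak: ", peak_allocation(mutate_list))
print("replace_list peak:", peak_allocation(replace_list))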
Can anyone explain these results? What exactly is happening under the hood here that causes this slight but consistent increase in memory usage?