import numpy as np

arr = np.arange(0, 11)        # array([ 0,  1, ..., 10])
slice_of_arr = arr[0:6]       # the first six elements
slice_of_arr[:] = 99          # assigning through the slice also changes arr

# slice_of_arr returns
array([99, 99, 99, 99, 99, 99])
# arr returns
array([99, 99, 99, 99, 99, 99,  6,  7,  8,  9, 10])

As the example above shows, you cannot change the values of slice_of_arr without also changing arr, because the slice is a view of arr, not a new array.
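
The workaround I know is to make an explicit copy first, so that arr stays intact:

slice_copy = arr[0:6].copy()   # an independent array, not a view
slice_copy[:] = 99             # arr is unchanged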

My questions are:

  1. Why is NumPy designed like this? Wouldn't it be tedious to call .copy() every time before assigning values?
  2. Is there anything I can do to get rid of the .copy()? How can I change this default behavior of NumPy?
– ZK Zhao

  • It's designed like that for performance, and implementing e.g. copy-on-write would probably be even more confusing. Why is this design a problem to you? – Krumelur Jul 23 '15 at 07:56
  • @Krumelur Because in most non-scientific programming scenarios, that's how it works. And I think there is also no equivalent in `matlab`. I have to `remember` this behavior, and I really don't like `repetition` in programming. – ZK Zhao Jul 23 '15 at 08:02
  • OK, fair enough. I guess it would be a trade-off between the tidiness of code slicing arrays and code doing, e.g. `doSomething(arr[:6000])`. I think Matlab does CoW to solve this. – Krumelur Jul 23 '15 at 08:12
  • The answer is exactly like @Krumelur said: NumPy is optimized to perform well. Unnecessary copying takes time, and perhaps more importantly also increases data size. Increased data size results in fewer cache hits, and cache hit rates are **the** most important part of optimizing throughput on modern CPUs. If you need a copy, take a copy *explicitly*, but it's better to know you've done that than to wonder why suddenly your performance went through the floor after you added a new variable. – Borealid Jul 23 '15 at 08:18
  • @Borealid Yes, but when I'm experimenting with a small data set, this is really annoying. How can I change this default behavior? For example, I `want` cleaner code rather than performance, and I'm happy with the performance cost it may bring. Is this possible? – ZK Zhao Jul 23 '15 at 08:21
  • @cqcn1991 Might I suggest not using NumPy if you prefer cleaner code over fast performance? NumPy is designed for large-scale high performance scientific computing. – Borealid Jul 23 '15 at 08:23
  • @cqcn1991 To elaborate: just putting in a switch to change the behavior is itself a performance overhead. Checking which position that switch is in each time you do an array assignment costs runtime performance. That cost comes out of the pocket of people who are writing NumPy code mindfully. Every project chooses what target they optimize for; NumPy optimizes for uncompromising runtime performance while still using the Python language. – Borealid Jul 23 '15 at 08:26
  • Sounds like you want a plain list, and not a numpy ndarray. – juanpa.arrivillaga Jun 23 '18 at 23:58

2 Answers


I think you have the answers in the other comments, but more specifically:

1.a. Why is NumPy designed like this?
Because it's way faster (constant time) to create a view than to create a whole new array (linear time). See the timing sketch after these answers.

1.b. Wouldn't it be tedious to call .copy() every time before assigning values?
Actually it's not that common to need a copy, so no, it's not tedious. Even if it can be surprising at first, this design is very good.

2.a. Is there anything I can do to get rid of the .copy()?
I can't really tell without seeing real code. In the toy example you give you can't avoid creating a copy, but in real code you usually apply functions to the data, which return another array, so a copy isn't needed.
Can you give an example of real code where you need to call .copy repeatedly?

2.b. How can I change this default behavior of NumPy?
You can't. Try to get used to it; you'll see how powerful it is.
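
To make the constant-time vs linear-time difference in 1.a concrete, here is a rough timing sketch (the array size and iteration count are arbitrary choices):

import numpy as np
from timeit import timeit

arr = np.arange(10_000_000)

# slicing creates a view: no data is copied, so the cost is constant
print(timeit(lambda: arr[:5_000_000], number=1000))

# .copy() duplicates the data, so the cost grows with the slice size
print(timeit(lambda: arr[:5_000_000].copy(), number=1000))

For a slice this large, the view should come out several orders of magnitude faster.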

– J. Martinot-Lagarde

What does (numpy) __array_wrap__ do?

talks about ndarray subclasses and hooks like __array_wrap__. np.array takes a copy parameter, forcing the result to be a copy even if it isn't required by other considerations. ravel returns a view (when possible), flatten a copy. So it is probably possible, and maybe not too difficult, to construct an ndarray subclass that forces a copy. It may involve modifying a hook like __array_wrap__.
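
A quick way to verify these view-vs-copy behaviors is np.shares_memory (available in newer NumPy versions):

import numpy as np

a = np.arange(12).reshape(3, 4)

print(np.shares_memory(a, a.ravel()))    # True  - ravel returned a view here
print(np.shares_memory(a, a.flatten()))  # False - flatten always copies

b = np.array(a)                          # np.array copies its input by default
print(np.shares_memory(a, b))            # False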

Or maybe modifying the .__getitem__ method. Indexing as in slice_of_arr = arr[0:6] involves a call to __getitem__. For ndarray this is compiled, but for a masked array it is Python code that you could use as an example:

/usr/lib/python3/dist-packages/numpy/ma/core.py

It may be something as simple as:

def __getitem__(self, indx):
    """x.__getitem__(y) <==> x[y]
    """
    # the masked-array original makes a view:
    #   _data = ndarray.view(self, ndarray)
    # changed here to force a copy before indexing:
    _data = ndarray.view(self, ndarray).copy()
    dout = ndarray.__getitem__(_data, indx)
    return dout
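
Here is a minimal, self-contained sketch of such a subclass (the class name CopyArray is made up for illustration, and routing every indexing operation through a Python-level .copy() will cost performance):

import numpy as np

class CopyArray(np.ndarray):
    """An ndarray subclass whose slices are copies instead of views."""
    def __getitem__(self, indx):
        out = np.ndarray.__getitem__(self, indx)
        # indexing with a plain integer returns a numpy scalar, which
        # shares no memory anyway; only copy when the result is an array
        if isinstance(out, np.ndarray):
            out = out.copy()
        return out

arr = np.arange(0, 11).view(CopyArray)
slice_of_arr = arr[0:6]
slice_of_arr[:] = 99
print(arr)   # [ 0  1  2  3  4  5  6  7  8  9 10] - the original is untouched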

But I suspect that by the time you develop and fully test such a subclass, you might fall in love with the default no-copy approach. While this view-vs-copy business bites many newcomers (especially those coming from MATLAB), I haven't seen complaints from experienced users. Look at other numpy SO questions; you won't see a lot of copy() calls.

Even regular Python users are used to asking themselves whether a reference or slice is a copy or not, and whether something is mutable or not. For example, with lists:

In [754]: ll=[1,2,[3,4,5],6]
In [755]: llslice=ll[1:-1]
In [756]: llslice[1][1:2]=[10,11,12]
In [757]: ll
Out[757]: [1, 2, [3, 10, 11, 12, 5], 6]

Modifying a mutable item inside a slice modifies that same item in the original list. In contrast to numpy, a list slice is a copy, but only a shallow copy: you have to take extra effort to make a deep copy (import copy).
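
A sketch of the deep-copy version, which leaves the original list fully untouched:

import copy

ll = [1, 2, [3, 4, 5], 6]
llslice = copy.deepcopy(ll[1:-1])   # nested lists are duplicated too
llslice[1][1:2] = [10, 11, 12]
print(ll)                           # [1, 2, [3, 4, 5], 6] - unchanged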

/usr/lib/python3/dist-packages/numpy/lib/index_tricks.py contains some indexing functions aimed at making certain indexing operations more convenient. Several are actually classes, or class instances, with custom __getitem__ methods. They may also serve as models of how to customize your slicing and indexing.
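
For instance, np.s_ and np.r_ from that module are such class instances; their behavior is driven entirely by custom __getitem__ methods:

import numpy as np

print(np.s_[0:6])   # slice(0, 6, None) - builds a slice object
print(np.r_[0:6])   # array([0, 1, 2, 3, 4, 5]) - builds an array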

– hpaulj