Either ndarray.reshape or numpy.newaxis can be used to add a new dimension to an array. Both seem to create a view. Is there any reason or advantage to use one instead of the other?

>>> b
array([ 1.,  1.,  1.,  1.])
>>> c = b.reshape((1,4))
>>> c *= 2
>>> c
array([[ 2.,  2.,  2.,  2.]])
>>> c.shape
(1, 4)
>>> b
array([ 2.,  2.,  2.,  2.])
>>> d = b[np.newaxis,...]
>>> d
array([[ 2.,  2.,  2.,  2.]])
>>> d.shape
(1, 4)
>>> d *= 2
>>> b
array([ 4.,  4.,  4.,  4.])
>>> c
array([[ 4.,  4.,  4.,  4.]])
>>> d
array([[ 4.,  4.,  4.,  4.]])

wwii

2 Answers


One reason to use numpy.newaxis over ndarray.reshape is when you have more than one "unknown" dimension to operate with. So, for example, for the following array:

>>> arr.shape
(10, 5)

This works:

>>> arr[:, np.newaxis, :].shape
(10, 1, 5)

But this does not:

>>> arr.reshape(-1, 1, -1)
...
ValueError: can only specify one unknown dimension
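
A minimal sketch of the same point, assuming a hypothetical `arr` built with `np.arange`. With `reshape`, all lengths except one have to be spelled out; `np.expand_dims` is shown only as an equivalent way to insert the axis by position.

```python
import numpy as np

arr = np.arange(50.0).reshape(10, 5)  # hypothetical stand-in for the (10, 5) array

# newaxis needs no knowledge of the existing lengths
a = arr[:, np.newaxis, :]

# reshape accepts only one -1, so the other length is given explicitly
b = arr.reshape(-1, 1, arr.shape[1])

# expand_dims inserts the new axis at the given position
c = np.expand_dims(arr, axis=1)

# two unknown dimensions are rejected by reshape
try:
    arr.reshape(-1, 1, -1)
    two_unknowns_ok = True
except ValueError:
    two_unknowns_ok = False

print(a.shape, b.shape, c.shape, two_unknowns_ok)
```

All three results have the same shape; the difference is only in how much of the original shape you have to know up front.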
Rafael Martins

I don't see evidence of much difference. You could do a time test on very large arrays. Basically both fiddle with the shape, and possibly the strides. __array_interface__ is a nice way of accessing this information. For example:

In [94]: b.__array_interface__
Out[94]: 
{'data': (162400368, False),
 'descr': [('', '<f8')],
 'shape': (5,),
 'strides': None,
 'typestr': '<f8',
 'version': 3}

In [95]: b[None,:].__array_interface__
Out[95]: 
{'data': (162400368, False),
 'descr': [('', '<f8')],
 'shape': (1, 5),
 'strides': (0, 8),
 'typestr': '<f8',
 'version': 3}

In [96]: b.reshape(1,5).__array_interface__
Out[96]: 
{'data': (162400368, False),
 'descr': [('', '<f8')],
 'shape': (1, 5),
 'strides': None,
 'typestr': '<f8',
 'version': 3}

Both create a view, using the same data buffer as the original. Same shape, but reshape doesn't change the strides. reshape lets you specify the order.

And .flags shows differences in the C_CONTIGUOUS flag.
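
Here is a rough sketch for checking these points on your own installation, assuming a 5-element `b` like the one above. The strides and the C_CONTIGUOUS flag are printed rather than asserted, since what they report can depend on the NumPy version. The `order="F"` example at the end only illustrates the extra argument that reshape accepts.

```python
import numpy as np

b = np.ones(5)

views = {
    "b[None, :]": b[None, :],
    "b.reshape(1, 5)": b.reshape(1, 5),
}

for label, v in views.items():
    # shares_memory confirms that no data was copied
    print(label, v.shape, v.strides,
          np.shares_memory(b, v), v.flags["C_CONTIGUOUS"])

# reshape also takes an order argument; "F" fills column by column
f = np.arange(6).reshape(2, 3, order="F")
print(f)
```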

reshape may be faster because it is making fewer changes. But either way the operation shouldn't affect the time of larger calculations much.

For example, for a large b:

In [123]: timeit np.outer(b.reshape(1,-1),b)
1 loops, best of 3: 288 ms per loop
In [124]: timeit np.outer(b[None,:],b)
1 loops, best of 3: 287 ms per loop
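
Outside IPython, a comparable test could be written with the standard `timeit` module. This is only a sketch: the array size and repeat counts are assumptions chosen so that it runs quickly, and the numbers will vary from machine to machine.

```python
import timeit
import numpy as np

b = np.ones(1000)  # assumed size, kept small so the sketch runs quickly

# time only the creation of the view
t_reshape = timeit.timeit(lambda: b.reshape(1, -1), number=10000)
t_newaxis = timeit.timeit(lambda: b[None, :], number=10000)

# time a larger calculation that consumes the view
t_outer_reshape = timeit.timeit(lambda: np.outer(b.reshape(1, -1), b), number=5)
t_outer_newaxis = timeit.timeit(lambda: np.outer(b[None, :], b), number=5)

print("view only:", t_reshape, t_newaxis)
print("np.outer: ", t_outer_reshape, t_outer_newaxis)
```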

That's an interesting observation: b.reshape(1,4).strides -> (32, 8)

Here's my guess. .__array_interface__ is displaying an underlying attribute, and .strides is more like a property (though it may all be buried in C code). The default underlying value is None, and when needed for calculation (or display with .strides) it is calculated from the shape and item size. 32 is the distance in bytes to the end of the 1st row (4x8). np.ones((2,4)).strides has the same (32,8) (and None in __array_interface__).
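
To make that guess concrete, here is how default C-order strides can be derived from a shape and an item size. `c_contiguous_strides` is a hypothetical helper written for illustration; it is not how NumPy is implemented internally, it just reproduces the arithmetic.

```python
import numpy as np

def c_contiguous_strides(shape, itemsize):
    """Hypothetical helper: byte strides of a C-ordered array."""
    strides = []
    step = itemsize
    # the last axis moves one item at a time; each earlier axis
    # skips over everything contained in the axes after it
    for n in reversed(shape):
        strides.append(step)
        step *= n
    return tuple(reversed(strides))

computed = c_contiguous_strides((2, 4), 8)
actual = np.ones((2, 4)).strides
print(computed, actual)
```

The same arithmetic for shape (1, 4) also gives (32, 8), which matches the value quoted above.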

b[None,:], on the other hand, is preparing the array for broadcasting. When an array is broadcast, existing values are used repeatedly. That's what the 0 in (0,8) does.

In [147]: b1=np.broadcast_arrays(b,np.zeros((2,1)))[0]

In [148]: b1.shape
Out[148]: (2, 5)

In [149]: b1.strides
Out[149]: (0, 8)

In [150]: b1.__array_interface__
Out[150]: 
{'data': (3023336880L, False),
 'descr': [('', '<f8')],
 'shape': (2, 5),
 'strides': (0, 8),
 'typestr': '<f8',
 'version': 3}

b1 displays the same as np.ones((2,5)) but has only 5 items.
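
A short sketch of the "only 5 items" point, using `np.broadcast_to` instead of `np.broadcast_arrays` (a swapped-in but closely related function). It assumes a 5-element float64 `b`. The broadcast result is a read-only view, so the change is made through `b` itself.

```python
import numpy as np

b = np.ones(5)
b1 = np.broadcast_to(b, (2, 5))  # read-only broadcast view of b

print(b1.shape, b1.strides)

# a change to b shows up in both rows, because both rows
# are the same 5 items read twice (stride 0 on the first axis)
b[0] = 7.0
rows_identical = np.array_equal(b1[0], b1[1])
print(b1)
```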

np.broadcast_arrays is a function in /numpy/lib/stride_tricks.py. It uses as_strided from the same file. These functions directly play with the shape and strides attributes.
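
As an illustration of setting shape and strides directly, the same (2, 5) view can be built by hand with `as_strided`. This is a sketch assuming a 5-element float64 `b`; `writeable=False` is used because writing through overlapping strided views is easy to get wrong.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

b = np.ones(5)

# repeat the 5 items as 2 rows by giving the first axis a stride of 0 bytes
b2 = as_strided(b, shape=(2, 5), strides=(0, b.itemsize), writeable=False)

print(b2.shape, b2.strides)
print(np.array_equal(b2, np.ones((2, 5))))
```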

hpaulj
  • Cool, ... `__array_interface__`!! – wwii Feb 07 '15 at 18:53
  • Hmmm, `b.reshape(1,4).strides -> (32, 8)`, `b[None,...].strides -> (0, 8)` – wwii Feb 07 '15 at 19:00
  • Interesting. I've added some thoughts on that. – hpaulj Feb 07 '15 at 20:23
  • `...preparing the array for broadcasting` sounds right. So the zero for the first stride dimension/value kind of forces it to *start from the beginning* during broadcasting. That might explain the `C_CONTIGUOUS` difference also. – wwii Feb 07 '15 at 21:55