
I am trying to understand why filling a 64-bit array is slower than filling a 32-bit array.

Here are the examples:

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def test32(long long int size):
    cdef np.ndarray[np.int32_t, ndim=1] index = np.zeros(size, np.int32)
    cdef Py_ssize_t i

    for i in range(size):
        index[i] = i

    return index

@cython.boundscheck(False)
@cython.wraparound(False)
def test64(long long int size):
    cdef np.ndarray[np.int64_t, ndim=1] index = np.zeros(size, np.int64)
    cdef Py_ssize_t i

    for i in range(size):
        index[i] = i

    return index

The timings:

In [4]: %timeit test32(1000000)
1000 loops, best of 3: 1.13 ms per loop

In [5]: %timeit test64(1000000)
100 loops, best of 3: 2.04 ms per loop

I am using a 64-bit computer and the Visual C++ for Python (9.0) compiler.

Edit:

Initializing a 64-bit array and a 32-bit array takes the same amount of time, which means the time difference comes from the filling process.

In [8]: %timeit np.zeros(1000000,'int32')
100000 loops, best of 3: 2.49 μs per loop

In [9]: %timeit np.zeros(1000000,'int64')
100000 loops, best of 3: 2.49 μs per loop

Edit2:

As DavidW points out, this behavior can be replicated with np.arange, which suggests it is expected:

In [7]: %timeit np.arange(1000000,dtype='int32')
10 loops, best of 3: 1.22 ms per loop

In [8]: %timeit np.arange(1000000,dtype='int64')
10 loops, best of 3: 2.03 ms per loop
snowleopard
  • Have you had a look at the C code that is generated by Cython for each function? That might offer some clues. – Alex Riley May 11 '17 at 16:18
  • Without being an expert on such low-level performance optimizations: is it really surprising? You are filling twice the amount of memory after all, which is quite exactly reflected in your timings. – sascha May 11 '17 at 16:18
  • After your edit: (1) if the size is the same for your tests and np.zeros(), it would be surprising to see that np is slower by a factor > 2 (in the int32 case). (2) Just for fun: what happens with np.full() instead of np.zeros()? (Maybe np does something special in the zero case, which could be checked by reading the sources.) (3) Maybe try smaller test cases (with some effect on the benchmarking process) to see if the initial memory allocation is relevant. – sascha May 11 '17 at 16:29
  • I looked at the C code and see no difference between the two. Also the two methods below have a very similar speed in pure python. I tried np.zeros with different sizes and see no difference. – snowleopard May 11 '17 at 16:45
  • If I measure numpy's `arange` (which does basically the same thing as your function) I get the 64-bit version running at half the speed of the 32-bit version, which suggests it's not easily avoided. Similarly for `np.full`. I think it's probably just the time to write double the data (and for some reason zeroing it isn't affected). – DavidW May 11 '17 at 17:16
  • Great point, see edit. – snowleopard May 11 '17 at 17:25

1 Answer


A quick set of measurements (which I initially posted in the comments) suggests that you see very similar behaviour (64 bits taking twice as long as 32 bits) for np.full, np.ones and np.arange, with all three showing similar times to each other (arange being about 10% slower than the other two for me).
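
For reference, here is a minimal sketch of that comparison using the standard-library timeit module (the array size and the loop count of 100 are arbitrary choices, and the absolute numbers will vary by machine):

    import timeit

    setup = "import numpy as np"
    statements = [
        "np.full(1000000, 1, dtype=np.{dt})",
        "np.ones(1000000, dtype=np.{dt})",
        "np.arange(1000000, dtype=np.{dt})",
    ]
    for stmt in statements:
        for dt in ("int32", "int64"):
            s = stmt.format(dt=dt)
            # Average over 100 runs; expect the int64 variants to take
            # roughly twice as long as their int32 counterparts.
            t = timeit.timeit(s, setup=setup, number=100)
            print("%s: %.2f ms per loop" % (s, t / 100 * 1e3))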

I believe this suggests that this is expected behaviour and simply the time it takes to fill the memory (a 64-bit array obviously occupies twice as much memory as a 32-bit one).

The interesting question is then why np.zeros is so uniform (and fast). A complete answer is probably given in this C-based question, but the basic summary is that allocating zeros (as done by the C function calloc) can be done lazily, i.e. it doesn't actually allocate or fill much until you try to write to the memory.
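
A rough way to observe this from Python is the sketch below. It assumes a 4 KiB page size (so one int64 write every 512 elements touches every page), and the exact behaviour depends on the OS and allocator:

    import timeit

    setup = "import numpy as np"
    # Allocation alone is nearly free: calloc can hand back pre-zeroed
    # pages without actually mapping them yet.
    alloc = timeit.timeit("np.zeros(1000000, np.int64)",
                          setup=setup, number=100)
    # Writing one int64 every 512 elements (8 bytes * 512 = 4096 bytes,
    # the assumed page size) forces every page to be faulted in.
    touch = timeit.timeit("np.zeros(1000000, np.int64)[::512] = 1",
                          setup=setup, number=100)
    print("allocate only:           %.3f ms" % (alloc / 100 * 1e3))
    print("allocate + touch pages:  %.3f ms" % (touch / 100 * 1e3))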

DavidW
  • I'd encourage you not to accept this for the moment and to wait a day or so to see if you get anyone who's more of an expert on the C side of it.... – DavidW May 11 '17 at 17:41
  • Very good points. `np.zeros` is what threw me off. The link you provided seems to provide a reasonable explanation for the efficiency. – snowleopard May 11 '17 at 17:47