
I have the following Python code:

H = np.zeros(shape=(N-q+1,q),dtype=complex)
for i in range(0,N-q+1):
    H[i,:] = u[i:q+i]

where N and q are constants and u is a vector long enough that no out-of-bounds error occurs when taking u[i:q+i].
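For concreteness, a minimal runnable version of this setup (the values of N, q, and u below are arbitrary, chosen only for illustration):

import numpy as np

N, q = 100, 6                      # arbitrary example constants
u = np.arange(N, dtype=float)      # any vector with at least N entries works

H = np.zeros(shape=(N-q+1,q), dtype=complex)
for i in range(0, N-q+1):
    H[i,:] = u[i:q+i]              # row i is the length-q window of u starting at i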

I have tried to optimize the code using a list comprehension,

H = np.asarray([u[i:q+i] for i in range(0,N-q+1)])

but np.asarray() makes it slower than the previous code.

Any ideas on how to optimize the assignment of these values?

DOMiguel
  • I remember there being an existing routine that does this or something very closely related, but I can't remember what it was. The simplicity of the sequence of `i` values means you could do this with explicit stride manipulation, if there isn't an existing routine. – user2357112 Apr 06 '16 at 21:51
  • Wait, no, I was thinking of something else. – user2357112 Apr 06 '16 at 21:57
  • Can you add an example that actually runs? – Chiel Apr 06 '16 at 22:01

1 Answer


You could use stride.as_strided:

import numpy.lib.stride_tricks as stride

s = u.strides[0]
H2 = stride.as_strided(u, shape=(N-q+1,q), strides=(s, s)).astype(complex)

Using strides=(s, s) is the key -- in particular, making the first stride s means that moving from one row of H2 to the next advances the index into u by the number of bytes of a single item. Hence consecutive rows contain the same values, each shifted by one.
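As a small illustration of that row overlap (a toy vector, chosen only for illustration, not taken from the question):

import numpy as np
import numpy.lib.stride_tricks as stride

v = np.arange(8)                   # toy data
s = v.strides[0]
# 4 overlapping windows of length 5; element [i, j] is v[i+j]
print(stride.as_strided(v, shape=(4, 5), strides=(s, s)))
# [[0 1 2 3 4]
#  [1 2 3 4 5]
#  [2 3 4 5 6]
#  [3 4 5 6 7]]

Note also that .astype(complex) copies the strided view into an ordinary array, so the overlapping memory of the raw view is not an issue once H2 has been built.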


For example,

import numpy as np
import numpy.lib.stride_tricks as stride

N, q = 10**2, 6
u = np.arange((N-q+1)*(N))

def using_loop(u):
    H = np.zeros(shape=(N-q+1,q),dtype=complex)
    for i in range(0,N-q+1):
        H[i,:] = u[i:q+i]
    return H

def using_stride(u):
    s = u.strides[0]
    H2 = stride.as_strided(u, shape=(N-q+1,q), strides=(s, s)).astype(complex)
    return H2

H = using_loop(u)
H2 = using_stride(u)
assert np.allclose(H, H2)

Since stride.as_strided avoids the Python for-loop, using_stride is faster than using_loop. The advantage grows as N-q (the number of iterations) increases.

With N = 10**2 using_stride is 5x faster:

In [119]: %timeit using_loop(u)
10000 loops, best of 3: 61.6 µs per loop

In [120]: %timeit using_stride(u)
100000 loops, best of 3: 11.9 µs per loop

With N = 10**3 using_stride is 28x faster:

In [122]: %timeit using_loop(u)
1000 loops, best of 3: 636 µs per loop

In [123]: %timeit using_stride(u)
10000 loops, best of 3: 22.4 µs per loop
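A related sketch, assuming SciPy is available: H is a Hankel matrix (entry [i, j] is u[i+j]), so scipy.linalg.hankel builds the same matrix, at the cost of a copy:

from scipy.linalg import hankel

# First column u[0:N-q+1], last row u[N-q:N]; entry [i, j] is u[i+j],
# matching the loop version (cast to complex to match its dtype)
H3 = hankel(u[:N-q+1], u[N-q:N]).astype(complex)
assert np.allclose(H, H3)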
unutbu
  • How is the performance of this? – Chiel Apr 06 '16 at 22:02
  • Pretty cool. Do you see any more room for improvement? – Chiel Apr 06 '16 at 22:04
  • My python says about your stride example: The slowest run took 6.84 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 9.66 µs per loop. This makes the result suspicious – Chiel Apr 06 '16 at 22:05
  • I mean that there is a large spread in the samples that come out of the stride routine. I do reproduce the huge performance increase, so your method is great. – Chiel Apr 06 '16 at 22:10
  • @Chiel: The warning may be due to the array(s) being in the CPU cache. See http://stackoverflow.com/q/29759883/190597. Since I get the same warning for `using_loop` and `using_stride`, and they use the same arrays, I think the comparison is fair. – unutbu Apr 06 '16 at 22:14
  • Using `u.itemsize` for the strides has the issue of assuming the input is contiguous. It's more robust to take the stride from `u.strides`. – user2357112 Apr 07 '16 at 00:09
  • @user2357112: Good point. Thanks for the correction. – unutbu Apr 07 '16 at 00:15
  • Thanks! Huge improvement. – DOMiguel Apr 07 '16 at 08:15