1

How can a get a 2D array containing all possible consecutive sub-arrays of a certain length?

For example, say my array was ['a', 'b', 'c', 'd', 'e'], and n was 3, the result should be

[['a', 'b', 'c']
 ['b', 'c', 'd']
 ['c', 'd', 'e']]

I found a similar question relating to python lists, however I'd like to do this with numpy, as I need to perform this on many different arrays, each of which are fairly large. Basically, speed is an issue here.

Community
  • 1
  • 1
Jezzamon
  • 1,453
  • 1
  • 15
  • 27
  • 2
    You could use `as_strided` (see [here](http://stackoverflow.com/questions/4936620/using-strides-for-an-efficient-moving-average-filter) for an example implementation-- `rolling_window_lastaxis(a, 3)` gives your output.) – DSM Feb 03 '16 at 04:42

6 Answers6

2

Third and final no-loop answer:

def substrings(n, x)
  return numpy.fromfunction(lambda i, j: x[i + j], (len(x) - n + 1, n), 
                            dtype=int)

You'll have to profile all of these solutions yourself to find the one that's most performant. If you like one of these solutions best, please select it as the correct answer.

castle-bravo
  • 1,389
  • 15
  • 34
  • This is the fastest of the methods so far, but still seems to be about the same speed as the list comprehension method. I'll leave the question open for a little longer in case anything better comes up, but otherwise this is the winner :) – Jezzamon Feb 03 '16 at 05:10
1

No loops? Okay, we'll use recursion:

def substrings(n, x):
  if len(x) < n:
    return []

  return [x[:n]] + substrings(n, x[1:])

You can easily modify the above to return arrays:

return array([x[:n]] + substrings(n, x[1:]))

Be warned, if the arrays are very large, you'll exceed your maximum recursion depth and the stack will overflow.

castle-bravo
  • 1,389
  • 15
  • 34
  • haha, the reason I didn't want to use loops was for performance reasons (i.e. I wanted to do it with vector operations somehow). I imagine this will be considerably slower than looping. I've edited the question slightly. – Jezzamon Feb 03 '16 at 04:34
  • See my other answer for a vector/matrix approach. There might be a performance advantage when using that on very large arrays, but for small ones, the advantage will be subsumed by the cost of initializing the array. Ideally, we would have a lazy sparse arrays, like the ones implemented in this [package](https://pypi.python.org/pypi/lazyarray). I haven't used it, but it might be worth experimenting with. – castle-bravo Feb 03 '16 at 04:40
0

Here's another way to do it without writing loops into your code. Initialize a 3-dimensional array with True values in the diagonal plane i == j + k, and take the matrix-vector product with the array.

from numpy import *

def substrings(n, x):
  A = fromfunction(lambda k, j, i: i == j + k,
                   (len(x) - n + 1, n, len(x)))
  return A.dot(x)

This also suffers from some performance issues, but you might be able to make it better by using one of scipy's sparse matrix classes instead of the dense one provided by numpy.

castle-bravo
  • 1,389
  • 15
  • 34
  • 1
    This is clever! However, I think either the lambda calculations or memory allocation slow it down? This one-liner seems about 3 times faster `np.array([x[i:i+n] for i in range(len(x)-n+1)])` – Jezzamon Feb 03 '16 at 05:04
  • You've found the most succinct answer. I think that'll be the fastest one too. – castle-bravo Feb 03 '16 at 05:07
0

Since loops are back on the table:

>>> n = 3
>>> 
>>> array = ['a', 'b', 'c', 'd', 'e']
>>> 
>>> permutations = (array[j:j + n] for j in range(0, len(array) - (n - 1)))
>>> 
>>> list(permutations)
[['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e']]

A more compact variation on this theme:

permutations = (array[j:][0:n] for j in range(len(array) - n + 1))
cdlane
  • 40,441
  • 5
  • 32
  • 81
0

My no (explicit) loops solution:

>>> n = 3
>>> 
>>> array = ['a', 'b', 'c', 'd', 'e']
>>> 
>>> permutations = zip(*map(lambda x: array[x:], range(n)))
>>> 
>>> list(permutations)
[('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e')]
>>> 
cdlane
  • 40,441
  • 5
  • 32
  • 81
0

as mentioned earlier in one of the comments, the fastest approach is to use as_strided function as below (see here)

def subarray1(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

The alternative proposal is significantly slower

def subarray2(a, window):
    return np.fromfunction(lambda i, j: a[i + j], (len(a) - window + 1, window), 
                            dtype=int)

Lets compare their performance:

window = 10
a = np.arange(10000)

%timeit subarray1(a, window) 
4.36 µs ± 47.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit subarray1(a, window)
902 µs ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Amir
  • 307
  • 3
  • 14