5

Short version:

I'm trying to efficiently create an array like x:

input = [0, 1, 2, 3, 4, 5, 6]

x = [ [0,1,2], [1,2,3], [2,3,4], [3,4,5], [4,5,6] ]

I've tried a simple for loop, and it takes too long for the real use case.

Long version:

(extends short version)

I've got a 400k-row dataframe, which I need to partition into arrays of the next n elements starting from each element iterated over. Currently I group it as presented below in the process_data function.

A simple for-loop-based iteration takes forever here (2.5 minutes on my hardware, to be specific). I've searched the itertools and pandas documentation, and tried searching here too, but couldn't find any fitting solution.

My current, super time-consuming implementation:

import numpy as np

class ModelInputParsing(object):
    def __init__(self, data):
        self.parsed_dataframe = data.fillna(0)

    def process_data(self, lb=50):
        # for each row, X gets the next `lb` values of the second-to-last
        # column and Y gets the value of the last column `lb` rows ahead
        self.X, self.Y = [], []
        for i in range(len(self.parsed_dataframe)-lb):
            self.X.append(self.parsed_dataframe.iloc[i:(i+lb), -2])
            self.Y.append(self.parsed_dataframe.iloc[(i+lb), -1])
        return (np.array(self.X), np.array(self.Y))
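
To make the expected shapes concrete, here's a toy call with made-up values (only the column layout matches my real data):

import pandas as pd

df = pd.DataFrame({
    "Bid":      [1.2, 1.2, 1.2, 1.2, 1.2],
    "Changes":  [0.0, 1.0, 2.0, 3.0, 4.0],
    "Expected": [0.0, 0.1, 0.2, 0.3, 0.4],
})

X, Y = ModelInputParsing(df).process_data(lb=3)
# X[0] -> [0.0, 1.0, 2.0]   (Changes, rows 0..2)
# Y[0] -> 0.3               (Expected, row 3)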

The input data looks like this (where Bid is the mentioned input):

    Bid     Changes     Expected
0   1.20102 NaN         0.000000
1   1.20102 0.000000    0.000000
2   1.20102 0.000000    0.000042
3   1.20102 0.000000    0.000017
4   1.20102 0.000000    0.000025
5   1.20102 0.000000    0.000025
6   1.20102 0.000000    0.000100
...

And the output should look like this:

array([[  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          8.34465027e-06,  -8.34465027e-06,   0.00000000e+00],
       [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
         -8.34465027e-06,   0.00000000e+00,   3.33786011e-05],
       [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   3.33786011e-05,   0.00000000e+00],
       ..., 
       [  0.00000000e+00,   8.34465027e-06,   1.66893005e-05, ...,
         -8.34465027e-06,   0.00000000e+00,   0.00000000e+00],
       [  8.34465027e-06,   1.66893005e-05,  -8.34465027e-06, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  1.66893005e-05,  -8.34465027e-06,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   1.66893005e-05]], dtype=float32)
len(x)
399950

Below I've presented x[0] and x[1]. Key here is how the values move one position back in the next array. For example, the first non-zero value moved from position 7 to position 6 (0-based).

The first element:

x[0]
array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,  -4.16040421e-05,   2.49147415e-05,
        -8.34465027e-06,   0.00000000e+00,  -7.49230385e-05,
         ...,
         2.50339508e-05,  -8.34465027e-06,   3.33786011e-05,
        -2.50339508e-05,  -8.34465027e-06,   8.34465027e-06,
        -8.34465027e-06,   0.00000000e+00], dtype=float32)
len(x[0])
50

The second element:

x[1]
array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        -4.16040421e-05,   2.49147415e-05,  -8.34465027e-06,
         0.00000000e+00,  -7.49230385e-05,  -1.58131123e-04,
         ...,
        -8.34465027e-06,   3.33786011e-05,  -2.50339508e-05,
        -8.34465027e-06,   8.34465027e-06,  -8.34465027e-06,
         0.00000000e+00,   3.33786011e-05], dtype=float32)
len(x[1])
50

I'm curious if there is a way to get this done more efficiently, as I'm soon planning to parse datasets that are 20m+ rows long.

trust512
  • 2,188
  • 1
  • 18
  • 18
  • Similar question: [Array into multiple arrays](https://stackoverflow.com/questions/49863276/array-into-multiple-arrays) – pault Apr 23 '18 at 19:05
  • 2
    So in actuality you have a NumPy array? Because in that case, something like [`numpy.lib.stride_tricks.as_strided`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.lib.stride_tricks.as_strided.html) will crush a Python solution performance-wise. – miradulo Apr 23 '18 at 19:28
  • @miradulo if OP plans on doing anything to the data in the arrays then that is pretty risky. Especially since pandas has a lot of vectorized operations – Grant Williams Apr 23 '18 at 19:52
  • @GrantWilliams Sure, depends on your use case. A fancy indexing solution that avoids said danger will also be far faster than a Python solution. – miradulo Apr 23 '18 at 20:42
  • @miradulo I implemented a solution using strides below, and I agree that it could definitely be perfect for this use case, but I just wanted to make sure OP reads the documentation very carefully. Sometimes Python programmers aren't used to having the speed, and the ability to shoot oneself in the foot, that strides give you :) – Grant Williams Apr 23 '18 at 20:44
  • Thanks @miradulo for providing this, I'll most definitely try it. – trust512 Apr 23 '18 at 20:46

5 Answers

7

zip() plus some slicing can do that:

>>> list(zip(input[0:], input[1:], input[2:]))
[(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)]

If you need the list elements to be lists, use this:

>>> list(map(list, zip(input[0:], input[1:], input[2:])))
[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]

In general, if you need n-tuples instead of triples, you can do:

>>> list(zip(*(input[i:] for i in range(3))))
[(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)]

or

>>> list(map(list, zip(*(input[i:] for i in range(3)))))
[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]

Another way to do it:

>>> [input[i:i+3] for i in range(len(input)-3+1)]
[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]

Some benchmarks:

Setup:

import timeit

def ff1(input):
    return list(map(list, zip(input[0:], input[1:], input[2:])))

def ff2(input):
    return list(map(list, zip(*(input[i:] for i in range(3)))))

def ff3(input):
    return [input[i:i+3] for i in range(len(input)-3+1)]

def jg(input):
    for i in range(0, len(input) - 2):
        yield input[i:i+3]

def jg1(input):
    return list(jg(input))

import itertools

def n(input, n=3):
    i = list(itertools.tee(input, n))
    for p, it in enumerate(i):
        next(itertools.islice(it, p, p), None)
    return zip(*i)

def n1(input, _n=3):
    return list(map(list, n(input, _n)))

import numpy as np
from numpy.lib.stride_tricks import as_strided

def strided_groupby(n, l=3):
    s = n.strides[0]
    return as_strided(n, shape=(n.size-l+1,l), strides=(s,s))

Results:

>>> input = list(range(10000))
>>> timeit.timeit(stmt='ff1(input)', globals=globals(), number=1000)
1.4750333260162733
>>> timeit.timeit(stmt='ff2(input)', globals=globals(), number=1000)
1.486136345018167
>>> timeit.timeit(stmt='ff3(input)', globals=globals(), number=1000)
1.6864491199958138
>>> timeit.timeit(stmt='jg1(input)', globals=globals(), number=1000)
2.300399674975779
>>> timeit.timeit(stmt='n1(input)', globals=globals(), number=1000)
2.2269885840360075
>>> input_arr = np.array(input)
>>> timeit.timeit(stmt='strided_groupby(input_arr)', globals=globals(), number=1000)
0.01855822204379365

Note that the inner list conversion wastes a significant amount of CPU cycles. If you can afford to have tuples instead of lists as the innermost sequences (i.e. (0,1,2), (1,2,3), ...), that is going to perform better.

For fairness of comparison, I applied the same list conversion to all algorithms.
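
To illustrate, a tuple-only variant of ff1 would simply drop the inner conversion (the name ff1_tuples is just for this sketch):

def ff1_tuples(input):
    # same windows as ff1, but keeps zip's tuples
    # instead of converting each one to a list
    return list(zip(input[0:], input[1:], input[2:]))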

fferri
  • 18,285
  • 5
  • 46
  • 95
  • Any chance you could update with @nosklo's solution? I'm really curious to see how they compare :) – Grant Williams Apr 23 '18 at 19:25
  • Thanks for benchmarking mine :) I'll upvote in a couple hours when I'm able to again! – Jacob G. Apr 23 '18 at 19:26
  • fferri, think you could add the strided solution I implemented after @miradulo's comment? It's quite a bit sketchier, but I'm curious to see if the potential speed improvement might make it worth it depending on OP's task (since it involves a few million records). – Grant Williams Apr 23 '18 at 20:18
4

If you are using numpy or pandas, you can use strides as @miradulo suggested. You need to be really careful with them, though: the windows are overlapping views into the same memory, so vectorized operations on them can have very unexpected results. But miradulo is right that it should be incredibly fast.

Here is an example implementation:

import numpy as np
from numpy.lib.stride_tricks import as_strided

def strided_groupby(n, l):
    # each window is a view starting one element-stride after the previous
    s = n.strides[0]
    return as_strided(n, shape=(n.size-l+1, l), strides=(s, s))

Adapted from the documentation here: scipy-strides
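
Applied to the example input from the question (a minimal call, using the imports above):

arr = np.array([0, 1, 2, 3, 4, 5, 6])
print(strided_groupby(arr, 3))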

The output looks like:

[[0 1 2]
 [1 2 3]
 [2 3 4]
 [3 4 5]
 [4 5 6]]

Edit: on my machine I got the following results:

>>> timeit.timeit(stmt='ff1(n)', globals=globals(), number=1000)
0.2299177199965925

>>> timeit.timeit(stmt='strided_groupby(n, 3)', globals=globals(), number=1000)
0.012110635001590708

which is actually a very significant difference.
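
Given the caveat above about vectorized operations on overlapping views, one safe (if memory-hungry) pattern is to materialize the view before modifying anything; a minimal sketch:

windows = strided_groupby(arr, 3).copy()  # independent memory, safe to mutate
windows[0, 0] = 99                        # no longer corrupts the other windows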

Grant Williams
  • 1,469
  • 1
  • 12
  • 24
  • 1
    Nice - a larger data set would further emphasize this is constant time complexity. – miradulo Apr 23 '18 at 20:52
  • It's also important to make sure your ordering is the same. I think pandas defaults to C ordering? – Grant Williams Apr 23 '18 at 20:53
  • Yup. IMHO using fancy indexing and avoiding the trouble of this approach may very well be the best option based on what OP plans to do next. – miradulo Apr 23 '18 at 20:56
  • One last option (although semi annoying), would be to use Cython. A little more flexibility on how low level you want to get with the indexing. – Grant Williams Apr 23 '18 at 21:00
1

Is this what you called inefficient?

def answer(data):
    return [[data[k], data[k+1], data[k+2]] for k in range(len(data)-2)]

Ape Toshi
  • 153
  • 1
  • 10
  • That's basically algorithm `ff3` in my answer. – fferri Apr 23 '18 at 19:34
  • Mhm. Beautiful! I was just asking if this is the 'simple for loop' he was talking about, because it seems to me you have to write more complicated code to make it faster? I know it's not about the lines, but still... – Ape Toshi Apr 23 '18 at 19:44
  • `[data[k], data[k+1], data[k+2]] == data[k:k+3]` – fferri Apr 23 '18 at 19:47
  • @AndrijaRadica my simple implementation is presented in the question in the Long version part. The Short version is there only to present the problem in a different (simple) abstraction. – trust512 Apr 23 '18 at 20:17
0

I have another naive solution; however, I'm not fluent in Python, so I can't judge how fast it would be compared to zip:

def chunks(l):
    for i in range(0, len(l) - 2):
        yield l[i:i + 3]

if __name__ == '__main__':
    input = [0, 1, 2, 3, 4, 5, 6]

    print(list(chunks(input)))

Output:

[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]

Note: This assumes your input list has a length of at least 3.

Jacob G.
  • 28,856
  • 5
  • 62
  • 116
0

You can make a function based on itertools. This won't consume more elements from the iterable than needed.

import itertools

def groupwithnext(iterable, n=2):
    iterators = list(itertools.tee(iterable, n))
    for pos, iterator in enumerate(iterators):
        # advance each iterator by the correct number of elements
        next(itertools.islice(iterator, pos, pos), None) 
    return zip(*iterators)

Testing:

data = [0, 1, 2, 3, 4, 5, 6]

for g in groupwithnext(data, 3):
    print(g)

will print

(0, 1, 2)
(1, 2, 3)
(2, 3, 4)
(3, 4, 5)
(4, 5, 6)
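
Since tee and zip are both lazy, this also works on unbounded sources; a quick sketch pulling only four windows from an infinite iterator:

import itertools

naturals = itertools.count()  # 0, 1, 2, 3, ...
print(list(itertools.islice(groupwithnext(naturals, 3), 4)))
# [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5)]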
nosklo
  • 217,122
  • 57
  • 293
  • 297