A very simple example just for understanding.

I have the following pandas dataframe:

import pandas as pd
df = pd.DataFrame({'A':pd.Series([1, 2, 13, 14, 25, 26, 37, 38])})
df 
        A
    0   1
    1   2
    2  13
    3  14
    4  25
    5  26
    6  37
    7  38

Set n = 3

First example

How can I efficiently build a new dataframe df1 like the following?

   D1  D2  D3     T
0   1   2  13    14
1   2  13  14    25
2  13  14  25    26
3  14  25  26    37
4  25  26  37    38

Hint: think of the first n columns as the data (D1 to Dn) and the last column as the target (T). In the first example, each target (e.g. 25) is the element immediately after the n preceding elements (2, 13, 14).

Second example

What if the target lies several elements ahead of the window (e.g. +3)?

   D1  D2  D3     T
0   1   2  13    26
1   2  13  14    37
2  13  14  25    38

Thank you for your help,
Gilberto

P.S. If you think the title can be improved, please suggest how to modify it.

Update

Thanks to @Divakar and this post, the rolling function can be defined as:

import numpy as np
def rolling(a, window):
    shape = (a.size - window + 1, window)
    # Byte step of a; it equals a.itemsize only when a is contiguous
    strides = (a.strides[0], a.strides[0])
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

a = np.arange(1000000000)
b = rolling(a, 4)

In less than 1 second!
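A rough sketch of why this is so fast: the result appears to be only a view onto the original buffer, so no window data is copied. The array below is deliberately smaller than the one above to keep memory use modest, and the function is repeated so the block runs on its own.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def rolling(a, window):
    # Repeated here so the block is self-contained; the step is read from a.strides.
    shape = (a.size - window + 1, window)
    strides = (a.strides[0], a.strides[0])
    return as_strided(a, shape=shape, strides=strides)

a = np.arange(1_000_000)  # smaller than 1 billion, to keep memory modest
b = rolling(a, 4)

print(b.shape)                    # one row per window
print(np.may_share_memory(a, b))  # whether b can overlap a's buffer
# b.nbytes is computed from the shape, not from memory actually allocated,
# so it reports about 4x a.nbytes even though nothing was copied.
print(a.nbytes, b.nbytes)
```

Calling b.copy() would materialize all the windows and really need that much memory.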

Gilberto

2 Answers

Let's see how we can solve it with NumPy tools. Assume the column data is available as a NumPy array, let's call it a. For sliding-window operations like this, NumPy's strides are a very efficient tool: they give views into the input array without actually making copies.
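To make the idea of strides a bit more concrete, here is a small sketch on the sample data. The window length of 3 and the names step and windows are only illustrative; writeable=False is an extra precaution so the overlapping view cannot be written through by accident.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.array([1, 2, 13, 14, 25, 26, 37, 38], dtype=np.int64)
step = a.strides[0]  # bytes between consecutive elements of a

# Row stride = step: each row starts one element later than the previous one.
# Column stride = step: each column also advances by one element.
# The rows therefore overlap and reuse the same memory.
windows = as_strided(a, shape=(a.size - 2, 3), strides=(step, step),
                     writeable=False)
print(step)
print(windows)

# Because windows is a view, a change to a is visible through it.
a[0] = 100
print(windows[0, 0])
```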

Let's apply this directly to the sample data, starting with case #1 -

In [29]: a  # Input data
Out[29]: array([ 1,  2, 13, 14, 25, 26, 37, 38])

In [30]: m = a.strides[0] # Stride in bytes between consecutive elements

In [31]: n = 3 # Window length (number of D columns)

In [32]: nrows = a.size - n # Number of rows in the output

In [33]: a2D = np.lib.stride_tricks.as_strided(a,shape=(nrows,n+1),strides=(m,m))

In [34]: a2D
Out[34]: 
array([[ 1,  2, 13, 14],
       [ 2, 13, 14, 25],
       [13, 14, 25, 26],
       [14, 25, 26, 37],
       [25, 26, 37, 38]])

In [35]: np.may_share_memory(a,a2D) 
Out[35]: True    # a2D is a view into a
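To get from a2D to the dataframe asked for in the question, one possible sketch is below. The column names D1..D3 and T come from the question; the explicit copy() is my own choice so that df1 owns its data instead of pointing at the overlapping view.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided

a = np.array([1, 2, 13, 14, 25, 26, 37, 38])
n = 3
m = a.strides[0]
a2D = as_strided(a, shape=(a.size - n, n + 1), strides=(m, m))

# D1..Dn for the data columns, T for the target column
columns = ["D%d" % (i + 1) for i in range(n)] + ["T"]
df1 = pd.DataFrame(a2D.copy(), columns=columns)  # copy: df1 owns its data
print(df1)
```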

Case #2 is similar, with an additional parameter n2 for how far ahead the Target column lies -

In [36]: n2 = 3 # Additional param: target is n2 elements after the window

In [37]: nrows = a.size - n - n2 + 1

In [38]: part1 = np.lib.stride_tricks.as_strided(a,shape=(nrows,n),strides=(m,m))

In [39]: part1 # These are D1, D2, D3, etc.
Out[39]: 
array([[ 1,  2, 13],
       [ 2, 13, 14],
       [13, 14, 25]])

In [43]: part2 = a[n+n2-1:] # This is target col

In [44]: part2
Out[44]: array([26, 37, 38])
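Both cases can be folded into one helper, sketched below. The name make_supervised and the parameter name ahead are hypothetical (ahead plays the role of n2 above); ahead=1 should reproduce case #1 and ahead=3 case #2.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided

def make_supervised(a, n, ahead=1):
    """Hypothetical helper: n data columns plus a target `ahead` steps later."""
    a = np.ascontiguousarray(a)
    nrows = a.size - n - ahead + 1
    if nrows <= 0:
        raise ValueError("array too short for this n and ahead")
    m = a.strides[0]
    # D1..Dn: overlapping windows of length n (a read-only view)
    part1 = as_strided(a, shape=(nrows, n), strides=(m, m), writeable=False)
    # T: the element `ahead` positions after the end of each window
    part2 = a[n + ahead - 1:]
    columns = ["D%d" % (i + 1) for i in range(n)] + ["T"]
    # column_stack copies both parts into one new 2D array
    return pd.DataFrame(np.column_stack([part1, part2]), columns=columns)

a = np.array([1, 2, 13, 14, 25, 26, 37, 38])
print(make_supervised(a, n=3, ahead=1))
print(make_supervised(a, n=3, ahead=3))
```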
Divakar
  • Thank you @Divakar, it works. It is a little hard for me to understand, but that is due to my lack of knowledge. I will study your solution. I wonder if there is a solution with pandas instructions. – Gilberto Oct 24 '16 at 12:32
  • @Gilberto Here's a broadcasting based solution to a related problem, if that's easier to follow - http://stackoverflow.com/a/40169790/3293881 – Divakar Oct 24 '16 at 12:36
  • I saw the time results: it's astonishing! It's worth hours of study, definitely! Thank you again, @Divakar. – Gilberto Oct 24 '16 at 13:02
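On the question of a pandas-only route raised in the comments, one possible sketch uses Series.shift. The helper name shift_windows and the parameter ahead are hypothetical, and it assumes the column has no missing values of its own, since dropna() is what trims the incomplete rows at the end. Unlike the strided view, it copies the data once per column.

```python
import pandas as pd

def shift_windows(s, n, ahead=1):
    """Hypothetical pandas-only helper built on Series.shift."""
    # Column D(i+1) is the series shifted up by i positions
    cols = {"D%d" % (i + 1): s.shift(-i) for i in range(n)}
    # The target sits `ahead` positions after the last data column
    cols["T"] = s.shift(-(n + ahead - 1))
    # Shifting pads the tail with NaN; drop those rows and restore the dtype
    return pd.DataFrame(cols).dropna().astype(s.dtype)

df = pd.DataFrame({'A': [1, 2, 13, 14, 25, 26, 37, 38]})
print(shift_windows(df['A'], n=3))
print(shift_windows(df['A'], n=3, ahead=3))
```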

I found another method: view_as_windows from scikit-image.

import numpy as np
from skimage.util.shape import view_as_windows
window_shape = (4, )

aa = np.arange(1000000000) # 1 billion!
bb = view_as_windows(aa, window_shape)
bb

array([[        0,         1,         2,         3],
       [        1,         2,         3,         4],
       [        2,         3,         4,         5],
       ..., 
       [999999994, 999999995, 999999996, 999999997],
       [999999995, 999999996, 999999997, 999999998],
       [999999996, 999999997, 999999998, 999999999]])

Around 1 second.

What do you think?
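For comparison, NumPy 1.20 and later include sliding_window_view, which seems to give the same kind of windows without needing scikit-image. This is a small sketch on a short array rather than a timing claim; the returned view is read-only by default.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

aa = np.arange(10)  # short array, just to show the shape of the result
bb = sliding_window_view(aa, window_shape=4)

print(bb.shape)            # (number of windows, window length)
print(bb[0], bb[-1])       # first and last window
print(bb.flags.writeable)  # the view is read-only unless writeable=True is passed
```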

Gilberto