
I have a DataFrame representing a time series of feature values at one-second intervals:

>>> df
   FeatA   FeatB   FeatC
0      1       6      11
1      2       7      12
2      3       8      13
3      4       9      14
4      5      10      15

I want to use this to construct a training data set for a Scikit-Learn model. To each row I want to append the feature values from the previous 15 minutes (900 rows), in the following format:

>>> df
   FeatA  FeatB  FeatC  FeatA_T1  FeatB_T1  FeatC_T1  FeatA_T2  FeatB_T2 ...
0      1      6     11       NaN       NaN       NaN       NaN       NaN
1      2      7     12         1         6        11       NaN       NaN
2      3      8     13         2         7        12         1         6
3      4      9     14         3         8        13         2         7
4      5     10     15         4         9        14         3         8

Currently the code I use is essentially

for i in range(1, 900):
    for feature in ["FeatA", "FeatB", "FeatC"]:
        df[f"{feature}_T{i}"] = df[feature].shift(i)

There are 23,400 rows and 137 features in the original DataFrame, so this loop would insert over 123,000 columns one at a time; it is far too slow to be usable. Since this is being fed to Scikit-Learn, the final data needs to be a numpy array (shown below). I'm fairly certain it would be faster to do these manipulations in numpy instead of pandas; however, all the examples I've found use the pandas shift function.

How can I construct this dataset efficiently from the original DataFrame? Are numpy array functions the way to go?

Expected Result

array([[ 1.,  6., 11., nan, nan, nan, nan, nan, nan],
       [ 2.,  7., 12.,  1.,  6., 11., nan, nan, nan],
       [ 3.,  8., 13.,  2.,  7., 12.,  1.,  6., 11.],
       [ 4.,  9., 14.,  3.,  8., 13.,  2.,  7., 12.],
       [ 5., 10., 15.,  4.,  9., 14.,  3.,  8., 13.]])

N.B. Ultimately I plan to slice off the first 900 and last 900 rows of the array so any result that doesn't include them would work.

bphi
  • I don't know that `numpy` is going to solve your problem either... IIUC, you're looking to create 900 shifts for every one of 137 features, and your data begins with 23,000 rows. That's 2.8 Billion cells, and over 114,000 features you'll send to the model. How sure are you about that? – ALollz Sep 23 '19 at 19:24
  • I may end up resampling to a lower frequency or reducing the number of shifts, but I don't think reducing the scale changes the underlying question. – bphi Sep 23 '19 at 19:28
  • If you used a Recurrent Neural Network you wouldn't need to do this. – Michael Gardner Sep 23 '19 at 19:35
  • As strided: `window = 3; as_strided(df.values.ravel(), (len(df)-window+1, window*df.shape[1]), (8,8))` – Quang Hoang Sep 23 '19 at 19:42

1 Answer


We can leverage scikit-image's view_as_windows, which is built on np.lib.stride_tricks.as_strided, to get sliding windowed views into a NaN-padded version of the input. Being a view, the result is efficient in both memory and performance.

The implementation would look something like this -

import numpy as np
from skimage.util.shape import view_as_windows

def sliding_windows_grouped(a, W, fillval=np.nan):
    # W : number of rows to be grouped for each row in the output array
    m, n = a.shape
    # Append (W-1) rows' worth of fill values to the row-reversed,
    # flattened input, so each window reads newest row first
    ext = np.full((W-1)*n, fillval)
    a_ext = np.concatenate((a[::-1].ravel(), ext))
    # Each overlapping window of length n*W (taken every n elements) is one
    # output row; the final flip restores the original row order
    return view_as_windows(a_ext, n*W, step=n)[::-1]

More info on the use of the as_strided-based view_as_windows.

Sample runs -

In [63]: a
Out[63]: 
array([[55, 58],
       [75, 78],
       [78, 20],
       [94, 32],
       [47, 98]])

In [64]: sliding_windows_grouped(a, W=2)
Out[64]: 
array([[55., 58., nan, nan],
       [75., 78., 55., 58.],
       [78., 20., 75., 78.],
       [94., 32., 78., 20.],
       [47., 98., 94., 32.]])

In [65]: sliding_windows_grouped(a, W=3)
Out[65]: 
array([[55., 58., nan, nan, nan, nan],
       [75., 78., 55., 58., nan, nan],
       [78., 20., 75., 78., 55., 58.],
       [94., 32., 78., 20., 75., 78.],
       [47., 98., 94., 32., 78., 20.]])

In [66]: sliding_windows_grouped(a, W=4)
Out[66]: 
array([[55., 58., nan, nan, nan, nan, nan, nan],
       [75., 78., 55., 58., nan, nan, nan, nan],
       [78., 20., 75., 78., 55., 58., nan, nan],
       [94., 32., 78., 20., 75., 78., 55., 58.],
       [47., 98., 94., 32., 78., 20., 75., 78.]])
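
As a sanity check, applying the function to the question's DataFrame (with W=3 standing in for the real W=900) should reproduce the expected array. Continuing the session above:

In [67]: import pandas as pd

In [68]: df = pd.DataFrame({"FeatA": [1, 2, 3, 4, 5],
    ...:                    "FeatB": [6, 7, 8, 9, 10],
    ...:                    "FeatC": [11, 12, 13, 14, 15]})

In [69]: sliding_windows_grouped(df.to_numpy(dtype=float), W=3)
Out[69]: 
array([[ 1.,  6., 11., nan, nan, nan, nan, nan, nan],
       [ 2.,  7., 12.,  1.,  6., 11., nan, nan, nan],
       [ 3.,  8., 13.,  2.,  7., 12.,  1.,  6., 11.],
       [ 4.,  9., 14.,  3.,  8., 13.,  2.,  7., 12.],
       [ 5., 10., 15.,  4.,  9., 14.,  3.,  8., 13.]])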

Marginal improvement

We can skip the final flip in the earlier solution by using np.lib.stride_tricks.as_strided directly with a negative row stride, like so -

def sliding_windows_grouped_v2(a, W, fillval=np.nan):
    # W : number of rows to be grouped for each row in the output array
    m, n = a.shape
    ext = np.full((W-1)*n, fillval)
    a_ext = np.concatenate((a[::-1].ravel(), ext))
    strided = np.lib.stride_tricks.as_strided
    s = a_ext.strides[0]
    # Start at the block holding the original first row and walk backwards
    # through a_ext with a negative row stride, so no final flip is needed
    return strided(a_ext[(m-1)*n:], strides=(-n*s, s), shape=(m, W*n))
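
A quick check (treating NaNs as equal) confirms the two versions agree:

In [70]: np.allclose(sliding_windows_grouped(a, W=3),
    ...:             sliding_windows_grouped_v2(a, W=3), equal_nan=True)
Out[70]: True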
Divakar
  • Thank you for your excellent answer! Could you point me in the right direction in terms of using the resulting views? I need to aggregate a few of them and shuffle; np.concatenate caused a memory spike. Is there a similar function or group of functions I should look to use instead? – bphi Sep 23 '19 at 20:38
  • @bphi Would it be okay to have a 3D array as output? So, for your expected output for the given sample in the question, it would be `(5,3,3)` shaped, with each row becoming a 2D slice instead. – Divakar Sep 23 '19 at 20:42
  • I think my question is whether there is any way to provide the resulting view to a Scikit-Learn fit function without realizing the entirety of it into memory, some type of chunking/iteration perhaps. That may be better suited to a separate question though – bphi Sep 23 '19 at 20:50
  • @bphi That concatenation is an essential step here. Can you compromise on skipping the first `W-1` rows in the output or processing them separately? If not, I can't think of any other way out. – Divakar Sep 23 '19 at 21:02
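
For reference, here is a minimal sketch of that compromise (the function name is my own, not from the answer above): if the first W-1 rows can be skipped, as the question's N.B. suggests, the unpadded array can be windowed directly, so the result is a pure view with no concatenated copy at all. Note that within each output row the blocks run oldest to newest, the reverse of the padded version, and output row i corresponds to input row i+W-1.

import numpy as np
from skimage.util.shape import view_as_windows

def sliding_windows_grouped_full(a, W):
    # Full windows only: no padding and no concatenation, so the result
    # is a zero-copy view whenever `a` is already C-contiguous.
    # Output row i holds input rows i .. i+W-1 (oldest first);
    # the shape is (m-W+1, W*n).
    m, n = a.shape
    flat = np.ascontiguousarray(a).ravel()
    return view_as_windows(flat, W*n, step=n)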