I am preprocessing a time-series dataset, changing its shape from 2 dimensions (datapoints, features) to 3 dimensions (datapoints, time_window, features).

In this context, the time window (sometimes also called look-back) is the number of previous time steps/datapoints used as input variables to predict the next time period. In other words, the time window is how much past data the machine learning algorithm takes into account for a single prediction.

The issue with this approach (or at least with my implementation) is that it is quite inefficient in terms of memory usage: consecutive windows overlap, so the same rows are stored many times and the input data becomes very heavy.

This is the function that I have been using so far to reshape the input data into a 3 dimensional structure.

import numpy as np
from sys import getsizeof

def time_framer(data_to_frame, window_size=1):
    """It transforms a 2d dataset into 3d based on a specific size;
    original function can be found at:
    https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
    """
    n_datapoints = data_to_frame.shape[0] - window_size
    framed_data = np.empty(
        shape=(n_datapoints, window_size, data_to_frame.shape[1],)).astype(np.float32)

    for index in range(n_datapoints):
        framed_data[index] = data_to_frame[index:(index + window_size)]
        print(framed_data.shape)

    # it prints the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)

    # quick and dirty quality test to check if the data has been correctly reshaped        
    test1=list(set(framed_data[0][1]==framed_data[1][0]))
    if test1[0] and len(test1)==1:
        print('Data is correctly framed')

    return framed_data

It was suggested that I use numpy's strides trick to overcome this problem and reduce the size of the reshaped data. Unfortunately, every resource I have found so far on this subject focuses on applying the trick to a 2-dimensional array, such as this excellent tutorial. I have been struggling with my use case, which involves a 3-dimensional output. Here is the best I came up with; however, it neither reduces the size of framed_data nor frames the data correctly, as it does not pass the quality test.

I am quite sure that my error is in the strides parameter, which I do not fully understand. The new_strides below are the only values I managed to feed to as_strided successfully.

from numpy.lib.stride_tricks import as_strided

def strides_trick_time_framer(data_to_frame, window_size=1):

    new_strides = (data_to_frame.strides[0],
                   data_to_frame.strides[0]*data_to_frame.shape[1] ,
                   data_to_frame.strides[0]*window_size)

    n_datapoints = data_to_frame.shape[0] - window_size
    print('striding.....')
    framed_data = as_strided(data_to_frame, 
                             shape=(n_datapoints, # .flatten() here did not change the outcome
                                    window_size,
                                    data_to_frame.shape[1]),                   
                                    strides=new_strides).astype(np.float32)
    # it prints the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)

    # quick and dirty test to check if the data has been correctly reshaped        
    test1=list(set(framed_data[0][1]==framed_data[1][0]))
    if test1[0] and len(test1)==1:
        print('Data is correctly framed')

    return framed_data

Any help would be highly appreciated!

Gbsbvm
  • I edited the question since I actually convert to float32 in order to save space. I don't know if it changes anything – Gbsbvm Sep 04 '18 at 14:35

2 Answers

You can use the stride template function window_nd I made here.

Then, to stride over just the first dimension, you only need:

framed_data = window_nd(data_to_frame, window_size, axis=0)

I haven't found a built-in window function yet that can work over arbitrary axes, so unless a new one has been implemented in scipy.signal or skimage recently, that's probably your best bet.
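For what it's worth, NumPy 1.20 and later ship numpy.lib.stride_tricks.sliding_window_view, which may cover this case without a custom helper. Below is a minimal sketch, assuming NumPy >= 1.20; the name frame_with_sliding_window and the toy array X are mine, purely for illustration. The function appends the window axis last, so a transpose is needed to get (datapoints, time_window, features).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def frame_with_sliding_window(data_to_frame, window_size):
    """Hypothetical helper: read-only (n_windows, window_size, n_features) view."""
    # sliding_window_view puts the window axis last:
    # (n_windows, n_features, window_size)
    windows = sliding_window_view(data_to_frame, window_size, axis=0)
    # Swapping the last two axes only changes strides, so this is still a view.
    return windows.transpose(0, 2, 1)


X = np.arange(24).reshape(8, 3)
framed = frame_with_sliding_window(X, 4)
print(framed.shape)                 # (5, 4, 3)
print(np.shares_memory(framed, X))  # True
```

Note that this yields n - window_size + 1 windows, one more than the question's time_framer, which stops at n - window_size.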

EDIT: To see the memory savings, you will need to use the method described by @ali_m here, because the basic ndarray.nbytes does not account for shared memory.

def find_base_nbytes(obj):
    if obj.base is not None:
        return find_base_nbytes(obj.base)
    return obj.nbytes
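As a rough illustration of how that helper behaves, here is a sketch comparing a strided view with a copy made by astype. The toy array and the explicit strides are my own assumptions (they follow the (row, row, column) pattern for an 8x3 float64 array), not code from the question.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def find_base_nbytes(obj):
    # Walk the .base chain until reaching the object that owns the buffer.
    if obj.base is not None:
        return find_base_nbytes(obj.base)
    return obj.nbytes


X = np.arange(24, dtype=np.float64).reshape(8, 3)
view = as_strided(
    X,
    shape=(5, 4, 3),
    strides=(X.strides[0], X.strides[0], X.strides[1]),
    writeable=False,
)
copy = view.astype(np.float32)  # astype allocates a new, full-size array

print(view.nbytes)             # 480: naive itemsize * size
print(find_base_nbytes(view))  # 192: the buffer actually shared with X
print(find_base_nbytes(copy))  # 240: the copy owns its own 60 float32 values
```

This is why converting the strided result to float32 appears to lose the saving: the copy no longer has a base to walk back to.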
Daniel F
  • the new array passes the quality check, but the size in terms of memory did not improve – Gbsbvm Sep 04 '18 at 07:21
  • Huh. `ndarray.nbytes` seems to be a naive `ndarray.itemsize * ndarray.size`. It doesn't take into account shared elements at all. If you want to determine the actual size of the strided array, look [here](https://stackoverflow.com/questions/34637875/size-of-numpy-strided-array-broadcast-array-in-memory) for a method. – Daniel F Sep 04 '18 at 07:56
  • I see the memory improvement only if I use `sys.getsizeof` (no improvement with the base attribute) and only if I keep the dtype as float64. If I use float32 in order to save some more memory, the resulting array does not have the base attribute and does not improve in terms of memory – Gbsbvm Sep 04 '18 at 15:01
  • Be careful when using `getsizeof` with arrays: https://stackoverflow.com/questions/52129595/why-the-size-of-numpy-array-is-different – hpaulj Sep 04 '18 at 15:52
For this X:

In [734]: X = np.arange(24).reshape(8,3)
In [735]: X.strides
Out[735]: (24, 8)

this as_strided produces the same array as your time_framer

In [736]: np.lib.stride_tricks.as_strided(X, 
            shape=(X.shape[0]-3, 3, X.shape[1]), 
            strides=(24, 24, 8))
Out[736]: 
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 3,  4,  5],
        [ 6,  7,  8],
        [ 9, 10, 11]],

       [[ 6,  7,  8],
        [ 9, 10, 11],
        [12, 13, 14]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[12, 13, 14],
        [15, 16, 17],
        [18, 19, 20]]])

The last dimension is strided just like X, and so is the second-to-last. The first dimension advances one row per window, so it too gets X.strides[0]. The window size therefore only affects the shape, not the strides.

So in your as_strided version just use:

 new_strides = (data_to_frame.strides[0],
                data_to_frame.strides[0] ,
                data_to_frame.strides[1])
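Putting those strides into a complete function might look like the sketch below. The names strided_time_framer and loop_time_framer are hypothetical; the second is only a copy-based reference used to check the view. It assumes a 2-D input and casts to float32 before striding, so that no later astype is needed.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def strided_time_framer(data_to_frame, window_size=2):
    """Sketch of a strided framer; returns a read-only view, not a copy."""
    n_datapoints = data_to_frame.shape[0] - window_size  # same count as time_framer
    row_stride, col_stride = data_to_frame.strides
    return as_strided(
        data_to_frame,
        shape=(n_datapoints, window_size, data_to_frame.shape[1]),
        strides=(row_stride, row_stride, col_stride),
        writeable=False,  # windows overlap in memory, so guard against writes
    )


def loop_time_framer(data_to_frame, window_size=2):
    """Copy-based reference, used here only to verify the view."""
    n_datapoints = data_to_frame.shape[0] - window_size
    return np.stack([data_to_frame[i:i + window_size] for i in range(n_datapoints)])


# Cast the 2-D data first; casting the 3-D view afterwards would force a full copy.
X = np.arange(24, dtype=np.float32).reshape(8, 3)
view = strided_time_framer(X, 4)
print(view.shape)                                    # (4, 4, 3)
print(np.array_equal(view, loop_time_framer(X, 4)))  # True
print(np.array_equal(view[0, 1], view[1, 0]))        # True, the question's quality test
```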

A minor correction: set the default window size to 2 or larger. A window size of 1 produces an indexing error in the test:

framed_data[0,1]==framed_data[1,0]

Looking at getsizeof:

In [754]: sys.getsizeof(X)
Out[754]: 112
In [755]: X.nbytes
Out[755]: 192

Wait, why is the getsizeof of X smaller than its nbytes? Because it is a view (see line [734] above).

In [756]: sys.getsizeof(X.copy())
Out[756]: 304

As noted in another SO question, getsizeof has to be used with caution:

Why the size of numpy array is different?

Now for the expanded copy:

In [757]: x2=time_framer(X,4)
...
In [758]: x2.strides
Out[758]: (96, 24, 8)
In [759]: x2.nbytes
Out[759]: 384
In [760]: sys.getsizeof(x2)
Out[760]: 512

and the strided version

In [761]: x1=strides_trick_time_framer(X,4)
...
In [762]: x1.strides
Out[762]: (24, 24, 8)
In [763]: sys.getsizeof(x1)
Out[763]: 128
In [764]: x1.astype(int).strides
Out[764]: (96, 24, 8)
In [765]: sys.getsizeof(x1.astype(int))
Out[765]: 512

The size of x1 is just like that of a view (128 because it's 3d). But if we try to change its dtype, it makes a copy, and the strides and size are the same as those of x2.

Many operations on x1 will lose the strided size advantage: x1.ravel(), x1+1, etc. It's mainly reduction operations like mean and sum that produce a real space savings.
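A quick way to see which operations keep the view and which ones copy is np.shares_memory. The following is a sketch on a toy array of my own (the shapes mirror the X used above, but the float64 dtype is an assumption):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

X = np.arange(24, dtype=np.float64).reshape(8, 3)
x1 = as_strided(
    X,
    shape=(4, 4, 3),
    strides=(X.strides[0], X.strides[0], X.strides[1]),
    writeable=False,
)

print(np.shares_memory(x1, X))                     # True: still a view
print(np.shares_memory(x1.astype(np.float32), X))  # False: astype copied
print(np.shares_memory(x1 + 1, X))                 # False: arithmetic allocates a full result
print(np.shares_memory(x1.ravel(), X))             # False: overlapping windows cannot be flattened in place

# A reduction over the window axis only allocates the small result.
window_means = x1.mean(axis=1)
print(window_means.shape)                          # (4, 3)
```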

hpaulj
  • By using `sys.getsizeof` I see the improvements, but as I noted in my edit, I actually convert the dtype to float32 in order to save memory; as float32 the "strided" array does not become any lighter – Gbsbvm Sep 04 '18 at 15:05
  • The `as_strided` array is a `view` of the original. That is, it uses the original's data buffer. The `astype` forces it to make a copy, and it will be a full one. Compare the `strides` attribute with and without the `astype`. There are a limited number of things you can do with an `as_strided` array before it creates a full-blown copy. – hpaulj Sep 04 '18 at 15:50
  • I added some `getsizeof` tests. – hpaulj Sep 04 '18 at 16:10
  • So, `getsizeof` is not useful for a view, which is what the strides trick returns; `astype` on a view creates a copy of the original, neutralizing the benefit of the strides trick; @Daniel F pointed out that `nbytes` is a naive `ndarray.itemsize * ndarray.size` that doesn't take into account shared elements; – Gbsbvm Sep 04 '18 at 16:44
  • Right, there isn't a meaningful measure of the memory savings with `as_strided`. As a view it doesn't take any extra memory (other than the array object overhead), and a copy is expanded to full size. – hpaulj Sep 04 '18 at 17:04