I am preprocessing a time series dataset, changing its shape from 2 dimensions (datapoints, features) to 3 dimensions (datapoints, time_window, features).
Here the time window (sometimes also called look back) is the number of previous time steps/datapoints that are used as input variables to predict the next time period. In other words, the time window is how much past data the machine learning algorithm takes into account for a single prediction in the future.
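To make the target shape concrete, here is a minimal toy example (the values and the window size are made up purely for illustration):

import numpy as np

# toy input: 5 datapoints, 2 features
data = np.arange(10, dtype=np.float32).reshape(5, 2)

# with window_size=3, sample i of the framed output stacks rows i..i+2 of the input,
# so the output has shape (n_windows, window_size, n_features)
window_size = 3
windows = np.stack([data[i:i + window_size]
                    for i in range(data.shape[0] - window_size)])
print(windows.shape)  # (2, 3, 2)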
The issue with this approach (or at least with my implementation) is that it is quite memory-inefficient: every datapoint is duplicated across all the windows it belongs to, so the framed input grows by roughly a factor of the window size.
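To give a rough idea of the scale (the numbers are only an example): a float32 array of shape (100000, 50) takes about 20 MB, but once framed with window_size=100 the output has shape (99900, 100, 50) and takes roughly 2 GB, i.e. about window_size times the original footprint.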
This is the function I have been using so far to reshape the input data into a 3-dimensional structure.
from sys import getsizeof

import numpy as np


def time_framer(data_to_frame, window_size=1):
    """Transform a 2-d dataset into a 3-d one based on the window size;
    the original function can be found at:
    https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
    """
    n_datapoints = data_to_frame.shape[0] - window_size
    # allocate the full (n_windows, window_size, n_features) output up front
    framed_data = np.empty(
        shape=(n_datapoints, window_size, data_to_frame.shape[1])).astype(np.float32)
    # copy each window of consecutive rows into its own slice of the output
    for index in range(n_datapoints):
        framed_data[index] = data_to_frame[index:(index + window_size)]
    print(framed_data.shape)
    # print the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)
    # quick and dirty quality test: the second row of the first window
    # must equal the first row of the second window
    test1 = list(set(framed_data[0][1] == framed_data[1][0]))
    if test1[0] and len(test1) == 1:
        print('Data is correctly framed')
    return framed_data
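For context, this is how I call it; the array below is just random data so that the shapes are reproducible:

import numpy as np

# illustrative input: 1000 datapoints with 8 features each
raw_data = np.random.rand(1000, 8).astype(np.float32)
framed = time_framer(raw_data, window_size=10)
# framed.shape is (990, 10, 8): 990 windows of 10 consecutive rows each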
It has been suggested to me to use numpy's stride tricks to overcome this problem and reduce the size of the reshaped data. Unfortunately, every resource I have found so far on the subject focuses on applying the trick to a 2-dimensional array, such as this excellent tutorial. I have been struggling with my use case, which involves a 3-dimensional output. Here is the best I came up with; however, it neither reduces the size of framed_data nor frames the data correctly, as it does not pass the quality test.
I am quite sure that my error is in the strides parameter, which I have not fully understood. The new_strides below are the only values I managed to feed to as_strided without errors.
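For reference, this is my understanding of what .strides reports on the original 2-d array (the shape is just illustrative); these are the byte steps I have been trying to combine into new_strides:

import numpy as np

# assuming a C-contiguous float32 array
a = np.zeros((1000, 8), dtype=np.float32)
# strides are in bytes: moving to the next row skips 8 * 4 = 32 bytes,
# moving to the next column skips 4 bytes (the float32 itemsize)
print(a.strides)  # (32, 4)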
from sys import getsizeof

import numpy as np
from numpy.lib.stride_tricks import as_strided


def strides_trick_time_framer(data_to_frame, window_size=1):
    # my attempt at the byte strides of the 3-d view; this is most likely where the error is
    new_strides = (data_to_frame.strides[0],
                   data_to_frame.strides[0] * data_to_frame.shape[1],
                   data_to_frame.strides[0] * window_size)
    n_datapoints = data_to_frame.shape[0] - window_size
    print('striding.....')
    framed_data = as_strided(data_to_frame,  # using .flatten() here did not change the outcome
                             shape=(n_datapoints,
                                    window_size,
                                    data_to_frame.shape[1]),
                             strides=new_strides).astype(np.float32)
    # print the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)
    # quick and dirty test to check if the data has been correctly reshaped
    test1 = list(set(framed_data[0][1] == framed_data[1][0]))
    if test1[0] and len(test1) == 1:
        print('Data is correctly framed')
    return framed_data
Any help would be highly appreciated!