
I have a pandas DataFrame with more than 5 million rows. I try to slice it (time series to supervised frame) using the function below, but it consumes too much RAM and crashes for long sequences (large look_back and forecast_horizon). As far as I know, lists are more compact than arrays, so I start with an empty list and convert it to an np.array at the end. Is there a way to do this more compactly? Assume sequence.shape is (5000000, 20), X.shape is (~5000000, look_back, 20), and y.shape is (~5000000, forecast_horizon).

import numpy as np

def split_sequence(sequence, look_back, forecast_horizon):
    X, y = list(), list()
    for i in range(len(sequence)):
        lag_end = i + look_back
        forecast_end = lag_end + forecast_horizon
        # stop once the forecast window would run past the end of the data
        if forecast_end > len(sequence):
            break
        # input window: all columns; target window: the first column only
        seq_x = sequence[i:lag_end]
        seq_y = sequence['first column'][lag_end:forecast_end]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)
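One memory-light alternative to the loop above (a sketch, not from the question): NumPy's sliding_window_view builds the window arrays as zero-copy views into the original buffer, so no per-window copies are made until the result is materialized. The function name split_sequence_views and the assumption that the target is column 0 of a 2-D values array (the question's 'first column' after .to_numpy()) are mine.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def split_sequence_views(values, look_back, forecast_horizon):
    """values: 2-D array of shape (n, n_features); column 0 is the target."""
    n = len(values) - look_back - forecast_horizon + 1
    # (n, look_back, n_features): zero-copy views over the rows
    X = sliding_window_view(values, look_back, axis=0)[:n].transpose(0, 2, 1)
    # targets: forecast_horizon-length windows of column 0, offset by look_back
    y = sliding_window_view(values[:, 0], forecast_horizon)[look_back:look_back + n]
    return X, y
```

X and y here are views, so feeding them to a framework that copies batch-by-batch keeps peak memory near the size of the original (5000000, 20) array rather than the full 3-D expansion.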
  • Use chunks when creating the dataframe, or split like this when the dataframe already exists: https://stackoverflow.com/questions/44729727/pandas-slice-large-dataframe-into-chunks – Vitaliy Korolyk Oct 14 '22 at 21:47
  • Doesn't that break the continuity of the time series? For example, the last 10 rows of the first part are not connected to the first rows of the second part if I want to predict 1 hour from the last 10 hours. – ast Oct 14 '22 at 21:57
  • This produces a huge 3D array - do you need it all at once? Could you grab slices as needed instead? – tdelaney Oct 14 '22 at 22:19
  • 10 slices (500k each) with look_back=100 crashes Colab Pro. I don't want to slice further, since each slice loses 100 forecast points (the first 100 forecasts of each slice). I could append those rows to the end of each slice, but I'm looking for a more convenient way. I thought my function wasn't compact enough. – ast Oct 14 '22 at 22:41
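The "grab slices as needed" idea from the comments could be sketched as a generator that yields small batches of windows instead of materializing the full 3-D array (the name window_batches and the batch_size parameter are hypothetical). Because each batch is built directly from the original rows, continuity across chunk boundaries is preserved and no forecast points are lost:

```python
import numpy as np

def window_batches(values, look_back, forecast_horizon, batch_size=1024):
    """Yield (X, y) batches; values is 2-D with the target in column 0."""
    n = len(values) - look_back - forecast_horizon + 1
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # only batch_size windows are ever in memory at once
        X = np.stack([values[i:i + look_back] for i in range(start, stop)])
        y = np.stack([values[i + look_back:i + look_back + forecast_horizon, 0]
                      for i in range(start, stop)])
        yield X, y
```

Peak memory is then proportional to batch_size * look_back * n_features instead of n * look_back * n_features.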

0 Answers