
I have a pandas DataFrame with more than 5 million rows. I try to slice it (time series to supervised frame) using the function below, but it consumes too much RAM and crashes for long sequences (large look_back and forecast_horizon). As far as I know, lists are more compact than arrays, so I start with an empty list and convert it to an np.array at the end. Is there a way to do this more compactly? Assume sequence.shape is (5000000, 20), X.shape is (~5000000, look_back, 20), and y.shape is (~5000000, forecast_horizon).

import numpy as np

def split_sequence(sequence, look_back, forecast_horizon):
    X, y = list(), list()
    for i in range(len(sequence)):
        lag_end = i + look_back
        forecast_end = lag_end + forecast_horizon
        # stop once the forecast window would run past the end of the data
        if forecast_end > len(sequence):
            break
        # input window: all columns; target window: the first column only
        seq_x = sequence[i:lag_end]
        seq_y = sequence['first column'][lag_end:forecast_end]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)
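One memory-light alternative to the loop above (a sketch, not from the question): NumPy's sliding_window_view builds the window arrays as zero-copy views into the original buffer, so no per-window copies are made until the result is materialized. The function name split_sequence_views and the assumption that the target is column 0 of a 2-D values array (the question's 'first column' after .to_numpy()) are mine.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def split_sequence_views(values, look_back, forecast_horizon):
    """values: 2-D array of shape (n, n_features); column 0 is the target."""
    n = len(values) - look_back - forecast_horizon + 1
    # (n, look_back, n_features): zero-copy views over the rows
    X = sliding_window_view(values, look_back, axis=0)[:n].transpose(0, 2, 1)
    # targets: forecast_horizon-length windows of column 0, offset by look_back
    y = sliding_window_view(values[:, 0], forecast_horizon)[look_back:look_back + n]
    return X, y
```

X and y here are views, so feeding them to a framework that copies batch-by-batch keeps peak memory near the size of the original (5000000, 20) array rather than the full 3-D expansion.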
  • Use chunks when creating the dataframe, or split like this when the dataframe already exists: https://stackoverflow.com/questions/44729727/pandas-slice-large-dataframe-into-chunks – Vitaliy Korolyk Oct 14 '22 at 21:47
  • Doesn't that break the continuity of the time series? For example, the last 10 rows of the first part are not connected to the first rows of the second part if I want to predict 1 hour from the last 10 hours. – ast Oct 14 '22 at 21:57
  • This produces a huge 3D array - do you need it all at once? Could you grab slices as needed instead? – tdelaney Oct 14 '22 at 22:19
  • 10 slices (500k each) with look_back=100 crashes Colab Pro. I don't want to slice further, since each slice loses 100 forecast points (the first 100 forecasts of each slice). I could append those rows to the end of each slice, but I'm looking for a more convenient way. I thought my function wasn't compact enough. – ast Oct 14 '22 at 22:41
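The "grab slices as needed" idea from the comments could be sketched as a generator that yields small batches of windows instead of materializing the full 3-D array (the name window_batches and the batch_size parameter are hypothetical). Because each batch is built directly from the original rows, continuity across chunk boundaries is preserved and no forecast points are lost:

```python
import numpy as np

def window_batches(values, look_back, forecast_horizon, batch_size=1024):
    """Yield (X, y) batches; values is 2-D with the target in column 0."""
    n = len(values) - look_back - forecast_horizon + 1
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # only batch_size windows are ever in memory at once
        X = np.stack([values[i:i + look_back] for i in range(start, stop)])
        y = np.stack([values[i + look_back:i + look_back + forecast_horizon, 0]
                      for i in range(start, stop)])
        yield X, y
```

Peak memory is then proportional to batch_size * look_back * n_features instead of n * look_back * n_features.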

0 Answers