How do I format my training data for an LSTM network using Keras when I have multiple varying length time-series data?

Question

I have two sets of training data of different lengths, which I'll call the x_train data. Their shapes are (70480, 7) and (69058, 7), respectively. Each column represents a different sensor reading.

I am trying to use an LSTM network on this data. Should I merge the data into one object? How would I do that?

I also have two sets of output data that result from the x_train data. Both have shape (315, 1). Would I use these as my y_train data?

So far I have read the data using pandas.read_csv() as follows:

import pandas as pd

c4_x_train = pd.read_csv('path')
c4_y_train = pd.read_csv('path')

c6_x_train = pd.read_csv('path')
c6_y_train = pd.read_csv('path')

Any clarification is appreciated. Thanks!

Dustin · Accepted Answer · 2020-08-11T20:59:38.277

Just a few points:

For fast file reading, consider a different format such as Parquet or Feather. Be careful about deprecation, though; for long-term storage, CSV is just fine.
pd.concat is your friend here. Use it like this:

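from pathlib import Path

import pandas as pd

# Use a Path object (not a plain string) so that .glob() is available
dir_path = Path(r"yourFolderPath")
# Recursively collect all CSV files; sorted for a reproducible order
files_list = sorted(dir_path.glob("**/*.csv"))
if files_list:
    source_dfs = [pd.read_csv(file_) for file_ in files_list]
    # Stack the frames row-wise and renumber the index from 0
    df = pd.concat(source_dfs, ignore_index=True)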

You can then use this df for your training.
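
For the LSTM part of the question: Keras recurrent layers expect a 3-D input of shape (samples, timesteps, features), so the 2-D sensor tables still have to be cut into fixed-length windows. The sketch below is one possible way to do that and is not part of the original answer. The helper name make_windows, the window length of 100 and the random stand-in arrays are all assumptions for illustration. Windowing each recording separately and concatenating the windows afterwards keeps any window from straddling the boundary between the two recordings.

```python
import numpy as np


def make_windows(series, window, step=1):
    """Cut a (rows, features) array into (samples, window, features)."""
    starts = range(0, len(series) - window + 1, step)
    return np.stack([series[s:s + window] for s in starts])


# Random stand-ins with the shapes from the question (hypothetical values)
rng = np.random.default_rng(0)
c4_x = rng.normal(size=(70480, 7))
c6_x = rng.normal(size=(69058, 7))

window = 100  # assumed window length; tune this for your data
x_windows = np.concatenate(
    [
        make_windows(c4_x, window, step=window),
        make_windows(c6_x, window, step=window),
    ]
)
print(x_windows.shape)  # (samples, timesteps, features)
```

How the targets line up with these windows depends on how the (315, 1) outputs relate to the sensor rows, which the question does not state.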

Now, regarding the training: as always, it depends. If the CSVs contain a datetime column and the measurements are continuous, go right ahead. If there are breaks between the measurements, you might run into problems. Depending on trend, seasonality and noise, you could interpolate the missing data. There are multiple approaches, such as the naive approach, filling the gaps with the mean, forecasting from the preceding values, and many more. There is no right or wrong; it really depends on what your data looks like.
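
A minimal sketch of a few gap-filling options in pandas, assuming a small made-up series with missing readings. The values and the column name are illustrative, and forward fill is used here as one reading of the naive approach:

```python
import numpy as np
import pandas as pd

# Made-up sensor readings with two missing values
df = pd.DataFrame({"value": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0]})

# Carry the last observed value forward
filled_ffill = df["value"].ffill()
# Replace gaps with the mean of the observed values
filled_mean = df["value"].fillna(df["value"].mean())
# Draw a straight line between the neighbouring observations
filled_linear = df["value"].interpolate(method="linear")

print(
    pd.DataFrame(
        {"ffill": filled_ffill, "mean": filled_mean, "linear": filled_linear}
    )
)
```

Which of these is acceptable depends on how long the gaps are and how quickly the signal changes.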

EDIT: Comments don't like code blocks, so here is an example of how it works:

#df1:
time    value
    1     1.4
    2     2.5

#df2:
time    value
    3     1.1
    4     1.0

# these are glued together by df = pd.concat([df1, df2], ignore_index=True) to become:
time    value
   1      1.4
   2      2.5
   3      1.1
   4      1.0
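
The same example as runnable code, assuming the two small frames shown above. The second call is an optional variation that is not in the original answer: passing keys to pd.concat labels each row with the frame it came from (the labels c4 and c6 are borrowed from the question's variable names), which can help later when locating the seam between two recordings.

```python
import pandas as pd

df1 = pd.DataFrame({"time": [1, 2], "value": [1.4, 2.5]})
df2 = pd.DataFrame({"time": [3, 4], "value": [1.1, 1.0]})

# Plain stacking; the index is renumbered 0..3
df = pd.concat([df1, df2], ignore_index=True)
print(df)

# Variation: keep a column naming the originating frame
labelled = (
    pd.concat([df1, df2], keys=["c4", "c6"], names=["source", "row"])
    .reset_index(level="source")
    .reset_index(drop=True)
)
print(labelled)
```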

Hey @Dustin, thanks for the reply. So, the last line of the code you provided will essentially stack the two sets of data, is that right? — Quyed, Aug 11 '20 at 20:52
Yes. It appends one data set to the other, as if you had taped them together. See the example in the post. — Dustin, Aug 11 '20 at 20:55
