I retrieve a large number (>100,000) of time series from a database. Each time series is a 2D list with 5 to 10 entries, and each entry holds 8 values:
single_time_series = [
[ 43, 1219, 1065, 929, 1233, 2604, 3101, 2196],
[ 70, 1148, 1041, 785, 1344, 2944, 3519, 3506],
[ 80, 1148, 976, 710, 1261, 2822, 3335, 3247],
[ 103, 1236, 1090, 762, 1024, 2975, 3777, 3093],
[ 120, 883, 937, 493, 1221, 4119, 5241, 5133],
[ 143, 1110, 1089, 887, 1420, 2471, 2905, 2845]
] # a time series with 6 entries, each entry represents one day
For further processing I want all of these individual time series together in one 3D NumPy array. But since the length of each series varies between 5 and 10 entries, I need to pad every series shorter than 10 with zero-filled rows:
[
[ 43, 1219, 1065, 929, 1233, 2604, 3101, 2196],
[ 70, 1148, 1041, 785, 1344, 2944, 3519, 3506],
[ 80, 1148, 976, 710, 1261, 2822, 3335, 3247],
[ 103, 1236, 1090, 762, 1024, 2975, 3777, 3093],
[ 120, 883, 937, 493, 1221, 4119, 5241, 5133],
[ 143, 1110, 1089, 887, 1420, 2471, 2905, 2845],
[ 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0]
]
Currently I achieve this by iterating over each time series coming from the database, padding it, and appending it to the final NumPy array:
import numpy as np

MAX_SEQUENCE_LENGTH = 10

all_time_series = ...  # retrieved from the db

all_padded_time_series = np.array([], dtype=np.int64).reshape(0, MAX_SEQUENCE_LENGTH, 8)

for single_time_series in all_time_series:
    single_time_series = np.array(single_time_series, dtype=np.int64)
    length_diff = MAX_SEQUENCE_LENGTH - single_time_series.shape[0]
    if length_diff > 0:
        # pad the end with zero rows so the series has exactly MAX_SEQUENCE_LENGTH entries
        single_time_series = np.pad(single_time_series, ((0, length_diff), (0, 0)), mode='constant')
    # np.append returns a new copy of the whole accumulated array on every call
    all_padded_time_series = np.append(all_padded_time_series, [single_time_series], axis=0)
While the database request executes in a matter of seconds, the padding and appending take forever: the script needs ~45 minutes for ~100,000 time series on my iMac. I suspect the repeated np.append is the main culprit, since it copies the entire accumulated array on every call.
Since the database keeps growing, I will need to analyse even more data in the near future, so I'm looking for a faster way to convert the lists coming from the db into a NumPy array. I'm pretty sure there is a more efficient way to do this. Any ideas?
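One idea I've been sketching (the names pad_all and NUM_FEATURES are just ones I made up, and I don't know if this is the fastest option) is to preallocate a single zero-filled result array once and copy each series into its slot, so nothing gets reallocated per iteration:

import numpy as np

MAX_SEQUENCE_LENGTH = 10
NUM_FEATURES = 8

def pad_all(all_time_series):
    # allocate the zero-padded result once, up front
    result = np.zeros(
        (len(all_time_series), MAX_SEQUENCE_LENGTH, NUM_FEATURES),
        dtype=np.int64,
    )
    for i, single_time_series in enumerate(all_time_series):
        # copy the series into the front of its slot; the remaining rows stay zero
        result[i, :len(single_time_series)] = single_time_series
    return result

Would something along these lines be the idiomatic approach, or is there a vectorised way I'm missing?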