2

I have a time series dataset as a NumPy array of shape:
(batch_size, observations, sensor_number)

So for example:
(3, 10, 2): 3 batches of two sensors, each having time series data of length 10.

On that NumPy array I now want to reshape the length of the time series as well as specify an overlap factor.

So here is an example, trying to change the original dataset from above: the new period length of each sample should be 5, and I want the samples to overlap by 0.4 (40%). For simplicity, the time series values run from 1 to 10. The original dataset of shape (3, 10, 2) looks like:

array([[[ 1,  2],[ 3,  4],[ 5,  6],[ 7,  8],[ 9, 10],
    [ 1,  2],[ 3,  4],[ 5,  6],[ 7,  8],[ 9, 10]],
   [[ 1,  2],[ 3,  4],[ 5,  6],[ 7,  8],[ 9, 10],
    [ 1,  2],[ 3,  4],[ 5,  6],[ 7,  8],[ 9, 10]],
   [[ 1,  2],[ 3,  4],[ 5,  6],[ 7,  8],[ 9, 10],
    [ 1,  2],[ 3,  4],[ 5,  6],[ 7,  8],[ 9, 10]]])

I would expect the new, reshaped NumPy array to have the shape (6, 5, 2), with each chunk windowed as described below.
Overlapping: for the new target length of 5, a 40% overlap means that 2 elements of the previous sample reach into the next sample.
Reshaping with only the valid-length time series elements therefore means, in the case above, doubling the original number of samples by slicing each original time series into shorter series that overlap between samples.
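To make the arithmetic concrete, here is a quick sanity check of the overlap calculation in plain Python (a sketch; the variable names are just illustrative):

size = 5                                  # new period length
overlap = 0.4                             # 40% overlap
step = size - int(round(overlap * size))  # 5 - 2 = 3 elements between window starts
window_starts = list(range(0, 10 - size + 1, step))
print(window_starts)                      # [0, 3] -> values 1..5 and 4..8 per series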
I tried to reshape it by iterating through all elements in a for loop, but it takes so much time that I think there must be a more performant way, e.g. vectorizing the operation.

Can anyone please help and give hints on how to do that? Thanks in advance.

deniz
  • Could you please provide a numeric example in which the earlier shape is perhaps `(3,20,2)` and the later shape is `(6,10,2)`? I suspect I'm failing to understand the concept of **overlap** that you're talking about. Such an example would then help. – fountainhead Oct 31 '20 at 20:00
  • With overlapping I mean: if the original data shape is e.g. ```(3, 20, 2)```, so 3 samples with 20 as period length and 2 sensors, then with an overlap of 0.5 (50%) and a new period length of e.g. 10, the original time series would be chunked into elements of length 10 with a step size of 5 elements. For one time series of length 20 (for simplicity, values from 1...20) I would expect the chunks to look like:
    #1: 1...10 , #2: 5...15, #3: 10...20. So I increase the original batch size of 3 (in (3,20,2)) to 6 and the new shape of my dataset will be (6,10,2).
    – deniz Oct 31 '20 at 21:08
  • Continuing with this smaller-sized example, `observations` changes from `20` to `10` (a ratio of `10:20`), and simultaneously, an `overlap` of `0.5` is introduced. Is it coincidental that the change in `observations` is `10:20` and the `overlap` is also `0.5`? In other words, is it possible to have the new shape of `(6, 10, 2)`, but a different `overlap`, say `0.25`, instead of `0.5`? – fountainhead Nov 01 '20 at 01:14
  • Please take your array as `A = np.arange(3*10*2).reshape(3,10,2)`. Then, for the new shape of `(6,5,2)`, and `overlap=0.2`, please edit your question to show us what will be the **exact** output of `print(new_A)`. I believe you will then discover for yourself that there's still some ambiguity in the problem description. – fountainhead Nov 01 '20 at 07:17
  • @fountainhead I edited my question above to make clearer what exactly the problem is. Thanks in advance. – deniz Nov 01 '20 at 14:30

3 Answers

0

Let's go with 6 observations per batch, with 3 batches and 2 sensors.

import numpy as np

data = np.arange(3*6).reshape((3,6))[:,:, None] * ((1,1))  # broadcast to a 2-sensor axis -> (3, 6, 2)
size = 3  # new window length
step = 2  # distance between consecutive window starts
print("data:\n", data)

Output:

data:
 [[[ 0  0]
  [ 1  1]
  [ 2  2]
  [ 3  3]
  [ 4  4]
  [ 5  5]]

 [[ 6  6]
  [ 7  7]
  [ 8  8]
  [ 9  9]
  [10 10]
  [11 11]]

 [[12 12]
  [13 13]
  [14 14]
  [15 15]
  [16 16]
  [17 17]]]

Here comes the elegant way, I guess. It uses a list comprehension, which is normally faster than a plain for loop, especially here since I don't loop through the data itself but only through the first index of each window.

# Range until the last index where it is possible to start a window
last_start = data.shape[1] - size + 1
period_starts = range(0, last_start, step)

reshaped_data = np.concatenate(
                  [data[:,k:k+size] for k in period_starts],
                  axis=1).reshape(-1, size, data.shape[2])
print('Reshaped data:\n', reshaped_data)

Output:

Reshaped data:
 [[[ 0  0]
  [ 1  1]
  [ 2  2]]

 [[ 2  2]
  [ 3  3]
  [ 4  4]]

 [[ 6  6]
  [ 7  7]
  [ 8  8]]

 [[ 8  8]
  [ 9  9]
  [10 10]]

 [[12 12]
  [13 13]
  [14 14]]

 [[14 14]
  [15 15]
  [16 16]]]

It could probably be slightly faster to precompute the intervals instead of calculating them inside the list comprehension, although at this point you probably don't need further enhancement:

period_starts = np.arange(0, last_start, step)  # as an array, so `period_starts + size` broadcasts
period_intervals = np.array((period_starts, period_starts + size)).T
reshaped_data = np.concatenate(
                  [data[:,i:j] for i,j in period_intervals],
                  axis=1).reshape(-1, size, data.shape[2])

Alternatively, you could use index arrays, but in this case it feels like a lot more complexity and code for doing the same thing (see Selecting multiple slices from a numpy array at once on Stack Overflow):

last_start = data.shape[1] - size + 1
period_starts = np.arange(0, last_start, step)
period_intervals = np.array((period_starts, period_starts + size)).T

# Create indexes for one observation axis.
# You could use map if you truly want to avoid for loops.
period_indexes = np.array([np.arange(i, j) for i, j in period_intervals])
# repeat as much as needed
observation_indexes = np.tile(period_indexes, (data.shape[0],1))
print("\nIndexes of observations:\n", observation_indexes)

# Create batch indexes
batches = (np.arange(data.shape[0])[:, None]
           * np.ones(period_indexes.shape[-1], dtype=np.int8))
batch_indexes = np.repeat(batches, len(period_starts), axis=0)
print("\nIndexes for batch:\n", batch_indexes)

indexes = (batch_indexes, observation_indexes)
reshaped_data = data[indexes]
print("\nReshaped data:\n", reshaped_data)

Output:

Indexes of observations:
 [[0 1 2]
 [2 3 4]
 [0 1 2]
 [2 3 4]
 [0 1 2]
 [2 3 4]]

Indexes for batch:
 [[0 0 0]
 [0 0 0]
 [1 1 1]
 [1 1 1]
 [2 2 2]
 [2 2 2]]

Reshaped data:
 [[[ 0  0]
  [ 1  1]
  [ 2  2]]

 [[ 2  2]
  [ 3  3]
  [ 4  4]]

 [[ 6  6]
  [ 7  7]
  [ 8  8]]

 [[ 8  8]
  [ 9  9]
  [10 10]]

 [[12 12]
  [13 13]
  [14 14]]

 [[14 14]
  [15 15]
  [16 16]]]

Sorry to include it; it was simply my first attempt, before I realized that my previous (erroneous) approach could be reused in a much simpler and more elegant way.
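For completeness, newer NumPy versions ship a dedicated helper for exactly this kind of windowing. A minimal sketch, assuming NumPy >= 1.20 and reusing data, size, and step from above:

from numpy.lib.stride_tricks import sliding_window_view

# Build all windows along the observation axis as a zero-copy view,
# then keep every `step`-th window and move the window axis in front
# of the sensor axis.
windows = sliding_window_view(data, size, axis=1)    # (3, 4, 2, 3)
reshaped_data = (windows[:, ::step]                  # keep starts 0 and 2
                 .transpose(0, 1, 3, 2)              # (batch, window, size, sensor)
                 .reshape(-1, size, data.shape[2]))  # (6, 3, 2)

This yields the same (6, 3, 2) result as the concatenation approaches above.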

  • It doesn't give valid results. The new shape is valid, but the elements are not chunked the right/valid way. Please see my post of my current version using the nested for loops for a numerical example. – deniz Nov 01 '20 at 16:31
  • Edited so it works your way and uses the names you provided in your own answer, and found another way too. – O.Laprevote Nov 01 '20 at 22:44
  • Thanks, I tested the code and it runs in less time than mine. – deniz Nov 02 '20 at 07:18
0
import numpy as np

size = 5  # length of each new time series chunk
step = 3  # distance between chunk starts

new_dataset = None

for batch in range(data_array.shape[0]):
    sample_data = data_array[batch].T  # (sensors, observations)
    sensor_buffer = None
    for sensor in range(sample_data.shape[0]):
        time_series = sample_data[sensor]
        splitted_timeseries = [time_series[i : i + size] for i in range(0, len(time_series), step)]
        # keep only the windows of full length
        valid_splitts = np.asarray([splitted_ts for splitted_ts in splitted_timeseries if len(splitted_ts) == size])
        valid_splitts = valid_splitts.reshape(valid_splitts.shape[0],
                                              size,
                                              1)

        if sensor_buffer is None:
            sensor_buffer = valid_splitts.copy()
        else:
            sensor_buffer = np.concatenate((sensor_buffer, valid_splitts), axis=-1)

    if new_dataset is None:
        new_dataset = sensor_buffer.copy()
    else:
        new_dataset = np.concatenate((new_dataset, sensor_buffer), axis=0)

new_dataset.shape

This is my current version, which gives correct results. size specifies the length of the new time series and step the step size between chunk starts (another way of defining the overlap).

For a dataset array like this:

array([[[ 1.,  1.],
        [ 2.,  2.],
        [ 3.,  3.],
        [ 4.,  4.],
        [ 5.,  5.],
        [ 6.,  6.],
        [ 7.,  7.],
        [ 8.,  8.],
        [ 9.,  9.],
        [10., 10.]],

       [[ 1.,  1.],
        [ 2.,  2.],
        [ 3.,  3.],
        [ 4.,  4.],
        [ 5.,  5.],
        [ 6.,  6.],
        [ 7.,  7.],
        [ 8.,  8.],
        [ 9.,  9.],
        [10., 10.]],

       [[ 1.,  1.],
        [ 2.,  2.],
        [ 3.,  3.],
        [ 4.,  4.],
        [ 5.,  5.],
        [ 6.,  6.],
        [ 7.,  7.],
        [ 8.,  8.],
        [ 9.,  9.],
        [10., 10.]]])

It gives me the correct shape of (6,5,2) with valid time series chunks:

array([[[1., 1.],
        [2., 2.],
        [3., 3.],
        [4., 4.],
        [5., 5.]],

       [[4., 4.],
        [5., 5.],
        [6., 6.],
        [7., 7.],
        [8., 8.]],

       [[1., 1.],
        [2., 2.],
        [3., 3.],
        [4., 4.],
        [5., 5.]],

       [[4., 4.],
        [5., 5.],
        [6., 6.],
        [7., 7.],
        [8., 8.]],

       [[1., 1.],
        [2., 2.],
        [3., 3.],
        [4., 4.],
        [5., 5.]],

       [[4., 4.],
        [5., 5.],
        [6., 6.],
        [7., 7.],
        [8., 8.]]])
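Incidentally, the repeated np.concatenate inside the loop copies the accumulated array on every iteration. Below is a minimal sketch of the same windowing that collects the chunks in a Python list and stacks once at the end (a hypothetical rewrite, not benchmarked here):

import numpy as np

def window_dataset(data_array, size, step):
    chunks = []
    for batch in data_array:                  # batch: (observations, sensors)
        last_start = batch.shape[0] - size + 1
        for i in range(0, last_start, step):  # full-length windows only
            chunks.append(batch[i:i + size])
    return np.stack(chunks)                   # (n_chunks, size, sensors)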
deniz
0

These 3 lines should do the trick:

obs_starts = list(range(0,
                        1+old_obs_len - new_obs_len,
                        int(round((1-overlap)*new_obs_len))))
obs_indices = [list(range(x, x+new_obs_len)) for x in obs_starts]
new_A = A[:, obs_indices, :].reshape(-1, new_obs_len, num_sens)

Here:

A is your array, with shape (num_batches, old_obs_len, num_sens)

new_A is the new array, with shape (-1, new_obs_len, num_sens)

overlap is the overlap ratio.

Note that there is no repeated concatenation or tiling of arrays, so there's minimal copying of array data under the hood. The first two lines construct a nested list of indices; the third line uses this index 'array' to index A and reshape the result.
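For instance, plugging in the Test Case 1 values below gives (a quick check in plain Python):

old_obs_len, new_obs_len, overlap = 20, 10, 0.5
step = int(round((1 - overlap) * new_obs_len))              # 5
print(list(range(0, 1 + old_obs_len - new_obs_len, step)))  # [0, 5, 10]

So each batch contributes three windows, starting at observations 0, 5, and 10.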

Setting up demo data:

import numpy as np

num_batches = 3                              # Initial value
old_obs_len = 20                             # Initial value
num_sens = 2

                                             # Demo data
A = np.arange(num_batches*old_obs_len*num_sens).reshape(num_batches,
                                                        old_obs_len,
                                                        num_sens)
print (A.shape)
print (A)

Output of data setup:

(3, 20, 2)
[[[  0   1]
  [  2   3]
  [  4   5]
  [  6   7]
  [  8   9]
  [ 10  11]
  [ 12  13]
  [ 14  15]
  [ 16  17]
  [ 18  19]
  [ 20  21]
  [ 22  23]
  [ 24  25]
  [ 26  27]
  [ 28  29]
  [ 30  31]
  [ 32  33]
  [ 34  35]
  [ 36  37]
  [ 38  39]]

 [[ 40  41]
  [ 42  43]
  [ 44  45]
  [ 46  47]
  [ 48  49]
  [ 50  51]
  [ 52  53]
  [ 54  55]
  [ 56  57]
  [ 58  59]
  [ 60  61]
  [ 62  63]
  [ 64  65]
  [ 66  67]
  [ 68  69]
  [ 70  71]
  [ 72  73]
  [ 74  75]
  [ 76  77]
  [ 78  79]]

 [[ 80  81]
  [ 82  83]
  [ 84  85]
  [ 86  87]
  [ 88  89]
  [ 90  91]
  [ 92  93]
  [ 94  95]
  [ 96  97]
  [ 98  99]
  [100 101]
  [102 103]
  [104 105]
  [106 107]
  [108 109]
  [110 111]
  [112 113]
  [114 115]
  [116 117]
  [118 119]]]

Test Case 1:

overlap = 0.5
new_obs_len = 10

Output for Test Case 1 (print (new_A)):

[[[  0   1]
  [  2   3]
  [  4   5]
  [  6   7]
  [  8   9]
  [ 10  11]
  [ 12  13]
  [ 14  15]
  [ 16  17]
  [ 18  19]]

 [[ 10  11]
  [ 12  13]
  [ 14  15]
  [ 16  17]
  [ 18  19]
  [ 20  21]
  [ 22  23]
  [ 24  25]
  [ 26  27]
  [ 28  29]]

 [[ 20  21]
  [ 22  23]
  [ 24  25]
  [ 26  27]
  [ 28  29]
  [ 30  31]
  [ 32  33]
  [ 34  35]
  [ 36  37]
  [ 38  39]]

 [[ 40  41]
  [ 42  43]
  [ 44  45]
  [ 46  47]
  [ 48  49]
  [ 50  51]
  [ 52  53]
  [ 54  55]
  [ 56  57]
  [ 58  59]]

 [[ 50  51]
  [ 52  53]
  [ 54  55]
  [ 56  57]
  [ 58  59]
  [ 60  61]
  [ 62  63]
  [ 64  65]
  [ 66  67]
  [ 68  69]]

 [[ 60  61]
  [ 62  63]
  [ 64  65]
  [ 66  67]
  [ 68  69]
  [ 70  71]
  [ 72  73]
  [ 74  75]
  [ 76  77]
  [ 78  79]]

 [[ 80  81]
  [ 82  83]
  [ 84  85]
  [ 86  87]
  [ 88  89]
  [ 90  91]
  [ 92  93]
  [ 94  95]
  [ 96  97]
  [ 98  99]]

 [[ 90  91]
  [ 92  93]
  [ 94  95]
  [ 96  97]
  [ 98  99]
  [100 101]
  [102 103]
  [104 105]
  [106 107]
  [108 109]]

 [[100 101]
  [102 103]
  [104 105]
  [106 107]
  [108 109]
  [110 111]
  [112 113]
  [114 115]
  [116 117]
  [118 119]]]
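As a quick shape check for this case (a sketch using the variables above):

print(np.asarray(obs_indices).shape)  # (3, 10): 3 window starts, each of length 10
print(A[:, obs_indices, :].shape)     # (3, 3, 10, 2): batch x window x length x sensor
print(new_A.shape)                    # (9, 10, 2): 3 batches * 3 windows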

Test Case 2:

overlap = 0.2
new_obs_len = 10

Output for Test Case 2 (print (new_A)):

[[[  0   1]
  [  2   3]
  [  4   5]
  [  6   7]
  [  8   9]
  [ 10  11]
  [ 12  13]
  [ 14  15]
  [ 16  17]
  [ 18  19]]

 [[ 16  17]
  [ 18  19]
  [ 20  21]
  [ 22  23]
  [ 24  25]
  [ 26  27]
  [ 28  29]
  [ 30  31]
  [ 32  33]
  [ 34  35]]

 [[ 40  41]
  [ 42  43]
  [ 44  45]
  [ 46  47]
  [ 48  49]
  [ 50  51]
  [ 52  53]
  [ 54  55]
  [ 56  57]
  [ 58  59]]

 [[ 56  57]
  [ 58  59]
  [ 60  61]
  [ 62  63]
  [ 64  65]
  [ 66  67]
  [ 68  69]
  [ 70  71]
  [ 72  73]
  [ 74  75]]

 [[ 80  81]
  [ 82  83]
  [ 84  85]
  [ 86  87]
  [ 88  89]
  [ 90  91]
  [ 92  93]
  [ 94  95]
  [ 96  97]
  [ 98  99]]

 [[ 96  97]
  [ 98  99]
  [100 101]
  [102 103]
  [104 105]
  [106 107]
  [108 109]
  [110 111]
  [112 113]
  [114 115]]]

Test Case 3:

overlap = 0.8
new_obs_len = 10

Output for Test Case 3 (print (new_A[16:18,:,:])):

[[[ 96  97]
  [ 98  99]
  [100 101]
  [102 103]
  [104 105]
  [106 107]
  [108 109]
  [110 111]
  [112 113]
  [114 115]]

 [[100 101]
  [102 103]
  [104 105]
  [106 107]
  [108 109]
  [110 111]
  [112 113]
  [114 115]
  [116 117]
  [118 119]]]
fountainhead
  • I tested the code above, but it seems to work only with overlaps < 0.8 and will fail for overlaps >= 0.8 – deniz Nov 02 '20 at 07:00
  • @deniz - Corrected the bug -- instead of `int(round(blah-blah))`, I had `round(int(blah-blah))`. After this correction, I was able to successfully verify all 3 test cases again (`overlap=0.2`, `overlap=0.5`, and `overlap=0.8`). Apologies and thanks. – fountainhead Nov 02 '20 at 07:25
  • @deniz - Found and fixed one more bug -- in all 3 test cases (`0.2`, `0.5`, and `0.8`), it was missing some data towards the end. Now, it produces more data. – fountainhead Nov 02 '20 at 08:05