
My data looks like this:

timedelta64 1, temp1A, temp1B, temp1C, ...
timedelta64 2, temp2A, temp2B, temp2C, ...

The data is ingested into two numpy arrays:

  1. A series of timestamps, raw_timestamp, with dtype=[('datetime', '<M8[s]')]:

    '2009-01-01T18:41:00', 
    '2009-01-01T18:44:00',
    '2009-01-01T18:46:00', 
    '2009-01-01T18:47:00', 
    
  2. A table of sensor data, raw_sensor, with dtype=[('sensorA', '<u4'), ('sensorB', '<u4'), ('sensorC', '<u4'), ('sensorD', '<u4'), ('sensorE', '<u4'), ('sensorF', '<u4'), ('sensorG', '<u4'), ('sensorH', '<u4'), ('signal', '<u4')]:

     (755, 855, 755, 855, 743, 843, 743, 843, 2),
     (693, 793, 693, 793, 693, 793, 693, 793, 1),
     (755, 855, 755, 855, 743, 843, 743, 843, 2),
     (693, 793, 693, 793, 693, 793, 693, 793, 1),
    

I generate a new filled_timestamp array containing a timestamp for every time step: filled_timestamp = np.arange(np.datetime64(starttime), np.datetime64(endtime), np.timedelta64(interval))

Using idxs = np.in1d(filled_timestamp, raw_timestamp), I get a boolean mask over filled_timestamp marking the entries that match a timestamp in raw_timestamp, so I can assign the matching rows of filled_sensor from raw_sensor:

filled_sensor[idxs] = raw_sensor
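
For reference, here is a minimal self-contained sketch of the steps above. The start/end/interval values are placeholders chosen to match the sample data, and the timestamps are kept as plain datetime64 arrays rather than the one-field structured dtype:

import numpy as np

sensor_dtype = [('sensorA', '<u4'), ('sensorB', '<u4'), ('sensorC', '<u4'),
                ('sensorD', '<u4'), ('sensorE', '<u4'), ('sensorF', '<u4'),
                ('sensorG', '<u4'), ('sensorH', '<u4'), ('signal', '<u4')]

raw_timestamp = np.array(['2009-01-01T18:41:00', '2009-01-01T18:44:00',
                          '2009-01-01T18:46:00', '2009-01-01T18:47:00'],
                         dtype='datetime64[s]')
raw_sensor = np.array([(755, 855, 755, 855, 743, 843, 743, 843, 2),
                       (693, 793, 693, 793, 693, 793, 693, 793, 1),
                       (755, 855, 755, 855, 743, 843, 743, 843, 2),
                       (693, 793, 693, 793, 693, 793, 693, 793, 1)],
                      dtype=sensor_dtype)

# One timestamp per time step (60 s steps covering the sample range).
filled_timestamp = np.arange(np.datetime64('2009-01-01T18:41:00'),
                             np.datetime64('2009-01-01T18:48:00'),
                             np.timedelta64(60, 's'))

# All-zero sensor table, then copy in the rows that have real data.
filled_sensor = np.zeros(filled_timestamp.shape, dtype=sensor_dtype)
idxs = np.in1d(filled_timestamp, raw_timestamp)  # boolean mask over filled_timestamp
filled_sensor[idxs] = raw_sensor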

Q1. Is there a better / faster way to intersect these?

Now the filled arrays look like:

>>> filled_timestamp, filled_sensor # shown side-by-side for convenience 
    array([ 
      1 #  ('2009-01-01T18:41:00')  (755, 855, 755, 855, 743, 843, 743, 843, 2),
      2 #  ('2009-01-01T18:42:00')  (0, 0, 0, 0, 0, 0, 0, 0, 0),
      3 #  ('2009-01-01T18:43:00')  (0, 0, 0, 0, 0, 0, 0, 0, 0),
      4 #  ('2009-01-01T18:44:00')  (693, 793, 693, 793, 693, 793, 693, 793, 1),
      5 #  ('2009-01-01T18:45:00')  (0, 0, 0, 0, 0, 0, 0, 0, 0),
      6 #  ('2009-01-01T18:46:00')  (693, 793, 693, 793, 693, 793, 693, 793, 1),
      7 #  ('2009-01-01T18:47:00')  (693, 793, 693, 793, 693, 793, 693, 793, 1)
       ],
          dtype=[('datetime', '<M8[s]')], [('sensorA', '<u4'), ('sensorB', '<u4'), ('sensorC', '<u4'), ('sensorD', '<u4'), ('sensorE', '<u4'), ('sensorF', '<u4'), ('sensorG', '<u4'), ('sensorH', '<u4'), ('signal', '<u4')]

Q2. How can I fill the missing rows with values from the nearest previous non-empty row, except for columns 0, 3, and the last, which should stay 0 in the filled rows?

In my example above:

Rows 2 and 3 would take values from Row 1,

Row 5 would take values from Row 4

End result:

>>> filled_timestamp, filled_sensor # shown side-by-side for convenience 
    array([ 
      1 #  ('2009-01-01T18:41:00')  (755, 855, 755, 855, 743, 843, 743, 843, 2),
      2 #  ('2009-01-01T18:42:00')  (0, 855, 755, 0, 743, 843, 743, 843, 0),
      3 #  ('2009-01-01T18:43:00')  (0, 855, 755, 0, 743, 843, 743, 843, 0),
      4 #  ('2009-01-01T18:44:00')  (693, 793, 693, 793, 693, 793, 693, 793, 1),
      5 #  ('2009-01-01T18:45:00')  (0, 793, 693, 0, 693, 793, 693, 793, 0),
      6 #  ('2009-01-01T18:46:00')  (693, 793, 693, 793, 693, 793, 693, 793, 1),
      7 #  ('2009-01-01T18:47:00')  (693, 793, 693, 793, 693, 793, 693, 793, 1)
       ],
          dtype=[('datetime', '<M8[s]')], [('sensorA', '<u4'), ('sensorB', '<u4'), ('sensorC', '<u4'), ('sensorD', '<u4'), ('sensorE', '<u4'), ('sensorF', '<u4'), ('sensorG', '<u4'), ('sensorH', '<u4'), ('signal', '<u4')]

1 Answer


Intersection

Your best bet for a fast intersection is probably np.searchsorted. It will do a binary search in filled_timestamp for the elements of raw_timestamp:

idx = np.searchsorted(filled_timestamp, raw_timestamp)

This will only be accurate if every element of raw_timestamp actually occurs in filled_timestamp because np.searchsorted will return an insertion index regardless.
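
If you want to guard against that, a quick sanity check along these lines would do (just a verification step, not part of the intersection itself):

# idx should stay in bounds and point at exactly the raw timestamps
assert idx.max() < filled_timestamp.size
assert np.array_equal(filled_timestamp[idx], raw_timestamp)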

Non-vectorized Solution

You want to set a slice of filled_sensor from idx[n] to idx[n + 1] to the value of raw_sensor[n]:

from itertools import zip_longest
for start, end, row in zip_longest(idx, idx[1:], raw_sensor):
    filled_sensor[start:end] = row

I am using zip_longest here so that the last value coming from idx[1:] would be None, making the last slice be equivalent to filled_sensor[idx[-1]:] without requiring a special condition.
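
Note that this assumes filled_sensor has already been allocated with the right shape and dtype. If it hasn't, something like this would do it:

# all-zero structured array, one row per filled timestamp
filled_sensor = np.zeros(filled_timestamp.shape, dtype=raw_sensor.dtype)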

Vectorized Solution

You can create filled_sensor in one shot directly from raw_sensor if you know which indices to repeat from raw_sensor. You can get that information by applying np.cumsum to idx converted to a boolean array:

idx_mask = np.zeros(filled_timestamp.shape, bool)
idx_mask[idx] = True

Basically, we start with a boolean array of the same size as filled_timestamp that is True (1) wherever an entry from raw_timestamp matches. We can convert that to an index in raw_timestamp by counting how many total matches have occurred up to that point:

indexes = np.cumsum(idx_mask) - 1

Keep in mind that indexes is an array of integers, not booleans. It will increment whenever a new match is found. The - 1 converts from count to index because the first match will have a count of 1 instead of 0.

Now you can just make filled_sensor:

filled_sensor = raw_sensor[indexes]

The only possible caveat here is if filled_timestamp[0] does not match raw_timestamp[0]: in that case filled_sensor[0] would get raw_sensor[-1], since the cumulative count minus one is -1 before the first match. Given how you construct the times in filled based on raw, I am not sure that can ever even be an issue.
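
If you want to be defensive about that edge case anyway, a simple guard would be:

# indexes[0] would be -1 (and wrap around to raw_sensor[-1]) if this ever failed
assert idx_mask[0], 'filled_timestamp[0] has no match in raw_timestamp'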

Example

Here is an example of the Intersection and Vectorized Solution steps with the data that you show in your question.

We start with

raw_timestamp = np.array(['2009-01-01T18:41:00', 
                          '2009-01-01T18:44:00',
                          '2009-01-01T18:46:00',
                          '2009-01-01T18:47:00',], dtype='datetime64[s]')
raw_sensor = np.array([(755, 855, 755, 855, 743, 843, 743, 843, 2),
                       (693, 793, 693, 793, 693, 793, 693, 793, 1),
                       (755, 855, 755, 855, 743, 843, 743, 843, 2),
                       (693, 793, 693, 793, 693, 793, 693, 793, 1),],
                      dtype=[('sensorA', '<u4'), ('sensorB', '<u4'),
                             ('sensorC', '<u4'), ('sensorD', '<u4'),
                             ('sensorE', '<u4'), ('sensorF', '<u4'),
                             ('sensorG', '<u4'), ('sensorH', '<u4'),
                             ('signal', '<u4')])

We can generate filled_timestamp as

filled_timestamp = np.arange('2009-01-01T18:41:00',
                             '2009-01-01T18:48:00', 60, dtype='datetime64[s]')

Which yields, as expected:

array(['2009-01-01T18:41:00', '2009-01-01T18:42:00', '2009-01-01T18:43:00',
       '2009-01-01T18:44:00', '2009-01-01T18:45:00', '2009-01-01T18:46:00',
       '2009-01-01T18:47:00'], dtype='datetime64[s]')

I have taken a slight liberty with the dtypes by making timestamps plain arrays instead of structured arrays, but I think that should make no difference for your purpose.

  1. idx = np.searchsorted(filled_timestamp, raw_timestamp) yields

    idx = np.array([0, 3, 5, 6], dtype=int)
    

    This means that indices 0, 3, 5, 6 in filled_timestamp match values from raw_timestamp.

  2. idx_mask then becomes

    idx_mask = np.array([True, False, False, True, False, True, True], dtype=bool)
    

    This is basically synonymous with idx, except expanded to a boolean mask of the same size as filled_timestamp.

  3. Now the tricky part: indexes = np.cumsum(idx_mask) - 1:

    indexes = np.array([0, 0, 0, 1, 1, 2, 3], dtype=int)
    

    This can be interpreted as follows: filled_sensor[0:3] should come from raw_sensor[0]. filled_sensor[3:5] should come from raw_sensor[1], filled_sensor[5] should come from raw_sensor[2], filled_sensor[6] should come from raw_sensor[3].

  4. So now we use indexes to directly extract the correct elements of raw_sensor using filled_sensor = raw_sensor[indexes]:

    np.array([(755, 855, 755, 855, 743, 843, 743, 843, 2),
              (755, 855, 755, 855, 743, 843, 743, 843, 2),
              (755, 855, 755, 855, 743, 843, 743, 843, 2),
              (693, 793, 693, 793, 693, 793, 693, 793, 1),
              (693, 793, 693, 793, 693, 793, 693, 793, 1),
              (755, 855, 755, 855, 743, 843, 743, 843, 2),
              (693, 793, 693, 793, 693, 793, 693, 793, 1)], 
             dtype=[('sensorA', '<u4'), ('sensorB', '<u4'),
                    ('sensorC', '<u4'), ('sensorD', '<u4'),
                    ('sensorE', '<u4'), ('sensorF', '<u4'),
                    ('sensorG', '<u4'), ('sensorH', '<u4'),
                    ('signal', '<u4')])    
    
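As for the second part of your Q2 (zeroing columns 0, 3 and the last in the rows that were filled in rather than copied), those rows are exactly the ones where idx_mask is False, so something along these lines should work (field names taken from your dtype):

fill_rows = ~idx_mask  # rows that had no matching raw timestamp

# zero out columns 0, 3 and the last (sensorA, sensorD, signal) in those rows only
for name in ('sensorA', 'sensorD', 'signal'):
    filled_sensor[name][fill_rows] = 0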
  • There's a step missing to convert the masked T / F into an array of indices where T is the index and F is 0, then `np.cumsum` works properly. Still working out how to do this, as well as masking 0 in the filled fields. – encore2097 May 11 '17 at 21:39
  • Sorry, my update shows how to do the intersect faster. All the steps are there now. I suggest you print out all the intermediate steps to see what they do exactly. – Mad Physicist May 11 '17 at 21:47
  • Masking the 0 in the filled-in fields is possible by inverting idx to get all the indices of the false values, like `filled_sensor[fidx][0] = 0`, `filled_sensor[fidx][3] = 0`, `filled_sensor[fidx][1] = 0` – encore2097 May 11 '17 at 21:53
  • I'm assuming `idx` is numerical (not bool mask) indices in `filled_*`. `idx_mask` is the same size as `filled_*` and has `True` at all `idx` locations. `cumsum` goes along that array and copies the previous value if it encounters `False` (0) and increments by one when it encounters `True` (1). Python and numpy bools are also integers. That creates a list of the indices into `raw_*` that will create `filled_*`. There will be repetitions in this list whenever an entry in `filled_*` does not match one in `raw_*`, just like you are looking for. – Mad Physicist May 11 '17 at 21:56
  • You can just use `~idx_mask` for `fidx`. It's exactly the mask you want. – Mad Physicist May 11 '17 at 21:58
  • Yes, but if you go through the intermediate steps, you see you need to increment past all the false values. Ex: [T, F, F, T, T, F, F, F, T, T] (idx_mask, len 10) => np.cumsum(idx_mask) - 1 => [0, 0, 0, 1, 2, 2, 2, 2, 3, 4] when it should be [0, 0, 0, 3, 4, 4, 4, 4, 8, 9] – encore2097 May 11 '17 at 21:58
  • I'm not sure that I understand what you mean by "increment past all the false values." – Mad Physicist May 11 '17 at 22:00
  • No, you want what it's giving you. It's the index in `raw` that will go into the corresponding position in `filled`. It's not a trivial concept. – Mad Physicist May 11 '17 at 22:08
  • Put another way, `indexes` should be the same size as `filled`, but if `len(raw) == 5`, it can only contain 0-4. – Mad Physicist May 11 '17 at 22:10
  • Fixed: After `idx_mask[idx] = True`, execute: `idx_mask = np.where(idx_mask, 1, 0) * np.transpose(np.arange(0,idx.size))` – encore2097 May 11 '17 at 22:10
  • I don't think that's right. I'll edit my answer with examples as soon as I can. I'm on mobile now. – Mad Physicist May 11 '17 at 22:13
  • `np.where(idx_mask, 1, 0)` should just return `idx_mask` and transpose does nothing to a 1d array. – Mad Physicist May 11 '17 at 22:14
  • When you get to the next true index, you cannot increment by only 1. You need to increment 1 for each false index you passed as well. Does that make sense? Ah, thanks: then it's `idx_mask = idx_mask * np.arange(0, idx.size)` – encore2097 May 11 '17 at 22:15
  • Hmm, it's still not quite right; it breaks on consecutive trues. Solution is to cumsum on zeros, and leave the other indices alone. – encore2097 May 11 '17 at 22:24
  • correct solution: http://stackoverflow.com/questions/30488961/fill-zero-values-of-1d-numpy-array-with-last-non-zero-values ... alternatively use pandas as it has forward and backfill functions for time series data ;) – encore2097 May 11 '17 at 23:15
  • The second answer there is pretty much exactly what I'm suggesting. – Mad Physicist May 12 '17 at 03:38
  • @encore2097. I've added a concrete example to my answer based on your sample inputs. Unless I am completely misunderstanding what you are looking for, it works exactly as expected. – Mad Physicist May 12 '17 at 12:33
  • Ah, I see what you're getting at. Your method does the copy and the fill at the same time by copying from the original raw array, hence the indices incrementing by 1. I did a copy first and then fill from the same (filled) array because then I can apply the mask at the same time as the fills and do some other ops. Both will work; I do the ops separately for pipeline adaptability and readability. – encore2097 May 12 '17 at 15:13