I've read some data in from CSV using genfromtxt and hstack to concatenate the results, which gives a 1-D structured array of shape (5413260,). (Loading takes about 17 min; the saved .npy file is ~1 GB.)
The data is in the format:
datetime64 1, temp1A, temp1B, temp1C, ...
datetime64 2, temp2A, temp2B, temp2C, ...
>>> data[1:3]
array([ ('2009-01-01T18:41:00', 755, 855, 755, 855, 743, 843, 743, 843, 2),
('2009-01-01T18:43:45', 693, 793, 693, 793, 693, 793, 693, 793, 1)],
dtype=[('datetime', '<M8[s]'), ('sensorA', '<u4'), ('sensorB', '<u4'), ('sensorC', '<u4'), ('sensorD', '<u4'), ('sensorE', '<u4'), ('sensorF', '<u4'), ('sensorG', '<u4'), ('sensorH', '<u4'), ('signal', '<u4')])
I'd like to do deltas on the temps (differences between adjacent sensor columns within each row; toy example below):
datetime64 1, temp1A - temp1B, temp1B - temp1C, ...
and fills for gaps in the timestamps:
if datetime64 2 - datetime64 1 <= the sample rate, the rows are left alone; otherwise the gap is filled with stub rows carrying the appropriate timestamps:
datetime64 1 + shift, 0, 0, 0, CONSTANT, ...
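To make both transforms concrete, here is a tiny hand-made example of input vs. desired output (2 sensors instead of 8, a 15 s sample rate, and CONSTANT = 9 are all made-up placeholder values):

    import numpy as np

    # Toy input: (timestamp, 2 sensors, signal); nominal sample rate 15 s,
    # so the 45 s step between the two rows means 2 samples are missing.
    toy = np.array(
        [('2009-01-01T18:41:00', 755, 855, 2),
         ('2009-01-01T18:41:45', 693, 793, 1)],
        dtype=[('datetime', 'M8[s]'), ('sensorA', 'u4'),
               ('sensorB', 'u4'), ('signal', 'u4')])

    # Desired output: per-row delta sensorA - sensorB, plus stub rows in the gap.
    want = np.array(
        [('2009-01-01T18:41:00', -100, 2),   # 755 - 855
         ('2009-01-01T18:41:15',    0, 9),   # fill stub, signal = CONSTANT
         ('2009-01-01T18:41:30',    0, 9),   # fill stub
         ('2009-01-01T18:41:45', -100, 1)],  # 693 - 793
        dtype=[('datetime', 'M8[s]'), ('deltaAB', 'i4'), ('signal', 'u4')])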
I'm currently:
- iterating through numpy arrayA pairwise (arrayA[i], arrayA[i+1])
- calculating the deltas for row i and appending them to numpy arrayB
- calculating the time difference between row i+1 and row i
- iteratively adding the shift to the timestamp and appending zero/constant stub rows to numpy arrayB until the gap is covered (rough code below)
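Simplified, and with placeholder values for the sample rate, shift, and constant, the loop is roughly equivalent to this; the repeated np.append means the whole of arrayB gets copied on every single append, which I suspect is a large part of the cost:

    import numpy as np

    SAMPLE_RATE = np.timedelta64(15, 's')    # placeholders; real values differ
    SHIFT       = np.timedelta64(15, 's')
    CONSTANT    = 9

    sensor_names = data.dtype.names[1:-1]    # the 8 sensor fields
    out_dtype = ([('datetime', 'M8[s]')]
                 + [('d%d' % i, 'i8') for i in range(7)]
                 + [('signal', 'i8')])
    arrayB = np.empty(0, dtype=out_dtype)

    for i in range(len(data) - 1):
        t    = data['datetime'][i]
        vals = np.array([data[n][i] for n in sensor_names], dtype=np.int64)
        row  = np.array([(t, *(vals[:-1] - vals[1:]), data['signal'][i])],
                        dtype=out_dtype)
        arrayB = np.append(arrayB, row)      # copies all of arrayB each time

        gap = data['datetime'][i + 1] - t
        while gap > SAMPLE_RATE:             # one stub per missing sample period
            t, gap = t + SHIFT, gap - SHIFT
            stub   = np.array([(t, 0, 0, 0, 0, 0, 0, 0, CONSTANT)], dtype=out_dtype)
            arrayB = np.append(arrayB, stub)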
This is highly inefficient: it's taken over 12 hours so far, and I expect it will take 100+ days to complete.
What's the vectorized approach?
I'm thinking of a vector op to calculate the deltas first, but I'm not sure how to quickly batch and insert the fills for the missing timestamps.
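Concretely, the vector op I have in mind for the deltas (and for spotting the gaps) is something like the sketch below; structured_to_unstructured needs numpy >= 1.16, the sample rate is a placeholder, and casting to int64 avoids unsigned wrap-around on negative deltas. It's the last step, building the stub rows and interleaving them at the right positions, that I can't see how to do without a Python loop:

    import numpy as np
    from numpy.lib.recfunctions import structured_to_unstructured

    SAMPLE_RATE = np.timedelta64(15, 's')            # placeholder

    # Deltas: pull the 8 sensor fields out as a plain (N, 8) array, then
    # difference adjacent columns within each row.
    temps  = structured_to_unstructured(
                 data[list(data.dtype.names[1:-1])]).astype(np.int64)
    deltas = temps[:, :-1] - temps[:, 1:]            # tempA - tempB, tempB - tempC, ...

    # Gaps: number of stub rows needed after each row
    # (assuming every gap is a whole multiple of the sample rate).
    gaps    = np.diff(data['datetime'])
    n_fills = np.clip(gaps // SAMPLE_RATE - 1, 0, None)

    # ...but how do I build the stub rows and interleave them with `deltas`
    # in one batched operation?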
Also, is it faster to reshape -> diff -> fill or reshape -> fill -> diff?
Aside: this is pre-processing data for machine learning with TensorFlow; is there a better tool than numpy for this?