
I've read some data in from csv using genfromtxt and hstack to concatenate the results, which gives a shape of (5413260,). Loading takes about 17 minutes and the saved .npy file is ~1 GB.

The data is in the format:

timedelta64 1, temp1A, temp1B, temp1C, ...
timedelta64 2, temp2A, temp2B, temp2C, ...


>>> data[1:3]
array([ ('2009-01-01T18:41:00', 755, 855, 755, 855, 743, 843, 743, 843, 2),
       ('2009-01-01T18:43:45', 693, 793, 693, 793, 693, 793, 693, 793, 1)],
      dtype=[('datetime', '<M8[s]'), ('sensorA', '<u4'), ('sensorB', '<u4'), ('sensorC', '<u4'), ('sensorD', '<u4'), ('sensorE', '<u4'), ('sensorF', '<u4'), ('sensorG', '<u4'), ('sensorH', '<u4'), ('signal', '<u4')])

I'd like to do deltas on the temps:

timedelta64 1, temp1A - temp1B, temp1B - temp1C, ...

and fills:

if timedelta64 2 - timedelta64 1 <= the sample rate, keep the row as-is; otherwise fill the gap with stub rows carrying the appropriate timestamps:

timedelta64 1 + shift, 0, 0, 0, CONSTANT, ...

I'm currently:

  1. iterating through numpy arrayA as pairs (arrayA[i], arrayA[i+1])
  2. calculating the deltas for row_i and appending them to numpy arrayB
  3. calculating the time difference between row_i+1 and row_i
  4. iteratively adding the shift to the timestamp, filling with zeros/constant, and appending to numpy arrayB

This is highly inefficient - it's taken over 12 hours so far and I expect it will take ~100+ days to complete.

What's the vectorized approach?

I'm thinking a vector op to calculate the deltas first, but I'm not sure how to quickly batch and insert the fills for the missing timestamps.

Also, is it faster to reshape -> diff -> fill or reshape -> fill -> diff?

Aside: this is pre-processing data for machine learning with TensorFlow; is there a better tool than numpy?


1 Answer


Since I'm using genfromtxt and heterogeneous dtypes, vectorized operations are accomplished through named columns: to slice columns in a tuple present in a numpy array
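
For example, the column-to-column deltas can be done with one whole-array subtraction per pair of named fields (a minimal sketch; the field names come from the dtype above, and the cast to int64 is only there to avoid unsigned underflow):

    import numpy as np

    # adjacent temperature columns, taken from the structured dtype
    sensors = ['sensorA', 'sensorB', 'sensorC', 'sensorD',
               'sensorE', 'sensorF', 'sensorG', 'sensorH']

    # one vectorized subtraction per column pair instead of one per row
    deltas = {left: data[left].astype(np.int64) - data[right].astype(np.int64)
              for left, right in zip(sensors[:-1], sensors[1:])}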

Generating a range of numpy.datetime64: How can I make a python numpy arange of datetime
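
For instance, np.arange accepts datetime64/timedelta64 directly (the 165-second step here is only a guess at the sample rate):

    import numpy as np

    start = np.datetime64('2009-01-01T18:41:00')
    stop  = np.datetime64('2009-01-02T00:00:00')
    step  = np.timedelta64(165, 's')

    # evenly spaced timestamps, dtype datetime64[s]
    stamps = np.arange(start, stop, step)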

Concatenating large arrays in numpy is slow; it's best to use a pre-allocated array and fill it in using slices: How to add items into a numpy array
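
Roughly like this (a sketch, assuming `parts` is the list of per-file arrays returned by genfromtxt and the total row count is known up front):

    import numpy as np

    total_rows = 5413260                           # known in advance
    out = np.zeros(total_rows, dtype=parts[0].dtype)   # allocate once

    pos = 0
    for part in parts:                             # per-file arrays from genfromtxt
        out[pos:pos + len(part)] = part            # slice assignment, no reallocation
        pos += len(part)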

Then, to merge two structured/record arrays based on matching datetime64 values and mask the appropriate fields: Compare two numpy arrays by first Column and create a third numpy array by concatenating two arrays
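
A sketch of that merge step, assuming the real timestamps fall on a regular 165-second grid and using 9999 as a stand-in for the CONSTANT fill value:

    import numpy as np

    step = np.timedelta64(165, 's')                  # assumed sample rate
    grid = np.arange(data['datetime'][0],
                     data['datetime'][-1] + step, step)

    # pre-filled output: stub rows everywhere, real rows dropped in below
    filled = np.zeros(len(grid), dtype=data.dtype)
    filled['datetime'] = grid
    filled['signal'] = 9999                          # placeholder CONSTANT

    # positions on the grid whose timestamp matches a real row
    idx = np.searchsorted(grid, data['datetime'])
    hit = grid[idx] == data['datetime']
    filled[idx[hit]] = data[hit]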

Overall speedup looks like 100+ days => <5 min (28,800x faster). The pre-allocated array should also speed up loading from csv.

  • This doesn't appear to be an answer. Additional information about your question should be added to the question itself. – Paul H May 10 '17 at 16:26
  • It's most certainly the answer, short of copying and pasting all the code, which is unique to my data. I hope the information I've gathered here will be helpful to others new to numpy experiencing difficulty pre-processing time series data. – encore2097 May 10 '17 at 16:34