
I've got a DataFrame of ~40,000 rows, which looks roughly like:

             Unix Time                           UTC  Val. 1  Val. 2  Val. 3
1    1518544176.927486    2018-02-13 17:49:36.927486    5.00    0.25    2.00
2    1518544176.929897    2018-02-13 17:49:36.929897    4.50    1.00    3.00
3    1518544176.932310    2018-02-13 17:49:36.932310    4.00    0.75    1.75
...

Columns 0 and 2-4 are of type <class 'numpy.float64'>, and column 1 is of type <class 'pandas._libs.tslib.Timestamp'>. Plotting any of the data columns against time shows a waveform. However, there are occasional breaks in the acquisition. For instance, we might have:

               Unix Time                           UTC  Val. 1  Val. 2  Val. 3
576    1518544181.755085    2018-02-13 17:49:41.755085    0.10    0.01    0.93
577    1518544182.041129    2018-02-13 17:49:42.041129    0.11    0.02    0.95
...

As one can see, there's a ~0.3 s gap between readings 576 and 577. The problem is that when plotting the data, matplotlib connects the dots, even where there's no data. Solutions to this "problem" have been addressed in other questions on Stack Overflow and online at large, and though I'm not in love with... well, any of them, the best option seems to be to insert NaNs into the data gaps. Since matplotlib doesn't plot NaNs, it's a sneaky way to trick it into making your plot more realistic.
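
To illustrate the trick, here's a minimal sketch (with made-up sample values, not my real data) showing that matplotlib simply breaks the line wherever it meets a NaN:

import numpy as np
import matplotlib.pyplot as plt

# Made-up waveform with a NaN standing in for an acquisition gap.
t = [0.0, 1.0, 2.0, 2.5, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 1.5, np.nan, 1.8, 2.2, 1.9]

# matplotlib skips the NaN instead of connecting the points on either
# side of it, so the line is drawn in two disconnected segments.
plt.plot(t, y, '-o')
plt.show()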

To do this, I start by finding the time delta between the first two readings (this is safe), and use twice that value as my metric for "is there a gap?" I then iterate through the DataFrame, checking for gaps. Upon finding one, I create a temporary row with NaNs in the data columns and timestamps right in the middle of the acquisition gap in the time columns, then splice it into a copy of the original DataFrame. This can be seen here:

import numpy as np
import pandas as pd

# Twice the delta-t of the first two readings is the "is there a gap?" metric.
allowed_gap = 2.0 * (df['Unix Time'].iloc[1] - df['Unix Time'].iloc[0])

df2 = df.copy()
inserted = 0  # Number of NaN rows spliced into df2 so far.
for i, row in df.iterrows():
    # The following code checks the delta-t of all timestamp pairs.
    # We have i > 0 because the check can't activate on the first entry.
    if i > 0:
        delta_t_unix = row['Unix Time'] - prev_timestamp_unix
        delta_t_utc = row['UTC'] - prev_timestamp_utc
        # If delta_t_unix > the allowed data gap, add new timestamps and NaNs.
        if delta_t_unix > allowed_gap:
            time_unix = row['Unix Time'] - (delta_t_unix / 2.0)
            time_utc = row['UTC'] - (delta_t_utc / 2.0)
            new_row = pd.DataFrame({'Unix Time': time_unix, 'UTC': time_utc,
                                    'Val. 1': np.nan, 'Val. 2': np.nan,
                                    'Val. 3': np.nan}, index=[i])
            # Earlier insertions shift df2 relative to df, hence the offset.
            df2 = pd.concat([df2.iloc[:i + inserted], new_row,
                             df2.iloc[i + inserted:]]).reset_index(drop=True)
            inserted += 1
    # Set the previous timestamps for use at the top of the next iteration.
    prev_timestamp_unix = row['Unix Time']
    prev_timestamp_utc = row['UTC']
# Restore the original column order.
df2 = df2[['Unix Time', 'UTC', 'Val. 1', 'Val. 2', 'Val. 3']]

This currently takes ~4.5 seconds, thanks to this question (it used to take ~6.5 s, as I was foolishly iterating through and creating new lists of each column, then creating a new DataFrame out of those). However, that's still far slower than I'd expect or prefer. Does anyone have ideas on how to speed this up? I'm still quite new to pandas and DataFrames, so I'm sure this could be better. Thanks!

EDIT: Worth mentioning that if I remove the datetime column, the runtime is cut roughly in half (though unfortunately I can't remove that column in practice).
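
If it helps anyone, one workaround I'm considering (a minimal sketch, assuming 'Unix Time' holds seconds since the epoch): drop the datetime column before the loop and rebuild it afterwards from the Unix timestamps in a single vectorized call.

import pandas as pd

# Hypothetical restructuring, not my current code: run the gap-filling
# loop on 'Unix Time' alone, then reconstruct 'UTC' in one shot.
df2 = df.drop('UTC', axis=1)
# ... gap-filling loop over 'Unix Time' only ...
df2['UTC'] = pd.to_datetime(df2['Unix Time'], unit='s')
df2 = df2[['Unix Time', 'UTC', 'Val. 1', 'Val. 2', 'Val. 3']]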

erekalper
  • Actually, I am pretty sure pandas has a built-in function for what you want. I am checking this out and I'll post it for you – joaoavf Feb 19 '18 at 14:16
  • Didn't find anything that would produce exactly what you have built. I was actually mistaken in thinking that resampling could do it. – joaoavf Feb 19 '18 at 14:25
  • I've corrected some errors in my post, and it should work now. If you can share how long that code took to run, I would appreciate that. – joaoavf Feb 19 '18 at 14:44

2 Answers


You can resample to 2.4 ms using something like:

df['utc_time'] = pd.to_datetime(df['utc_time'])
df.set_index(df['utc_time'])[['val1','val2','val3']].resample('2.4ms', loffset='1.2ms').mean().reset_index()
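
As the comments below note, nothing is kept unless the result is assigned; a minimal usage sketch (`df1` is just an illustrative name):

df1 = df.set_index(df['utc_time'])[['val1', 'val2', 'val3']].resample('2.4ms', loffset='1.2ms').mean().reset_index()
# Bins that fall inside an acquisition gap contain no samples, so .mean()
# yields NaN there and matplotlib breaks the line at those points.
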
Stev
  • Not a bad idea at all, but unfortunately I don't think resampling is what I'm looking for (unless I misunderstand what it does). I'd like to be able to see every last bit of the data, including the gaps. – erekalper Feb 19 '18 at 14:59
  • If you resample an irregularly sampled waveform onto the smallest time step, which appears to be 2.4ms in the data you've shown, you won't lose any data. – Stev Feb 19 '18 at 15:03
  • The smallest time step will change from dataset to dataset, and there's some jitter within each dataset as well. Granted, that's all easy to account for, but the basic point is that I actually need to visualize the gaps in the data, which this method removes (at least, I tried what you posted and it removed the gaps). – erekalper Feb 19 '18 at 15:13
  • I'm not sure what you mean by 'removes gaps'. The code resamples your irregularly sampled time series into a regular one, putting NaNs where there are no values. – Stev Feb 19 '18 at 15:27
  • I apologize, as I'm not trying to be obtuse or anything, but this doesn't insert NaNs. In fact, I saved the output to a file and it's got the original's identical time stamp values. Doesn't look like anything's been changed, which makes me think I'm not using this correctly. Sorry! – erekalper Feb 19 '18 at 15:58
  • No worries, I am trying to be helpful. Are you using an IDE? If you run those two lines, what displays in the console? You just get the input back? If you want to save the result, you will have to add `df1=` to the front of the second line. – Stev Feb 19 '18 at 16:11
  • Ugh. I'm dumb. This is all in a test script right now, and I was accidentally redefining df1 before outputting it again. This does indeed work, but I imagine because the min isn't exactly 2.4 ms, it is slightly changing the time stamps. It's also more than doubling the length of the DataFrame, because for each gap it's adding NaNs at 2.4 ms intervals (when you really only need one). ... – erekalper Feb 19 '18 at 16:51
  • ... That said, it is much faster than what we did above (fractions of seconds at this point, but still). I'm going to leave @joaoavf's as the answer for now, because it perfectly retains the original data, but I will definitely be exploring this. Thank you for your help! – erekalper Feb 19 '18 at 16:53

Some things that can speed this up:

  1. Use `df.itertuples()` instead of `df.iterrows()`; it will probably improve the execution time somewhat.

If you can post the improvements, I would appreciate it; in my experience it made a huge difference in the cases I tested (a 10x improvement with much simpler instructions inside the loop).

  2. Instead of using `pd.concat` to put rows together, create a list of tuples and then generate a DataFrame from that list only after the loop:

    list_of_values = []
    for i, unix_time, utc_time, val1, val2, val3 in df.itertuples():

        if i > 0:
            delta_t_unix = unix_time - prev_timestamp_unix
            delta_t_utc = utc_time - prev_timestamp_utc

            # On a gap, append a midpoint-timestamp row filled with NaNs.
            if delta_t_unix > allowed_gap:
                new_time_unix = unix_time - (delta_t_unix / 2.0)
                new_time_utc = utc_time - (delta_t_utc / 2.0)
                list_of_values.append((new_time_unix, new_time_utc,
                                       np.nan, np.nan, np.nan))

        # Set the previous timestamps for use at the top of the next pass.
        prev_timestamp_unix = unix_time
        prev_timestamp_utc = utc_time

        # Always append the original row's values.
        list_of_values.append((unix_time, utc_time, val1, val2, val3))

    df2 = pd.DataFrame(list_of_values,
                       columns=['Unix Time', 'UTC', 'Val. 1', 'Val. 2', 'Val. 3'])
    

That will probably speed things up quite a lot.
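
For context on why this helps: each `pd.concat` copies the whole accumulated DataFrame, so every gap costs a full copy of ~40,000 rows, while appending to a plain list is cheap and the DataFrame is built only once at the end. To measure the difference, a quick sketch (`fill_gaps` is a hypothetical wrapper around whichever version of the loop you're timing):

import time

start = time.perf_counter()
df2 = fill_gaps(df)  # hypothetical function wrapping the loop above
print('elapsed: %.3f s' % (time.perf_counter() - start))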

joaoavf
  • I'd never heard of `itertuples`. Thank you! I'm running tests now. – erekalper Feb 19 '18 at 14:53
  • @erekalper, it was missing this line `list_of_values.append((new_time_unix, new_time_utc, val1, val2, val3))` to fill the old values as well in the new DataFrame – joaoavf Feb 19 '18 at 15:00
  • Yeah, I noticed :). Thanks for editing! (And technically the values in that new appended list should be `unix_time` and `utc_time`, but I know what you meant.) – erekalper Feb 19 '18 at 15:01
  • @erekalper, oops, sorry, I've edited those. Thanks! :) – joaoavf Feb 19 '18 at 15:10
  • @erekalper, I think using a `queue` or `deque` might speed up things compared to the `list`, but as the dataset is not huge, I think it will be a really slight improvement that might not be worth implementing. – joaoavf Feb 19 '18 at 15:12
  • Your method (1) brought it down to ~1 s, and (2) brought it down to ~0.75 s. This is unbelievably helpful, and I will definitely be using itertuples in the future! – erekalper Feb 19 '18 at 15:15
  • Thanks for the update, and I am glad I could help :) Those numbers make more sense. If you can optimize from there, let me know, because that would be very helpful for me as well. – joaoavf Feb 19 '18 at 15:17