I've got a DataFrame of ~40,000 rows. The DataFrame looks roughly like:
           Unix Time                         UTC  Val. 1  Val. 2  Val. 3
1  1518544176.927486  2018-02-13 17:49:36.927486    5.00    0.25    2.00
2  1518544176.929897  2018-02-13 17:49:36.929897    4.50    1.00    3.00
3  1518544176.932310  2018-02-13 17:49:36.932310    4.00    0.75    1.75
...
Columns 0 and 2-4 are of type <class 'numpy.float64'>; column 1 is of type <class 'pandas._libs.tslib.Timestamp'>.
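In case it helps to reproduce the timing, a rough stand-in for the real data can be built along these lines (the ~2.4 ms spacing, the single artificial gap, and the random values are just placeholders; only the column names and dtypes match the real DataFrame):

import numpy as np
import pandas as pd

n = 40000
# Evenly spaced samples ~2.4 ms apart, starting from the first timestamp above.
unix = 1518544176.927486 + 0.0024 * np.arange(n)
unix[n // 2:] += 0.3                     # mimic a single ~0.3 s acquisition gap
df = pd.DataFrame({'Unix Time': unix,
                   'UTC': pd.to_datetime(unix, unit='s'),
                   'Val. 1': np.random.rand(n),
                   'Val. 2': np.random.rand(n),
                   'Val. 3': np.random.rand(n)})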
When any of the data columns is plotted against time, we see a waveform. However, there are occasional breaks in the acquisition. For instance, we might have:
             Unix Time                         UTC  Val. 1  Val. 2  Val. 3
576  1518544181.755085  2018-02-13 17:49:41.755085    0.10    0.01    0.93
577  1518544182.041129  2018-02-13 17:49:42.041129    0.11    0.02    0.95
...
As one can see, there's a ~0.3 s gap between readings 576 and 577. The problem is that when plotting the data, matplotlib connects the dots, even where there's no data. Solutions to this "problem" have been addressed in other questions on Stack Overflow and online at-large, and though I'm not in love with... well, any of them, the best option seems to be to insert NaNs into the data gaps. Since matplotlib doesn't plot NaNs, it's a sneaky way to trick it into making your plot more realistic.
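As a toy illustration of the trick (unrelated to my actual data), matplotlib simply leaves a hole in the line wherever it meets a NaN:

import numpy as np
import matplotlib.pyplot as plt

y = np.array([0.0, 1.0, 0.0, np.nan, 0.0, 1.0, 0.0])
plt.plot(y)     # drawn as two separate segments, with a break at the NaN
plt.show()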
To do this, I start by finding the time delta between the first two readings (this is safe) and using twice that value as my metric for "is there a gap?". I then iterate through the DataFrame, checking for gaps. Upon finding one, I create a temporary row with NaNs in the data columns and, in the time columns, timestamps sitting right in the middle of the acquisition gap. I then build a new DataFrame out of the old one plus this row. This can be seen here:
import numpy as np
import pandas as pd

# "Is there a gap?" threshold: twice the delta-t of the first two readings.
allowed_gap = 2.0 * (df['Unix Time'].iloc[1] - df['Unix Time'].iloc[0])

df2 = df.copy()
for i, row in df.iterrows():
    # The following code checks the delta-t of all timestamp pairs.
    # We have i > 0 because it can't activate on the first entry.
    if i > 0:
        delta_t_unix = row['Unix Time'] - prev_timestamp_unix
        delta_t_utc = row['UTC'] - prev_timestamp_utc
        # If delta_t_unix exceeds the allowed data gap, add new timestamps and NaNs.
        if delta_t_unix > allowed_gap:
            time_unix = row['Unix Time'] - (delta_t_unix / 2.0)
            time_utc = row['UTC'] - (delta_t_utc / 2.0)
            new_row = pd.DataFrame({'Unix Time': time_unix, 'UTC': time_utc,
                                    'Val. 1': np.nan, 'Val. 2': np.nan,
                                    'Val. 3': np.nan}, index=[i])
            # .ix is deprecated; .loc with these integer labels slices the same way.
            df2 = pd.concat([df2.loc[:i - 1], new_row,
                             df2.loc[i:]]).reset_index(drop=True)
    # Store the current timestamps for the next iteration's comparison.
    prev_timestamp_unix = row['Unix Time']
    prev_timestamp_utc = row['UTC']
# Restore the original column order in the final DataFrame.
df2 = df2[['Unix Time', 'UTC', 'Val. 1', 'Val. 2', 'Val. 3']]
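For reference, a quick sanity check along these lines (assuming the column names above) confirms where the NaN rows landed and that the plotted line now breaks at the gaps:

import matplotlib.pyplot as plt

gap_rows = df2[df2['Val. 1'].isna()]        # inserted rows are the only NaN rows
print(len(gap_rows), 'gap rows inserted')

plt.plot(df2['Unix Time'], df2['Val. 1'])   # the line is interrupted at each gap
plt.show()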
This currently takes ~4.5 seconds, thanks to this question (it used to take ~6.5 seconds, as I was foolishly iterating through the DataFrame, building new lists for each column, and then creating a new DataFrame out of those). However, that's still far slower than I'd expect or prefer. Does anyone have any ideas on how to speed this up? I'm still quite new to Pandas and DataFrames, so I'm sure this could be done better. Thanks!
EDIT: It's worth mentioning that if I remove the UTC datetime column, the runtime drops by roughly half (though unfortunately I can't remove that column in practice).