
I recently started to explore the depths of pandas and would like to visualize some time-series data which contains gaps, some of them rather large. An example mydf:

             timestamp       val
0  2016-07-25 00:00:00  0.740442
1  2016-07-25 01:00:00  0.842911
2  2016-07-25 02:00:00 -0.873992
3  2016-07-25 07:00:00 -0.474993
4  2016-07-25 08:00:00 -0.983963
5  2016-07-25 09:00:00  0.597011
6  2016-07-25 10:00:00 -2.043023
7  2016-07-25 12:00:00  0.304668
8  2016-07-25 13:00:00  1.185997
9  2016-07-25 14:00:00  0.920850
10 2016-07-25 15:00:00  0.201423
11 2016-07-25 16:00:00  0.842970
12 2016-07-25 21:00:00  1.061207
13 2016-07-25 22:00:00  0.232180
14 2016-07-25 23:00:00  0.453964

Now I could plot my DataFrame through mydf.plot(x='timestamp').get_figure().show(), and the data would be interpolated across the gaps, appearing as one continuous line: plot0

What I would like to have instead is:

  • visible gaps between sections with data
  • a consistent gap width for differing gap lengths
  • perhaps some form of marker on the axis which clarifies that a jump in time has occurred.

Researching this matter, I've come across

which generally come close to what I'm after, but the former approach would simply leave the gaps out of the plotted figure, while the latter would produce the large empty stretches I'd like to avoid (think of gaps that may even span a few days).

As the second approach seems closer, I tried to use my timestamp column as an index through:

mydf2 = pd.DataFrame(data=list(mydf['val']), index=mydf['timestamp'])

which allows me to fill the gaps with NaN through reindexing (I wonder if there is a simpler way to achieve this):

mydf3 = mydf2.reindex(pd.date_range('25/7/2016', periods=24, freq='H'))

leading to:

                          val
2016-07-25 00:00:00  0.740442
2016-07-25 01:00:00  0.842911
2016-07-25 02:00:00 -0.873992
2016-07-25 03:00:00       NaN
2016-07-25 04:00:00       NaN
2016-07-25 05:00:00       NaN
2016-07-25 06:00:00       NaN
2016-07-25 07:00:00 -0.474993
2016-07-25 08:00:00 -0.983963
2016-07-25 09:00:00  0.597011
2016-07-25 10:00:00 -2.043023
2016-07-25 11:00:00       NaN
2016-07-25 12:00:00  0.304668
2016-07-25 13:00:00  1.185997
2016-07-25 14:00:00  0.920850
2016-07-25 15:00:00  0.201423
2016-07-25 16:00:00  0.842970
2016-07-25 17:00:00       NaN
2016-07-25 18:00:00       NaN
2016-07-25 19:00:00       NaN
2016-07-25 20:00:00       NaN
2016-07-25 21:00:00  1.061207
2016-07-25 22:00:00  0.232180
2016-07-25 23:00:00  0.453964
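A side note on the "simpler solution" wondered about above: set_index() plus asfreq() should get to the same NaN-padded result in one chained call. A minimal sketch (only the first rows of mydf, values copied from the table above, so the grid stops at 07:00 instead of 23:00):

```python
import pandas as pd

# rebuild a truncated version of mydf for illustration
mydf = pd.DataFrame({
    'timestamp': pd.to_datetime(['2016-07-25 00:00:00',
                                 '2016-07-25 01:00:00',
                                 '2016-07-25 02:00:00',
                                 '2016-07-25 07:00:00']),
    'val': [0.740442, 0.842911, -0.873992, -0.474993],
})

# use the timestamp column as index, then pad to an hourly grid;
# asfreq() inserts a NaN row for every missing hour
mydf3 = mydf.set_index('timestamp').asfreq('H')
```

This avoids both the list() detour and the hand-built date_range, as long as the first and last timestamps bound the range you want.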

From here on I might need to reduce consecutive runs of missing entries over a certain limit to a fixed number of rows (representing my gap width) and mark the index values of those entries so they are plotted differently, but this is where I got lost, as I don't know how to achieve something like that.
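One possible sketch of the "reduce each run of NaNs to a single placeholder" step, using shift() to keep only a row that is non-NaN or directly follows a non-NaN row (toy data and names are mine):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 2.0, np.nan, 3.0],
              index=pd.date_range('2016-07-25', periods=7, freq='H'))

# keep rows that hold a value, or whose predecessor holds a value;
# this leaves exactly one NaN at the start of each gap
collapsed = s[s.notna() | s.shift().notna()]
```

The remaining single NaN per gap then breaks the plotted line without stretching the x-axis by the full gap length.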

While tinkering around I wondered if there might be a more direct and elegant approach, and I would be thankful if anyone knowing more about this could point me in the right direction.

Thanks for any hints and feedback in advance!

### ADDENDUM ###

After posting my question I came across another interesting idea posted by Andy Hayden that seems helpful. He uses a column to hold the result of comparing the diff() of the timestamps with a time delta. After performing cumsum() on the int representation of the boolean results, he uses groupby() to cluster the entries of each ungapped series into a DataFrameGroupBy object.

As this was written some time ago and pandas now returns timedelta objects, the comparison should be done against another timedelta object, like so (based on mydf from above, or on the reindexed mydf2 after copying its index to a new column through mydf2['timestamp'] = mydf2.index):

from datetime import timedelta
myTD = timedelta(minutes=60)
mydf['nogap'] = mydf['timestamp'].diff() > myTD
mydf['nogap'] = mydf['nogap'].astype(int).cumsum()
## note: .astype(int) (or .apply(int)) is a simpler alternative to .apply(lambda x: 1 if x else 0)
dfg = mydf.groupby('nogap')

We could now iterate over the DataFrameGroupBy object, getting the ungapped series, and do something with them. My pandas/matplotlib skills are way too immature, but could we plot the group elements into sub-plots? Maybe that way the discontinuity along the time axis could be represented in some form (an interrupted axis line or such)?
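As a sketch of that subplot idea (my own toy data and variable names, not from the thread): one axis per ungapped run, with gridspec width_ratios keeping the x-scale roughly consistent across runs, so the small fixed spaces between axes act as the time-jump markers:

```python
from datetime import timedelta

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

# toy data with two gaps, mirroring the mydf/nogap construction above
rng = np.random.RandomState(0)
hours = [0, 1, 2, 7, 8, 12, 13, 14]
mydf = pd.DataFrame({
    'timestamp': pd.to_datetime('2016-07-25') + pd.to_timedelta(hours, unit='h'),
    'val': rng.randn(len(hours)),
})
mydf['nogap'] = (mydf['timestamp'].diff() > timedelta(minutes=60)).cumsum()

groups = list(mydf.groupby('nogap'))

# one sub-axis per ungapped run; width proportional to its number of points,
# shared y-axis, and a small wspace acting as the visible gap marker
widths = [len(g) for _, g in groups]
fig, axes = plt.subplots(1, len(groups), sharey=True,
                         gridspec_kw={'width_ratios': widths, 'wspace': 0.05})
for ax, (_, g) in zip(axes, groups):
    ax.plot(g['timestamp'], g['val'])
    ax.spines['right'].set_visible(False)  # open the frame at each gap
    ax.spines['left'].set_visible(False)
```

With a bit more styling (diagonal "break" ticks at the open spines, shared tick formatting) this should approximate the classic broken-axis look.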

piRSquared's answer already leads to a quite usable result; the only thing still missing is a more striking visual cue along the time axis that a gap/time jump has occurred between two values.

Maybe with the grouped sections the width of the gap representation could be made more configurable?

1 Answer

I built a new series and plotted it. This is not super elegant! But I believe it gets you what you wanted.

Setup

Do this to get to your starting point:

from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
import pandas as pd
import matplotlib.pyplot as plt

text = """          timestamp       val
2016-07-25 00:00:00   0.740442
2016-07-25 01:00:00   0.842911
2016-07-25 02:00:00  -0.873992
2016-07-25 07:00:00  -0.474993
2016-07-25 08:00:00  -0.983963
2016-07-25 09:00:00   0.597011
2016-07-25 10:00:00  -2.043023
2016-07-25 12:00:00   0.304668
2016-07-25 13:00:00   1.185997
2016-07-25 14:00:00   0.920850
2016-07-25 15:00:00   0.201423
2016-07-25 16:00:00   0.842970
2016-07-25 21:00:00   1.061207
2016-07-25 22:00:00   0.232180
2016-07-25 23:00:00   0.453964"""

s1 = pd.read_csv(StringIO(text),
                 index_col=0,
                 parse_dates=[0],
                 engine='python',
                 sep=r'\s{2,}').squeeze()

s1

timestamp
2016-07-25 00:00:00    0.740442
2016-07-25 01:00:00    0.842911
2016-07-25 02:00:00   -0.873992
2016-07-25 07:00:00   -0.474993
2016-07-25 08:00:00   -0.983963
2016-07-25 09:00:00    0.597011
2016-07-25 10:00:00   -2.043023
2016-07-25 12:00:00    0.304668
2016-07-25 13:00:00    1.185997
2016-07-25 14:00:00    0.920850
2016-07-25 15:00:00    0.201423
2016-07-25 16:00:00    0.842970
2016-07-25 21:00:00    1.061207
2016-07-25 22:00:00    0.232180
2016-07-25 23:00:00    0.453964
Name: val, dtype: float64

Resample hourly. resample is a deferred method, meaning it expects you to chain an aggregation method afterwards so it knows what to do. I used mean. For your example the choice doesn't matter, because we are upsampling to a higher frequency and each hourly bucket holds at most one value. Look it up if you care.

s2 = s1.resample('H').mean()

s2

timestamp
2016-07-25 00:00:00    0.740442
2016-07-25 01:00:00    0.842911
2016-07-25 02:00:00   -0.873992
2016-07-25 03:00:00         NaN
2016-07-25 04:00:00         NaN
2016-07-25 05:00:00         NaN
2016-07-25 06:00:00         NaN
2016-07-25 07:00:00   -0.474993
2016-07-25 08:00:00   -0.983963
2016-07-25 09:00:00    0.597011
2016-07-25 10:00:00   -2.043023
2016-07-25 11:00:00         NaN
2016-07-25 12:00:00    0.304668
2016-07-25 13:00:00    1.185997
2016-07-25 14:00:00    0.920850
2016-07-25 15:00:00    0.201423
2016-07-25 16:00:00    0.842970
2016-07-25 17:00:00         NaN
2016-07-25 18:00:00         NaN
2016-07-25 19:00:00         NaN
2016-07-25 20:00:00         NaN
2016-07-25 21:00:00    1.061207
2016-07-25 22:00:00    0.232180
2016-07-25 23:00:00    0.453964
Freq: H, Name: val, dtype: float64
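To see the aggregation actually do something, downsample instead of upsampling — a toy sketch of my own, not part of the original answer:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0],
              index=pd.date_range('2016-07-25', periods=4, freq='H'))

# upsampling (as above) just inserts NaN rows; downsampling aggregates,
# so the choice of .mean() vs .sum() etc. now changes the result
down = s.resample('2H').mean()
```

Here each two-hour bucket collapses two values into their mean, which is why resample insists on being told an aggregation.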

OK, so you also wanted equally sized gaps. This was a tad tricky. I used ffill(limit=1) to forward-fill only one slot of each gap. Then I took the slice of s2 where this forward-filled series is not null. This leaves a single NaN for each gap.

s3 = s2[s2.ffill(limit=1).notnull()]

s3

timestamp
2016-07-25 00:00:00    0.740442
2016-07-25 01:00:00    0.842911
2016-07-25 02:00:00   -0.873992
2016-07-25 03:00:00         NaN
2016-07-25 07:00:00   -0.474993
2016-07-25 08:00:00   -0.983963
2016-07-25 09:00:00    0.597011
2016-07-25 10:00:00   -2.043023
2016-07-25 11:00:00         NaN
2016-07-25 12:00:00    0.304668
2016-07-25 13:00:00    1.185997
2016-07-25 14:00:00    0.920850
2016-07-25 15:00:00    0.201423
2016-07-25 16:00:00    0.842970
2016-07-25 17:00:00         NaN
2016-07-25 21:00:00    1.061207
2016-07-25 22:00:00    0.232180
2016-07-25 23:00:00    0.453964
Name: val, dtype: float64
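To make the mechanics of that mask concrete, a minimal sketch on a four-element series (my own toy example):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 2.0])

filled = s.ffill(limit=1)   # only the first NaN of the run gets filled
mask = filled.notnull()     # [True, True, False, True]
collapsed = s[mask]         # keeps exactly one NaN per gap
```

Every NaN after the first in a run stays NaN even after the limited forward fill, so the mask drops it.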

Lastly, if I plotted this, I'd still get irregular gaps, because matplotlib spreads a DatetimeIndex out in real time. I need string labels so that matplotlib doesn't try to expand my dates.

s3.set_axis(s3.index.strftime('%H:%M'))

timestamp
00:00    0.740442
01:00    0.842911
02:00   -0.873992
03:00         NaN
07:00   -0.474993
08:00   -0.983963
09:00    0.597011
10:00   -2.043023
11:00         NaN
12:00    0.304668
13:00    1.185997
14:00    0.920850
15:00    0.201423
16:00    0.842970
17:00         NaN
21:00    1.061207
22:00    0.232180
23:00    0.453964
Name: val, dtype: float64

I'll plot them together so we can see the difference.

f, a = plt.subplots(2, 1, sharey=True, figsize=(10, 5))
s2.plot(ax=a[0])
s3.set_axis(s3.index.strftime('%H:%M')).plot(ax=a[1])

(figure: the two plots compared — s2 on top with gaps stretched to real time, s3 below with uniform one-slot gaps)

– piRSquared
    This comes quite close to what I thought of, thanks for sharing this! Using `resample` instead of `reindex` seems like a good idea, as does the nice use of `ffill` in combination with `notnull()`. Guess I need to take a closer look into subplotting to understand the last piece. – antiplex Jul 26 '16 at 08:54