I want to annotate a plot of multivariate time-series with time intervals (in colour for each type of annotation).

data overview

An example dataset looks like this:

            metrik_0  metrik_1  metrik_2  geospatial_id  topology_id  \
2020-01-01 -0.848009  1.305906  0.924208             12            4   
2020-01-01 -0.516120  0.617011  0.623065              8            3   
2020-01-01  0.762399 -0.359898 -0.905238             19            3   
2020-01-01  0.708512 -1.502019 -2.677056              8            4   
2020-01-01  0.249475  0.590983 -0.677694             11            3   

            cohort_id  device_id  
2020-01-01          1          1  
2020-01-01          1          9  
2020-01-01          2         13  
2020-01-01          2          8  
2020-01-01          1         12  

The labels look like this:

   cohort_id marker_type               start                 end
0          1           a 2020-01-02 00:00:00                 NaT
1          1           b 2020-01-04 05:00:00 2020-01-05 16:00:00
2          1           a 2020-01-06 00:00:00                 NaT

desired result

  • multivariate plot of all the time series of a cohort_id
  • highlighting for the markers (a different colour for each type)
    • note that the markers may overlap, so transparency is useful
    • there will be attenuation around markers of type a (configured by a number of hours)
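
One possible reading of the last bullet, as a minimal sketch: markers of type a have no end, so they are widened into an interval before plotting. The helper name `expand_markers`, the `hours` parameter and the symmetric window are assumptions made for illustration, not part of the data above.

```python
import pandas as pd

def expand_markers(markers: pd.DataFrame, hours: int = 6) -> pd.DataFrame:
    """Return a copy in which markers without an end span start +/- `hours`."""
    out = markers.copy()
    pad = pd.Timedelta(hours=hours)
    is_point = out["end"].isna()
    # Set the end first, while "start" still holds the original timestamp.
    out.loc[is_point, "end"] = out.loc[is_point, "start"] + pad
    out.loc[is_point, "start"] = out.loc[is_point, "start"] - pad
    return out

# Same labels as shown above, built from explicit timestamps.
marker_labels = pd.DataFrame({
    "cohort_id": [1, 1, 1],
    "marker_type": ["a", "b", "a"],
    "start": [pd.Timestamp("2020-01-02 00:00"),
              pd.Timestamp("2020-01-04 05:00"),
              pd.Timestamp("2020-01-06 00:00")],
    "end": [pd.NaT, pd.Timestamp("2020-01-05 16:00"), pd.NaT],
})

intervals = expand_markers(marker_labels, hours=6)
print(intervals)
```

Every row of `intervals` then has both a start and an end, so all marker types can be shaded with the same plotting call.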

I thought about using seaborn/matplotlib for this task.

So far I have come up with:

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns; sns.set()

aut_locator = mdates.AutoDateLocator(minticks=3, maxticks=7)
aut_formatter = mdates.ConciseDateFormatter(aut_locator)

g = df[df['cohort_id'] == 1].plot(figsize=(8,8))
g.xaxis.set_major_locator(aut_locator)
g.xaxis.set_major_formatter(aut_formatter)
plt.show()

which is rather chaotic. I fear it will not be possible to fit all the metrics (multivariate data) into a single plot, so it should be faceted by column. However, this would require reshaping the dataframe for seaborn's FacetGrid to work, which does not feel quite right either, especially as the number of elements (time series) in a cohort_id grows. If FacetGrid is the right way, then something along the lines of https://seaborn.pydata.org/examples/timeseries_facets.html would be the first part, but the labels would still be missing.
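
For reference, the reshaping step mentioned above could look roughly like the following sketch. It uses a small stand-in for `df` (two devices, four hourly observations), and the column names `timestamp`, `metric` and `value` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for the real df: two devices, three metrics, hourly index.
rng = np.random.default_rng(47)
index = pd.date_range("2020-01-01", periods=4, freq=pd.Timedelta(hours=1))
metric_cols = ["metrik_0", "metrik_1", "metrik_2"]
frames = []
for device_id in (1, 2):
    part = pd.DataFrame(rng.standard_normal((4, 3)), index=index, columns=metric_cols)
    part["cohort_id"] = 1
    part["device_id"] = device_id
    frames.append(part)
df = pd.concat(frames).sort_index()

# Wide to long: one row per (timestamp, device, metric) combination.
long_df = (
    df.rename_axis("timestamp")
      .reset_index()
      .melt(id_vars=["timestamp", "cohort_id", "device_id"],
            value_vars=metric_cols,
            var_name="metric", value_name="value")
)
print(long_df.head())
```

The long frame could then be handed to something like `sns.relplot(data=long_df, x="timestamp", y="value", row="metric", hue="device_id", kind="line")`, which gives one facet per metric.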

How could the labels be added? How should the first part be accomplished?

An example of the desired result: https://i.stack.imgur.com/JYilG.jpg, i.e. one such panel for each metric.

code for the example data

The datasets are generated from the code snippet below:

import pandas as pd
import numpy as np

import random
random_seed = 47
np.random.seed(random_seed)
random.seed(random_seed)
def generate_df_for_device(n_observations, n_metrics, device_id, geo_id, topology_id, cohort_id):
    # one row per hour, one column of random values per metric
    df = pd.DataFrame(np.random.randn(n_observations, n_metrics), index=pd.date_range('2020', freq='h', periods=n_observations))
    df.columns = [f'metrik_{c}' for c in df.columns]
    df['geospatial_id'] = geo_id
    df['topology_id'] = topology_id
    df['cohort_id'] = cohort_id
    df['device_id'] = device_id
    return df

def generate_multi_device(n_observations, n_metrics, n_devices, cohort_levels, topo_levels):
    results = []
    for i in range(1, n_devices +1):
        #print(i)
        r = random.randrange(1, n_devices)
        cohort = random.randrange(1, cohort_levels)
        topo = random.randrange(1, topo_levels)
        df_single_dvice = generate_df_for_device(n_observations, n_metrics, i, r, topo, cohort)
        results.append(df_single_dvice)
        #print(r)
    return pd.concat(results)

# hourly data, 1 week of data
n_observations = 7 * 24
n_metrics = 3
n_devices = 20
cohort_levels = 3
topo_levels = 5

df = generate_multi_device(n_observations, n_metrics, n_devices, cohort_levels, topo_levels)
df = df.sort_index()
df.head()

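# all timestamps use one format so that pd.to_datetime parses them consistently
marker_labels = pd.DataFrame({'cohort_id':[1, 1, 1], 'marker_type':['a', 'b', 'a'], 'start':['2020-01-02 00:00', '2020-01-04 05:00', '2020-01-06 00:00'], 'end':[np.nan, '2020-01-05 16:00', np.nan]})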
marker_labels['start'] = pd.to_datetime(marker_labels['start'])
marker_labels['end'] = pd.to_datetime(marker_labels['end'])
Georg Heiler

1 Answer

In general, you can use plt.fill_between for horizontal bands and plt.fill_betweenx for vertical bands. For "bands-within-bands" you can just call the method twice.

A basic example using your data would look like this. I've used fixed values for the position of the bands, but you can put them on the main dataframe and reference them dynamically inside the loop.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(3 ,figsize=(20, 9), sharex=True)
plt.subplots_adjust(hspace=0.2)

metriks = ["metrik_0", "metrik_1", "metrik_2"]
colors = ['#66c2a5', '#fc8d62', '#8da0cb'] #Set2 palette hexes

for i, metric in enumerate(metriks):
    
    df[[metric]].plot(ax=ax[i], color=colors[i], legend=None)
    ax[i].set_ylabel(metric)

    ax[i].fill_betweenx(y=[-3, 3], x1="2020-01-04 05:00:00",
                        x2="2020-01-05 16:00:00", color='gray', alpha=0.2)
    ax[i].fill_betweenx(y=[-3, 3], x1="2020-01-04 15:00:00",
                        x2="2020-01-05 00:00:00", color='gray', alpha=0.4)
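
The fixed band positions above can instead be read from a labels frame inside the loop. The following is only a sketch under a few assumptions: it swaps `fill_betweenx` for `ax.axvspan`, which spans the full height of the axis, it uses a hypothetical helper `shade_markers` with a hypothetical `pad_hours` window for markers that have no end, and it plots a single synthetic device with `ax.plot` so that the x-axis uses matplotlib's own date units.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def shade_markers(axes, markers, colors, pad_hours=6, alpha=0.3):
    """Draw one translucent vertical band per marker row on every axis."""
    pad = pd.Timedelta(hours=pad_hours)
    patches = []
    for row in markers.itertuples(index=False):
        if pd.isna(row.end):
            # Marker without an end: shade a window around its start (assumption).
            start, end = row.start - pad, row.start + pad
        else:
            start, end = row.start, row.end
        for ax in axes:
            patches.append(ax.axvspan(start, end, color=colors[row.marker_type], alpha=alpha))
    return patches

# Synthetic stand-in data: one device, one week of hourly values.
index = pd.date_range("2020-01-01", periods=7 * 24, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(47)
df = pd.DataFrame(rng.standard_normal((len(index), 3)), index=index,
                  columns=["metrik_0", "metrik_1", "metrik_2"])

marker_labels = pd.DataFrame({
    "marker_type": ["a", "b", "a"],
    "start": [pd.Timestamp("2020-01-02 00:00"),
              pd.Timestamp("2020-01-04 05:00"),
              pd.Timestamp("2020-01-06 00:00")],
    "end": [pd.NaT, pd.Timestamp("2020-01-05 16:00"), pd.NaT],
})
marker_colors = {"a": "tab:red", "b": "tab:blue"}  # one colour per marker type

fig, axes = plt.subplots(3, figsize=(12, 6), sharex=True)
for ax, metric in zip(axes, df.columns):
    ax.plot(df.index, df[metric])
    ax.set_ylabel(metric)
patches = shade_markers(axes, marker_labels, marker_colors, pad_hours=6)
```

Because each band is drawn with `alpha`, overlapping markers remain visible as darker regions.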


gherka
  • Could this (fill between) also be vectorized to take a whole list of start/end values, or is explicit iteration with a for loop required? – Georg Heiler Oct 14 '20 at 16:23
  • I don't think so. The [docs](https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.axes.Axes.fill_betweenx.html) say that it fills the space between two curves, not between an array of curves. – gherka Oct 14 '20 at 16:27
  • Furthermore, you are currently plotting the data from all the devices on top of one another (for a single category). Would it be possible to use different line styles for these categories? – Georg Heiler Oct 14 '20 at 16:39
  • Sure. You can see the full list of possible line styles [here](https://matplotlib.org/gallery/lines_bars_and_markers/line_styles_reference.html). – gherka Oct 14 '20 at 16:43
  • I know that seaborn has a parameter called: `style`, but so far do not see how it translates over: would I need to manually iterate over all of the devices to change their line styles? – Georg Heiler Oct 15 '20 at 12:51
  • So far, when using seaborn I can plot https://gist.github.com/geoHeil/67911bfad9fe079b9a342431448ec4fe. Furthermore, axvspan seems more suitable as the shading is applied to the full height of the axis – Georg Heiler Oct 15 '20 at 13:21
  • @GeorgHeiler You can easily apply the shading to the full extent of the y-axis by running `ax[i].set_ylim(*ax[i].get_ylim())` just after the plot is created and editing the y argument of `fill_betweenx` like this: `ax[i].fill_betweenx(y=[*ax[i].get_ylim()], ...)`. See [this similar answer](https://stackoverflow.com/a/66052245/14148248). – Patrick FitzGerald Feb 04 '21 at 21:38
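
The approach from the last comment can be sketched as follows, on synthetic single-series data (the values and the band position are placeholders). Freezing the y-limits first keeps the band from rescaling the axis.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder data: 48 hourly random values.
index = pd.date_range("2020-01-04", periods=48, freq=pd.Timedelta(hours=1))
values = np.random.default_rng(47).standard_normal(len(index))

fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(index, values)

# Freeze the automatically chosen y-limits before adding the band.
ax.set_ylim(*ax.get_ylim())
y_low, y_high = ax.get_ylim()

# The band now covers the full visible height of the axis.
band = ax.fill_betweenx(y=[y_low, y_high],
                        x1=pd.Timestamp("2020-01-04 05:00"),
                        x2=pd.Timestamp("2020-01-05 16:00"),
                        color="gray", alpha=0.2)
```

`ax.axvspan(x1, x2, ...)` gives a similar result without the need to touch the y-limits.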