
I want to include a plot in my thesis (the document will be a standard A4-page PDF) for which I have two time series, both of continuous values expressed as percentages.

Both time series cover one year without Sundays, so roughly 310 data points each.

I tried to come up with something like this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# day_agg_plan_temp: DataFrame with 'Date', 'label' (ground truth) and 'pred' (prediction) columns
ts = day_agg_plan_temp.set_index('Date')
ts = ts['2018-01-01': '2019-01-01']

plt.figure(figsize=(20,15))


ax1 = ts.label.plot(grid=True, label='Ground Truth', marker='.')
ax2 = ts.pred.plot(grid=True, label='Prediction', marker='.')

plt.legend()
plt.show()

resulting in this:

[figure: the resulting plot of both time series over the year]

This is not really appealing: there is too much going on, and I want to point out the difference between the blue and orange lines at each data point.

So my question is: is there a better way to do this, other than shrinking the date range? (Which I really want to avoid, because this plot is already just a snippet of the actual time series, which covers almost three years.)

TheDude
  • Since this question is about data analysis and visualization, and not about programming, it is probably more suitable for [stats.SE](https://stats.stackexchange.com/). – Nils Werner Jul 30 '19 at 08:51
  • @NilsWerner I thought about putting it there but eventually I saw matplotlib related questions here and since I would like to have suggestions with code, it might be fitting? Not sure... – TheDude Jul 30 '19 at 09:02
  • @Tiendung if by downsampling you mean leaving out data points... that's exactly what I want to avoid. – TheDude Jul 30 '19 at 09:02
  • A better way is to [smooth](https://stackoverflow.com/questions/20618804/how-to-smooth-a-curve-in-the-right-way) it. And yes, your data are changed, but as you said, you can't display all of it in a good-looking way, so you have to make a trade-off. – AcaNg Jul 30 '19 at 09:17
  • "I want to point out the difference between the blue and orange lines at each data point": why not make a separate plot of their difference? – AcaNg Jul 30 '19 at 09:24
  • My advice is to plot your raw data in very light, pale colors and a trend curve (there are a bunch of algorithms to achieve this) in a stronger color. The raw data will still be there, but it will look like it's in the background. The strong color of the trend (smooth curve) will be the first thing the viewer looks at. – armatita Jul 30 '19 at 09:27
  • @Tiendung since it's a plot in the motivational chapter, I don't know if a plot containing only the differences between the two time series is really intuitive for the reader. I feel it might be a bit out of context, whereas if you plot the actual data points and the predicted data points, it might be more obvious that something is going on. – TheDude Jul 30 '19 at 12:51
  • @armatita I like the smoothing idea, but I think plotting the original data in the same plot makes things worse again. However, I'm not sure if I understood you correctly. This is what I came up with: https://imgur.com/a/MitE2KH Without the pastel lines (which I like better): https://imgur.com/a/z203rt6 Would it still be correct to show only a smoothed version, from a scientific and statistical point of view? – TheDude Jul 30 '19 at 12:57
  • @TheDude I meant the colors used should make the trend be easier to see than the rest. Give me a minute to come up with an example and I'll put it in an answer. – armatita Jul 30 '19 at 13:12
  • @TheDude yeah, a plot which contains only the difference is not enough, but here I mean plotting 2 subplots (one for the raw data, the other for the difference), and maybe a 3rd one for the "trend curve". If the vertical values are not changing too much, a plot with 2 (or 3) vertically aligned subplots is possible on a landscape A4. – AcaNg Jul 31 '19 at 00:56
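
As a reference for the subplot idea from the comments, here is a minimal sketch with the raw series on top and the per-day difference below. The DataFrame ts is a made-up stand-in for day_agg_plan_temp; only the 'label' and 'pred' column names and the no-Sundays date range are taken from the question.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for the OP's data: a date index without Sundays and the
# 'label' (ground truth) and 'pred' (prediction) columns, both in percent.
dates = pd.date_range('2018-01-01', '2018-12-31', freq='D')
dates = dates[dates.dayofweek != 6]
walk = np.random.default_rng(0).normal(0, 1, (len(dates), 2)).cumsum(axis=0)
ts = pd.DataFrame(50 + walk, index=dates, columns=['label', 'pred'])

fig, (ax_top, ax_bot) = plt.subplots(2, 1, sharex=True, figsize=(11, 7))

# Top panel: both raw series, kept thin so the overlap stays readable
ts['label'].plot(ax=ax_top, label='Ground Truth', linewidth=0.8)
ts['pred'].plot(ax=ax_top, label='Prediction', linewidth=0.8)
ax_top.grid(True)
ax_top.legend()

# Bottom panel: the per-day difference, which is what the reader should compare
(ts['label'] - ts['pred']).plot(ax=ax_bot, color='gray', linewidth=0.8)
ax_bot.axhline(0, color='black', linewidth=0.5)
ax_bot.set_ylabel('label - pred')
ax_bot.grid(True)

plt.tight_layout()
plt.show()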

1 Answer


Here is some code that generates data using fractional Brownian motion, calculates a trend using a Savitzky–Golay filter (but use whatever is best for your case study), and plots everything in a way that lets the reader see the original data and the trend clearly at the same time.

from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter

# Generating some Random Data
def brownian(x0, n, dt, delta, out=None):
    x0 = np.asarray(x0)
    r = norm.rvs(size=x0.shape + (n,), scale=delta * np.sqrt(dt))
    if out is None:
        out = np.empty(r.shape)
    np.cumsum(r, axis=-1, out=out)
    out += np.expand_dims(x0, axis=-1)
    return out

delta = 2
T = 10.0
N = 500
dt = T/N
m = 2
x = np.empty((m,N+1))
x[:, 0] = 50
brownian(x[:,0], N, dt, delta, out=x[:,1:])
t = np.linspace(0.0, N*dt, N+1)

# Obtaining the trend using some arbitrary filter
y1 = savgol_filter(x[0], 51, 3)
y2 = savgol_filter(x[1], 51, 3)

# Plotting the raw data (transparent)
plt.plot(t, x[0], color="red", alpha=0.2)
plt.plot(t, x[1], color="blue", alpha=0.2)

# Plotting the trend data (opaque)
plt.plot(t, y1, color="red")
plt.plot(t, y2, color="blue")

# Calling the plot
plt.show()

The result is this:

[figure: raw and smoothed data in the same plot]

My point is that by playing with the colors (or transparency) you can make some data appear as if it were in the background and other data (usually the most relevant) appear in the foreground. It's a UX technique (like blurring, darkening, or making the background paler).

You can also play with the line width (or style) if the vertical variability of the data is not enough to clearly separate the sets. In your case I don't think it will be necessary.
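
Applied to a pandas time series like the one in the question, the same idea could look roughly like this. It is only a sketch: ts is a made-up stand-in for the OP's DataFrame (only the 'label' and 'pred' column names come from the question), and a centered 7-day rolling mean stands in for the Savitzky–Golay filter purely to keep the example short.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for the OP's DataFrame: date index, 'label' and 'pred' columns
dates = pd.date_range('2018-01-01', '2018-12-31', freq='D')
dates = dates[dates.dayofweek != 6]
walk = np.random.default_rng(1).normal(0, 1, (len(dates), 2)).cumsum(axis=0)
ts = pd.DataFrame(50 + walk, index=dates, columns=['label', 'pred'])

fig, ax = plt.subplots(figsize=(11, 6))

# Raw data in transparent colors so it recedes into the background
ax.plot(ts.index, ts['label'], color='tab:blue', alpha=0.25, linewidth=0.8)
ax.plot(ts.index, ts['pred'], color='tab:orange', alpha=0.25, linewidth=0.8)

# Trend (here a centered rolling mean) in the same, but opaque, colors
ax.plot(ts.index, ts['label'].rolling(7, center=True).mean(),
        color='tab:blue', linewidth=2, label='Ground Truth (trend)')
ax.plot(ts.index, ts['pred'].rolling(7, center=True).mean(),
        color='tab:orange', linewidth=2, label='Prediction (trend)')

ax.grid(True)
ax.legend()
plt.show()

Only the colors, alpha values, and line widths matter for the foreground/background separation; any smoother, including the Savitzky–Golay filter above, can provide the trend.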

armatita
  • Your smoothed plot is even smoother than the OP's and also looks good. It does not need to present accurate data, and yes, "trend curve" is a perfect name for it. – AcaNg Jul 31 '19 at 00:31