Python - finding pattern in a plot

Question

This graph is generated by the following gnuplot script. The estimated.csv file is found in this link: https://drive.google.com/open?id=0B2Iv8dfU4fTUaGRWMm9jWnBUbzg

# ###### GNU Plot
   set style data lines
   set terminal postscript eps enhanced color "Times" 20

   set output "cubic33_cwd_estimated.eps"

   set title "Estimated signal"

    set style line 99 linetype 1 linecolor rgb "#999999" lw 2
    #set border 1 back ls 11
    set key right top
    set key box linestyle 50
    set key width -2
    set xrange [0:10]
    set key spacing 1.2
    #set nokey

    set grid xtics ytics mytics
    #set size 2
    #set size ratio 0.4

    #show timestamp
    set xlabel "Time [Seconds]"
    set ylabel "Segments"

    set style line 1 lc rgb "#ff0000" lt 1 pi 0 pt 4 lw 4 ps 0

    # Congestion control send window

    plot  "estimated.csv" using ($1):2 with lines title "Estimated";

I wanted to find the pattern of the estimated signal of the previous plot something close to the following plot. My ground truth (actual signal is shown in the following plot)

Here is my initial approach

#!/usr/bin/env python
import sys

import numpy as np
from shapely.geometry import LineString
#-------------------------------------------------------------------------------
def load_data(fname):
    return LineString(np.genfromtxt(fname, delimiter = ','))
#-------------------------------------------------------------------------------
lines = list(map(load_data, sys.argv[1:]))

for g in lines[0].intersection(lines[1]):
    if g.geom_type != 'Point':
        continue
    print('%f,%f' % (g.x, g.y))

Then invoke this python script in my gnuplot directly as in the following:

set terminal pngcairo
set output 'fig.png'

set datafile separator comma
set yr [0:700]
set xr [0:10]

set xtics 0,2,10
set ytics 0,100,700

set grid

set xlabel "Time [seconds]"
set ylabel "Segments"

plot \
    'estimated.csv' w l lc rgb 'dark-blue' t 'Estimated', \
    'actual.csv' w l lc rgb 'green' t 'Actual', \
    '<python filter.py estimated.csv actual.csv' w p lc rgb 'red' ps 0.5 pt 7 t ''

which gives us the following plot. But this does not seem to give me the right pattern as gnuplot is not the best tool for such tasks.

Is there any way where we can find the pattern of the first graph (estimated.csv) by forming the peaks into a plot using python? If we see from the end, the pattern actually seems to be visible. Any help would be appreciated.

Have you tried a Kalman filter, it should follow the curve the way you want it. — Tbaki, Jun 09 '17 at 13:35
Basically, it tries to "follow" your curve with a speed, so it will smooth your signal, but now that i think of it it won't work in your case. :/ It's great to eliminate noise and find the "True" signal, but won't help much in your case, still interresting to look in case you need it in the future. https://en.wikipedia.org/wiki/Kalman_filter#/media/File:Kalman.png — Tbaki, Jun 09 '17 at 13:50
Some ideas through, you could use some peak detection algorithm, then use a clustering algorith like DBSCAN to eliminate oulier, and finally a Kalman Filter to smooth it all. :D https://stackoverflow.com/a/22583761/6341054 http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html http://scipy-cookbook.readthedocs.io/items/KalmanFiltering.html — Tbaki, Jun 09 '17 at 14:10

Franz · Accepted Answer · 2017-06-12T07:53:19.933

I think pandas.rolling_max() is the right approach here. We are loading the data into a DataFrame and calculate the rolling maximum over 8500 values. Afterwards the curves look similar. You may test with the parameter a little bit to optimize the result.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
plt.ion()
names = ['actual.csv','estimated.csv']
#-------------------------------------------------------------------------------
def load_data(fname):
    return np.genfromtxt(fname, delimiter = ',')
#-------------------------------------------------------------------------------

data = [load_data(name) for name in names]
actual_data = data[0]
estimated_data = data[1]
df = pd.read_csv('estimated.csv', names=('x','y'))
df['rolling_max'] = pd.rolling_max(df['y'],8500)
plt.figure()
plt.plot(actual_data[:,0],actual_data[:,1], label='actual')
plt.plot(estimated_data[:,0],estimated_data[:,1], label='estimated')
plt.plot(df['x'], df['rolling_max'], label = 'rolling')

plt.legend()
plt.title('Actual vs. Interpolated')
plt.xlim(0,10)
plt.ylim(0,500)
plt.xlabel('Time [Seconds]')
plt.ylabel('Segments')
plt.grid()
plt.show(block=True)

To answer the question from the comments:

Since pd.rolling() is generating defined windows of your data, the first values will be NaN for pd.rolling().max. To replace these NaNs, I suggest to turn around the whole Series and to calculate the windows backwards. Afterwards, we can replace all the NaNs by the values from the backwards calculation. I adjusted the window length for the backwards calculation. Otherwise we get erroneous data.

This code works:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
plt.ion()

df = pd.read_csv('estimated.csv', names=('x','y'))
df['rolling_max'] = df['y'].rolling(8500).max()
df['rolling_max_backwards'] = df['y'][::-1].rolling(850).max()
df.rolling_max.fillna(df.rolling_max_backwards, inplace=True)
plt.figure()
plt.plot(df['x'], df['rolling_max'], label = 'rolling')

plt.legend()
plt.title('Actual vs. Interpolated')
plt.xlim(0,10)
plt.ylim(0,700)
plt.xlabel('Time [Seconds]')
plt.ylabel('Segments')
plt.grid()
plt.show(block=True)

And we get the following result:

Hi Franz. Thank you so much. Don't use the `actual.csv` file. It is my ground truth and it should not be feed to the program. The pattern has be detected ONLY from `estimated.csv`. — Desta Haileselassie Hagos, Jun 09 '17 at 14:59
Its not used for the calculation. I just used it to show the similarity. If you like to, remove the lines for the `actual.csv` — Franz, Jun 09 '17 at 15:29
I added a solution to the code to deal with the missing values. This is not as clean as I wished it to be, but it works. — Franz, Jun 12 '17 at 07:54
Hi Franz, have you worked with `Kalman Filters` before? I was wondering if we can try to fit this very same data with `Kalman Filter` models. Thanks! — Desta Haileselassie Hagos, Jun 22 '17 at 05:55
Hi Desta, unfortunately I'm not really into Kalman Filters. Ic just read about them but didn't need them so far. Have you tried something like `pykalman`: https://pykalman.github.io ? — Franz, Jun 22 '17 at 10:47
Hi Desta, I'm not a statistics expert. I'm sorry but I can't answer your question. — Franz, Jun 26 '17 at 14:49

Python - finding pattern in a plot

1 Answers1