
I have the following dataframe:

date    Values
3/1/2018    
3/3/2018    0
3/5/2018    -0.011630952
3/8/2018    0.024635792
3/10/2018   
3/10/2018   0.013662755
3/13/2018   2.563770771
3/15/2018   0.026081264
3/17/2018   
3/25/2018   4.890818119
3/26/2018   
3/28/2018   0.994944572
3/30/2018   0.098569691
4/2/2018    
4/2/2018    2.261398315
4/4/2018    2.595984459
4/7/2018    2.145072699
4/9/2018    2.401818037
4/11/2018   
4/12/2018   2.233839989
4/14/2018   2.179880142
4/17/2018   0.173141539
4/18/2018   
4/19/2018   0.04037559
4/22/2018   2.813424349
4/24/2018   2.764060259
4/27/2018   
5/2/2018    4.12789917
5/4/2018    4.282546997
5/4/2018    
5/7/2018    5.083333015
5/13/2018   
5/14/2018   1.615991831
5/17/2018   0.250209153
5/19/2018   5.003758907
5/20/2018   
5/22/2018   
5/24/2018   0.177665412
5/29/2018   
6/1/2018    3.190019131
6/3/2018    3.514900446
6/5/2018    2.796386003
6/6/2018    4.132686615
6/8/2018    
6/11/2018   2.82530117
6/14/2018   
6/16/2018   1.786619782
6/18/2018   
6/21/2018   1.60535562
6/21/2018   1.737388611
6/23/2018   0.048161745
6/26/2018   1.811254263
6/28/2018   0.109187543
6/30/2018   
7/1/2018    0.086753845
7/3/2018    2.141263962
7/6/2018    1.116563678
7/7/2018    1.159829378
7/8/2018    0.107431769
7/11/2018   -0.001963556
7/13/2018   
7/16/2018   
7/16/2018   0.071490705
7/18/2018   1.052834034
7/21/2018   
7/23/2018   
7/23/2018   1.201774001
7/28/2018   0.218167484
7/31/2018   0.504413128
8/1/2018    
8/2/2018    
8/5/2018    1.057194233
8/7/2018    0.85014987
8/8/2018    1.183927178
8/10/2018   1.226516366
8/12/2018   1.533656836
8/15/2018   
8/17/2018   
8/17/2018   1.355006456
8/20/2018   1.490438223
8/22/2018   
8/24/2018   1.160542369
8/25/2018   1.546550632
8/27/2018   
8/30/2018   

which looks like so:

[plot of the raw data: Values over date]

I want to filter out all the troughs between the peaks if the distance between the peaks is less than 14 days. E.g. I want to filter out the low values between the peaks at 5/7/2018 and 5/19/2018 and replace those values by NaNs. There are a lot of scipy filters which can help with smoothing, but I am not sure how to remove the troughs based on the condition I specified. Output should look something like this (if we fit a curve after removing the troughs):

[sketch of the desired output: a smooth single-peaked curve after the troughs are removed]

Based on @Asmus's suggestions, I expect to have one peak in the final results, therefore a Gaussian distribution might be best (with emphasis on might).
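
To make the intent concrete, here is a minimal sketch of what I mean (assuming scipy's find_peaks picks up the relevant peaks; all names here are for illustration only):

import numpy as np
import pandas as pd
from scipy.signal import find_peaks

# df has the two columns shown above: 'date' (parsed as datetime) and 'Values'
vals = df.set_index('date')['Values'].interpolate()  # find_peaks cannot handle NaNs
peak_dates = vals.index[find_peaks(vals.values)[0]]

# blank out everything strictly between two peaks that are less than 14 days apart
out = df.set_index('date').copy()
for d1, d2 in zip(peak_dates[:-1], peak_dates[1:]):
    if (d2 - d1) < pd.Timedelta('14d'):
        out.loc[(out.index > d1) & (out.index < d2), 'Values'] = np.nan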

user308827
  • Can you just show us some expected output based on the sample data? Also, what do you mean by peaks here? – BENY Apr 21 '19 at 16:47
  • @Wen-Ben, any peak identified by the find_peaks command works. I understand that can lead to many different possible peaks. – user308827 Apr 22 '19 at 02:14
  • Just a random thought: I would have done exactly what you did to show the desired output (i.e. I would have removed the troughs and interpolated only the remaining observations). What is not working with this approach? – CAPSLOCK Apr 23 '19 at 16:48
  • I think you need to look at [`scipy.signal`](https://docs.scipy.org/doc/scipy/reference/tutorial/signal.html) – Scott Boston Apr 23 '19 at 16:52
  • @Gio, I did not remove the troughs. I just manually drew what I thought the interpolated curve should look like once the troughs are removed – user308827 Apr 24 '19 at 15:16
  • 1
    @user308827 Impressive work! =) – CAPSLOCK Apr 24 '19 at 15:33
  • Could you please clarify your operational definition of a peak and a trough? Is a peak a global max value? Is a trough the local minimum between two peaks? – VanBantam Apr 24 '19 at 17:52
  • I would look at the following post because it discusses some existing functions as well as some important concepts relating to defining peaks. https://stackoverflow.com/questions/1713335/peak-finding-algorithm-for-python-scipy – VanBantam Apr 24 '19 at 18:12
  • @user308827 - what are the trough values you're wanting to remove? Are they within a deviation from 0 or outside of a standard deviation? – brddawg Apr 25 '19 at 22:30
  • I've updated my answer with data rejection and single-peak-fitting, as requested :) – Asmus Apr 26 '19 at 11:08

2 Answers


Important note: Because this answer was quite long already, I've decided to rewrite it completely instead of updating it a fifth time. Check out the version history if you're interested in the "historical context".


First, run some required imports:

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from matplotlib import gridspec
import matplotlib as mpl
mpl.style.use('seaborn-paper') ## for nicer looking plots only

from lmfit import fit_report
from lmfit.models import GaussianModel, BreitWignerModel

Then clean up the data (as posted above, saved to a whitespace-delimited text file):

df=pd.read_csv('pdFilterDates.txt',delim_whitespace=True) ## data as given above
df['date'] = pd.to_datetime(df['date'],format = '%m/%d/%Y')

## initial cleanup
df=df.dropna() ## clean initial NA values, e.g. 3/10/2018

## there is a duplicate at datetime.date(2018, 6, 21):
# print(df['date'][df.date.duplicated()].dt.date.values) 
df=df.groupby('date').mean().reset_index() ## so we take a mean value here
# print(df['date'][df.date.duplicated()].dt.date.values) ## see, no more duplicates

df = df.set_index('date',drop=False) ## now we can set date as index

and re-index it on a daily frequency:

complete_date_range_idx = pd.date_range(df.index.min(), df.index.max(),freq='D') 
df_filled=df.reindex(complete_date_range_idx, fill_value=np.nan).reset_index()
## obtain index values, which can be understood as time delta in days from the start

idx=df_filled.index.values ## this will be used again, in the end

## now we obtain (x,y) on basis of idx
not_na=pd.notna(df_filled['Values'])
x=idx[not_na]     ## len: 176 
y=df_filled['Values'][not_na].values

### let's write over the original df
df=df_filled 
#####

And now for the interesting part: fitting the data with some asymmetric lineshape (Breit-Wigner-Fano) and remove the "outliers" that lie below a certain threshold. We do this first by manually declaring where this peak is supposed to be (our initial guess, we can remove 3 points), then we do it again by using the fit (fit 1) as input (removing 8 points) and finally obtain our final fit.

As requested, we can now interpolate the fit back on the daily index that we have created before (bwf_result_final.eval(x=idx)) and create additional columns in the dataframe: y_fine, which holds only the fit, y_final, which holds the final point cloud (i.e. after the outlier removal), and a joined dataset (which looks "jagged") y_joined. Finally, we can plot this on basis of the "fine" data range (df['index']).

Figure 1: iteratively removing outliers

Figure 2: cleaned up dataset

# choose an asymmetric line shape (Fano resonance)
bwf_model = BreitWignerModel()

# make initial guesses: 
params = bwf_model.make_params(center=75, amplitude=0.2, sigma=20, q=1/0.2)

# run the fit (both the initial guess and the fit result are plotted below)
bwf_result = bwf_model.fit(y, params, x=x)

####------------------ create first figure----------
fig=plt.figure(figsize=(8,3),frameon=True,)
gs1 = gridspec.GridSpec(1,3,
    left=0.08,right=0.95,
    bottom=0.15,top=0.9,
    wspace=0.1
)

a1=plt.subplot(gs1[0])
a2=plt.subplot(gs1[1])
a3=plt.subplot(gs1[2])


#------------------ first subplot ------------------
a1.set_title('Outliers from 1st guess')
## show initial x,y
a1.scatter(x,y,facecolors='None',edgecolors='b',marker='o',linewidth=1,zorder=3)

# outliers=np.argwhere(np.abs(y-bwf_result.init_fit)>1.9) ## if you want to exclude points both above and below
outliers=np.argwhere(( bwf_result.init_fit -y ) >1.9)

# remove outliers from point cloud
x_new=np.delete(x,outliers)
y_new=np.delete(y,outliers)

#### run a fit on the "cleaned" dataset
bwf_result_mod = bwf_model.fit(y_new, params, x=x_new)

a1.plot(x, bwf_result.init_fit, 'r--',label='initial guess')
a1.fill_between(x, bwf_result.init_fit, bwf_result.init_fit-1.9, color='r', hatch='///',alpha=0.2,zorder=1,label=u'guess - 1.9')
a1.scatter(x[outliers],y[outliers],c='r',marker='x',s=10**2,linewidth=1,zorder=4,label='outliers') ## show outliers
a1.plot(x_new, bwf_result_mod.best_fit, color='g',label='fit 1')

pointsRemoved=len(y)-len(y_new)
a1.text(1.05,0.5,u'↓{0} points removed'.format(pointsRemoved),ha='center',va='center',rotation=90,transform=a1.transAxes)

#------------------ second plot ------------------
a2.set_title('Outliers from 1st fit')
## show initial x,y
a2.scatter(x,y,facecolors='None',edgecolors='grey',marker='o',linewidth=.5,zorder=0,label='original data')

a2.scatter(x_new,y_new,facecolors='None',edgecolors='b',marker='o',linewidth=1,zorder=3)
a2.plot(x_new, bwf_result_mod.best_fit, color='g',label='fit 1')

# new_outliers=np.argwhere(np.abs(bwf_result_mod.residual)>0.8) ## if you want to exclude points both above and below
new_outliers=np.argwhere( bwf_result_mod.residual >0.8)

x_new_2=np.delete(x_new,new_outliers)
y_new_2=np.delete(y_new,new_outliers)

a2.scatter(x_new[new_outliers],y_new[new_outliers],c='r',marker='x',s=10**2,linewidth=1,zorder=4,label='new outliers')
a2.fill_between(x_new, bwf_result_mod.best_fit, bwf_result_mod.best_fit-0.8, color='r', hatch='///',alpha=0.2,zorder=1,label=u'fit - 0.8')

pointsRemoved=len(y_new)-len(y_new_2)
a2.text(1.05,0.5,u'↓{0} points removed'.format(pointsRemoved),ha='center',va='center',rotation=90,transform=a2.transAxes)

#------------------ third plot ------------------
_orig=len(y)
_remo=(len(y)-len(y_new_2))
_pct=_remo/(_orig/100.)
a3.set_title(u'Result ({0} of {1} removed, ~{2:.0f}%)'.format(_remo,_orig,_pct))

x_final=np.delete(x_new,new_outliers) ## note: identical to x_new_2/y_new_2 above
y_final=np.delete(y_new,new_outliers)

## store final point cloud in the df
df.loc[x_final,'y_final']=y_final

a3.scatter(x_final,y_final,facecolors='None',edgecolors='b',marker='o',linewidth=1,zorder=3)

## make final fit:
bwf_result_final = bwf_model.fit(y_final, params, x=x_final)
a3.scatter(x,y,facecolors='None',edgecolors='grey',marker='o',linewidth=.5,zorder=0,label='original data')

a3.plot(x_final, bwf_result_final.best_fit, color='g',label='fit 2')

## now that we are "happy" with bwf_result_final, let's apply it on the df's "fine" (i.e. daily) index!
y_fine=bwf_result_final.eval(x=idx)
## 
df['y_fine']=y_fine # store fit function
df['y_joined']=df['y_final'] # store final point cloud
df.loc[df['y_final'].isnull(),'y_joined']=df['y_fine'] # join fit function points with final point cloud (avoids chained assignment)

####------------------ create second figure----------

fig2=plt.figure(figsize=(8,3),frameon=True,)
gs2 = gridspec.GridSpec(1,1,
    left=0.08,right=0.95,
    bottom=0.15,top=0.9,
    wspace=0.1
)
ax2=plt.subplot(gs2[0])


ax2.scatter(df['date'],df['Values'],facecolors='None',edgecolors='grey',marker='o',linewidth=1,zorder=0,label='original data')
ax2.plot(df['index'],df['y_fine'],c="g",zorder=3,label="final fit applied to all dates")
ax2.plot(df['index'],df['y_joined'],color="r",marker=".",markersize=6,zorder=2,label="(points-outliers) +fit ")

# print(df.head(30))

for a in [a1,a2,a3,ax2]:
    a.set_ylim(-.5,7)
    a.legend()

a1.set_ylabel('Value')
ax2.set_ylabel('Value')

for a in [a2,a3]:
    plt.setp(a.get_yticklabels(),visible=False)
for a in [a1,a2,a3,ax2]:
    a.set_xlabel('Days from start')

fig.savefig('outlier_removal.pdf')
fig2.savefig('final_data.pdf')
plt.show()
Asmus
  • thanks @Asmus, I will update question based on your suggestions – user308827 Apr 25 '19 at 15:41
  • Very good answer, Asmus! Just a comment: once you have the initial guess line of the Fano model, why not measure distances from it to the points below and reject those that are a certain distance away? That seems like a better approach than applying a static mask; at least, more "scientific". – Haritz Laboa Apr 26 '19 at 13:10
  • @HaritzLaboa I agree, several careful iterations of outlier removal + fitting would be best. But let's be honest, the whole approach is pretty unscientific :-) We don't know what these values represent, or how they "should behave" over time, i.e. which model to use. I could have likely fitted a sine curve and gotten away with similar residuals. Consequently, we have no way of weighting the credibility of certain points, thus one may as well manually remove outliers based on "looks" alone. Note that the initial guess is completely manual, thus one may have already stopped there. – Asmus Apr 26 '19 at 13:25
  • @HaritzLaboa By the way, if you want to see what your idea would look like, you can find outliers with: `outliers=np.argwhere(np.abs(y-bwf_result.init_fit)>1.9)` and then e.g. plot them: `ax2.scatter(x[outliers],y[outliers],c="r",marker="x",zorder=4)`. – Asmus Apr 26 '19 at 13:39
  • @HaritzLaboa … and I couldn't resist trying your approach, see **Update 2** above. :-) – Asmus Apr 26 '19 at 14:22
  • So nice @Asmus!! Clap clap!! As a final suggestion (XD): we could also take care of just the points beneath the line, so as to preserve the upper ones that (according to the asker) are not considered outliers ;-) – Haritz Laboa Apr 26 '19 at 14:34
  • @HaritzLaboa Sure, why not (see above)! *Footnote:* why should the outliers not be normally distributed, i.e. outside a ±envelope? – Asmus Apr 26 '19 at 14:40
  • You got it!!! I think that's what he/she was looking for. *Footnote response: no idea, ask @user308827 XD* – Haritz Laboa Apr 26 '19 at 14:49
  • thanks @Asmus, this looks great. One comment, I do not want to exclude the data points like you do here: `df=df.dropna() ## clean initial NA values, e.g. 3/10/2018`. Is there some way to retain them in your final solution? I usually interpolate them initially before the fitting, but it doesn't lead to great results – user308827 Apr 26 '19 at 15:05
  • @user308827 How are you interpolating them right now, manually? Because *in theory*, you could first reindex via `df=df.reindex(pd.date_range(df.index.min(), df.index.max(),freq='D'), fill_value=np.nan)` and then use `df['Values']=df['Values'].fillna(df['Values'].interpolate())` to interpolate them linearly; but your data is not following linear trends, is it? I mean, once you have a fit, you can interpolate data from it, but if you do it beforehand, you're implying you know the "temporal behaviour" of the data, hence you wouldn't need a fit?! Kind of a chicken-and-egg situation here. – Asmus Apr 26 '19 at 15:22
  • @Asmus, that is a great point. I interpolate them before the fit using a similar strategy to what you just wrote. It does lead to the issues that you point out; can your solution be altered to fill in the data AFTER you have removed the outliers in UPDATE 3? – user308827 Apr 26 '19 at 15:51
  • @user308827 Do you mean you want to: (a) extrapolate the green line (right panel in Update 3) onto the whole data range on a daily basis (giving you a "perfectly" smooth result with small jumps where your original data is) or rather (b) disregard the fit, interpolate the remaining 41 points linearly (!), likely giving you a very "jagged" result? – Asmus Apr 26 '19 at 16:04
  • @Asmus, I mean I want (a) – user308827 Apr 26 '19 at 16:21
  • @user308827 I've updated my answer again, this time completely rewriting it to make it "readable" again. Hope you like it! – Asmus Apr 27 '19 at 13:03
  • thanks @Asmus, one question, in `outliers=np.argwhere(( bwf_result.init_fit -y ) > 1.9)` how did you estimate 1.9? and how can it be extended to other cases? – user308827 Apr 27 '19 at 16:13
  • @user308827 I only "guesstimate" this number (as the 0.8 in the next iteration), but you could e.g. use a histogram (like `plt.hist(np.abs(y-bwf_result.init_fit),bins=np.arange(0,4.5,0.05))`) to verify how many points you're about to remove and "choose wisely", i.e. try to only remove extreme outliers. – Asmus Apr 27 '19 at 16:17

Try this:

from scipy.signal import find_peaks

# first find the peaks
# interpolating first is important: find_peaks cannot handle NaNs
peaks = (find_peaks(df.set_index('date').interpolate()
         .reset_index().Values, rel_height=0.1)[0])

# copy the peaks' dates for easy manipulation
peak_df = df.loc[peaks, ['date']].copy()

# mark where a peak lies within 14 days of the previous peak
markers = (peak_df.date - peak_df.date.shift()).le(pd.Timedelta('14d'))

# align the marker series to df's full index before using it as a boolean
# indexer; indexing df with a series built on peak_df's index raises
# "Unalignable boolean Series provided as indexer"
markers = markers.reindex(df.index)

df[markers.notnull()               # keep the peaks themselves
   | (~markers.bfill().eq(False))] # plus rows whose next peak is flagged as close (and rows after the last peak)
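
Since the question asks to replace the trough values with NaNs rather than drop the rows, here is a sketch of that variant (keep is a name introduced for illustration, and it assumes the boolean logic above matches the intended selection):

import numpy as np

# build the same boolean mask explicitly, then blank out its complement
keep = markers.notnull() | (~markers.bfill().eq(False))
df_out = df.copy()
df_out.loc[~keep, 'Values'] = np.nan  # troughs between close peaks become NaN; the dates remain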
Quang Hoang
  • Once I got your code snippet working, I found that `(peak_df.date - peak_df.date.shift())` (obviously) gives the distance between what `find_peaks()` considers as peaks (that's 17 peaks in my code, not sure if that is necessarily the same "peak" as described in the question). How would you then "filter out the **low values** between the peaks at 5/7/2018 and 5/19/2018"? Also: doesn't `find_peaks` require `width=..`? – Asmus Apr 25 '19 at 10:32
  • I'm not really sure how it works. I just put it there as an example; you should modify it to your needs. – Quang Hoang Apr 25 '19 at 11:47
  • @QuangHoang, somehow I am not able to get your code snippet running. It fails on the last line of code. Does it run at your end? – user308827 Apr 25 '19 at 16:00
  • @user308827 Yes, I tried it with the sample data. What's the error that you get? – Quang Hoang Apr 25 '19 at 16:01
  • @Asmus what do you mean by `low values`? I assumed it meant `values between peaks` since they are lower. Do you mean a threshold? How would they compare to the peaks? – Quang Hoang Apr 25 '19 at 16:04
  • @QuangHoang I wasn't sure what that meant either, I only quoted from the question above :-) Turns out (see edit above) that there should be only *one peak* in the end, thus I think we need to work on both our answers :-) – Asmus Apr 25 '19 at 16:09
  • @QuangHoang, in the last line of your code, I get this error: `pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)` – user308827 Apr 26 '19 at 03:28