1

I have a question about eliminating outliers from two-time series. One time series includes spot market prices and the other includes power outputs. The two series are from 2012 to 2016 and are both CSV files with the with a timestamp and then a value. As example for the power output: 2012-01-01 00:00:00,2335.2152646951617 and for the price: 2012-01-01 00:00:00,17.2

Because the spot market prices are very volatile and have a lot of outliers, I have filtered them. For the second time series, I have to delete the values with the same timestamp, which were eliminated in the time series of the prices. I thought about generating a list with the deleted values and writing a loop to delete the values with the same timestamp in the second time series. But so far that has not worked and I'm not really on. Does anyone have an idea?

My python code looks as follow:

import pandas as pd
import matplotlib.pyplot as plt

power_output = pd.read_csv("./data/external/power_output.csv", delimiter=",", parse_dates=[0], index_col=[0])
print(power_output.head())
plt.plot(power_output)

spotmarket = pd.read_csv("./data/external/spotmarket_dhp.csv", delimiter=",", parse_dates=[0], index_col=[0])
print(spotmarket.head())

r = spotmarket['price'].pct_change().dropna() * 100
print(r)
plt.plot(r)

Q1 = r.quantile(.25)
Q3 = r.quantile(.75)
q1 = Q1-2*(Q3-Q1)
q3 = Q3+2*(Q3-Q1)

a = r[r.between(q1, q3)]
print(a)
plt.plot(a)

Can somebody help me?

Ram Koti
  • 2,203
  • 7
  • 26
  • 36
Pyrmon55
  • 183
  • 1
  • 3
  • 14

2 Answers2

1

If your question is about how to compare two timestamps you can have a look at this.

Basically you could do:

out = r[~r.between(q1, q3)] # negation of your between to get the outliers
df=pd.merge(spotmarker,out,on=['date'],how="outer",indicator=True)
df=df[df['_merge']=='left_only']

Which is a merge operation that conserves only those rows that are only present in the left dataframe

m33n
  • 1,622
  • 15
  • 38
1

The following suggestion is based on an answer of mine from a previous post. You can solve your problem by merging both of your series and storing them in pandas dataframe. Then you can use any desired technique to identify and remove outliers. Take a look at the post mentioned above.

Here is my take on your particular problem using a snippet that can handle more than one series:


Since I don't have access to your data, the following snippet will produce two series where one of them has a distinctive outlier:

def sample(colname):

    base = 100
    nsample = 20
    sigma = 10

    # Basic df with trend and sinus seasonality 
    trend1 = np.linspace(0,1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range(pd.datetime(2016, 1, 1).strftime('%Y-%m-%d'), periods=nsample).tolist()
    df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)

    # Gaussian Noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1/(np.cos(trend1)/1.05)
    df['y4'] = df['y3'] * df['trend2']

    df=df['y4'].to_frame()
    df.columns = [colname]

    return(df)

df_sample1 = sample(colname = 'series1')
df_sample2 = sample(colname = 'series2')
df_sample2['series2'].iloc[10] = 800
df_sample1.plot()
df_sample2.plot()

Series 1 - No outliers

enter image description here

Series 2 - A distinctive outlier

enter image description here

Now you can merge those series like this:

# Merge dataframes
df_merged = pd.merge(df_sample1, df_sample2, how='outer', left_index=True, right_index=True)
df_merged.plot()

enter image description here

What is considered an outlier will depend full on the nature of your dataset. In this case, you can set the level for identifying outliers using sscipy.zscore(). In the following case, every observation with a difference that exceeds 3 is considered an outlier.

# A function for removing outliers
def noSpikes(df, level, keepFirst):

    # 1. Get some info about the original data:

    ##%%
    #df = df_merged
    #level = 3
    #keepFirst = True
    ##%%

    firstVal = df[:1]
    colNames = df.columns
    colNumber = len(df.columns)

    #cleanBy = 'Series1'

    # 2. Take the first difference and 
    df_diff = df.diff()

    # 3. Remove missing values
    df_clean = df_diff.dropna()

    # 4. Select a level for a Z-score to identify and remove outliers
    df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
    ix_keep = df_Z.index

    # 5. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]

    # 6. 
    # df_keep will be missing some indexes.
    # Do the following if you'd like to keep those indexes
    # and, for example, fill missing values with the previous values
    df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)

    # 7. Keep only the original columns (drop the diffs)
    df_out = df_out.ix[:,:colNumber]

    # 8. Fill missing values
    df_complete = df_out.fillna(axis=0, method='ffill')

    # 9. Reset column names
    df_complete.columns = colNames

    # Keep the first value
    if keepFirst:
        df_complete.iloc[0] = firstVal.iloc[0]

    return(df_complete)

df_clean = noSpikes(df = df_merged, level = 3, keepFirst = True)
df_clean.plot()

enter image description here

Let me know how this works out for you.


Here's the whole thing for an easy copy-paste:

# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(22)

# A function for noisy data with a trend element
def sample(colname):

    base = 100
    nsample = 20
    sigma = 10

    # Basic df with trend and sinus seasonality 
    trend1 = np.linspace(0,1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range(pd.datetime(2016, 1, 1).strftime('%Y-%m-%d'), periods=nsample).tolist()
    df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)

    # Gaussian Noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1/(np.cos(trend1)/1.05)
    df['y4'] = df['y3'] * df['trend2']

    df=df['y4'].to_frame()
    df.columns = [colname]

    return(df)

df_sample1 = sample(colname = 'series1')
df_sample2 = sample(colname = 'series2')
df_sample2['series2'].iloc[10] = 800
df_sample1.plot()
df_sample2.plot()

# Merge dataframes
df_merged = pd.merge(df_sample1, df_sample2, how='outer', left_index=True, right_index=True)
df_merged.plot()

# A function for removing outliers
def noSpikes(df, level, keepFirst):

    # 1. Get some info about the original data:
    firstVal = df[:1]
    colNames = df.columns
    colNumber = len(df.columns)

    #cleanBy = 'Series1'

    # 2. Take the first difference and 
    df_diff = df.diff()

    # 3. Remove missing values
    df_clean = df_diff.dropna()

    # 4. Select a level for a Z-score to identify and remove outliers
    df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
    ix_keep = df_Z.index

    # 5. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]

    # 6. 
    # df_keep will be missing some indexes.
    # Do the following if you'd like to keep those indexes
    # and, for example, fill missing values with the previous values
    df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)

    # 7. Keep only the original columns (drop the diffs)
    df_out = df_out.ix[:,:colNumber]

    # 8. Fill missing values
    df_complete = df_out.fillna(axis=0, method='ffill')

    # 9. Reset column names
    df_complete.columns = colNames

    # Keep the first value
    if keepFirst:
        df_complete.iloc[0] = firstVal.iloc[0]

    return(df_complete)

df_clean = noSpikes(df = df_merged, level = 3, keepFirst = True)
df_clean.plot()
vestland
  • 55,229
  • 37
  • 187
  • 305
  • Thank you for the idea! I think it's a good way but for my problem a litte bit to much. The solutin of m33n worked also good for me. – Pyrmon55 Jun 19 '18 at 20:03