
I have a DataFrame that looks like this:

import pandas as pd

raw_data = {'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
            'SP': [35.6, 56.7, 41, 41],
            '1M': [-7.8, 56, 56, -3.4],
            '3M': [24, -31, 53, 5]}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'SP', '1M', '3M'])
print(df)

I would like to run a check on certain columns of this DataFrame only; the column names to test are given in this collection:

check = {'1M', 'SP'}
print(check)

For these columns, I would like to know when the value in either of them is the same as the value on the previous day. The output DataFrame should return the series date and a comment, such as (for this example):

output_data = {'Series_Date': ['2017-03-14', '2017-03-15'],
               'Comment': ["Value for 1M data is same as previous day",
                           "Value for SP data is same as previous day"]}
output_data_df = pd.DataFrame(output_data, columns=['Series_Date', 'Comment'])
print(output_data_df)

Could you please provide some assistance with how to approach this?

Chanda Korat
sg91

3 Answers


I'm not sure this is the cleanest way to do it, but it works:

check = {'1M', 'SP'}
prev_dict = {c: None for c in check}

def check_prev_value(row):
    # Compare each checked column against the value seen on the previous row
    global prev_dict
    msg = ""
    # NOTE: if both columns match on the same day, only the last comment is kept;
    # add a clause here if you need both reported
    for column in check:
        if row[column] == prev_dict[column]:
            msg = 'Value for %s data is same as previous day' % column
        prev_dict[column] = row[column]
    return msg

df['comment'] = df.apply(check_prev_value, axis=1)

output_data_df = df[df['comment'] != ""]
output_data_df = output_data_df[["Series_Date", "comment"]].reset_index(drop=True)

For your input:

  Series_Date    SP    1M  3M
0  2017-03-10  35.6  -7.8  24
1  2017-03-13  56.7  56.0 -31
2  2017-03-14  41.0  56.0  53
3  2017-03-15  41.0  -3.4   5

The output is:

  Series_Date                                    comment
0  2017-03-14  Value for 1M data is same as previous day
1  2017-03-15  Value for SP data is same as previous day
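As a side note, the same check can be written without the global dict by using `shift()`, which aligns each row with the one above it. A self-contained sketch on the question's data (not the answer's exact code, just an equivalent vectorized variant):

```python
import pandas as pd

raw_data = {'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
            'SP': [35.6, 56.7, 41, 41],
            '1M': [-7.8, 56, 56, -3.4],
            '3M': [24, -31, 53, 5]}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'SP', '1M', '3M'])
check = {'1M', 'SP'}

rows = []
for column in check:
    # shift() moves values down one row; equality marks a repeat of the previous day
    same = df[column] == df[column].shift()
    for date in df.loc[same, 'Series_Date']:
        rows.append({'Series_Date': date,
                     'Comment': 'Value for %s data is same as previous day' % column})

output_data_df = (pd.DataFrame(rows, columns=['Series_Date', 'Comment'])
                  .sort_values('Series_Date').reset_index(drop=True))
print(output_data_df)
```

This produces the same two rows (2017-03-14 for 1M, 2017-03-15 for SP) without mutating any state between rows.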
AndreyF
  • Thanks but what if I were to check on any other columns such as just SP, or SP and 3M? I want my columns to be tested to be specified as per the columns in the 'check' list – sg91 Mar 16 '17 at 11:12
  • I updated the code. Now it will search for columns that appear in check – AndreyF Mar 16 '17 at 12:35

The following does more or less what you want. A column `<item>_ok` is added to the original DataFrame for each checked column, specifying whether the value is the same as on the previous day:

from datetime import timedelta

check = {'1M', 'SP'}
df['Date_diff'] = pd.to_datetime(df['Series_Date']).diff()
for item in check:
    # Same value as the previous row AND the previous row is exactly one day earlier
    df[item + '_ok'] = (df[item].diff() == 0) & (df['Date_diff'] == timedelta(1))
df_output = df.loc[df[[item + '_ok' for item in check]].any(axis=1)]
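Note that the `timedelta(1)` condition means only consecutive calendar days count: a repeated value across the 2017-03-10 → 2017-03-13 weekend gap would not be flagged, which differs slightly from the row-based answers. A self-contained run on the question's data:

```python
from datetime import timedelta
import pandas as pd

raw_data = {'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
            'SP': [35.6, 56.7, 41, 41],
            '1M': [-7.8, 56, 56, -3.4],
            '3M': [24, -31, 53, 5]}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'SP', '1M', '3M'])
check = {'1M', 'SP'}

df['Date_diff'] = pd.to_datetime(df['Series_Date']).diff()
for item in check:
    # Flag rows whose value repeats the previous row, one calendar day apart
    df[item + '_ok'] = (df[item].diff() == 0) & (df['Date_diff'] == timedelta(1))
df_output = df.loc[df[[item + '_ok' for item in check]].any(axis=1)]
print(df_output[['Series_Date', 'SP_ok', '1M_ok']])
```

This keeps the 2017-03-14 row (1M repeat) and the 2017-03-15 row (SP repeat).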
td39

Reference: this answer

cols = ['1M','SP']
for col in cols:
    df[col + '_dup'] = df[col].groupby((df[col] != df[col].shift()).cumsum()).cumcount()

The output column will contain an integer greater than zero when a duplicate of the previous value is found.

df:

  Series_Date    SP    1M  3M  1M_dup  SP_dup
0  2017-03-10  35.6  -7.8  24       0       0
1  2017-03-13  56.7  56.0 -31       0       0
2  2017-03-14  41.0  56.0  53       1       0
3  2017-03-15  41.0  -3.4   5       0       1
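To see why the one-liner works: the inner `(s != s.shift()).cumsum()` assigns a run label that increments every time the value changes, and `cumcount` then numbers the positions within each run, so anything greater than zero is a repeat. A small standalone sketch:

```python
import pandas as pd

s = pd.Series([5, 5, 7, 5, 5, 5])
run_id = (s != s.shift()).cumsum()  # 1 1 2 3 3 3: new label whenever the value changes
dup = s.groupby(run_id).cumcount()  # 0 1 0 0 1 2: position within each run
print(list(dup))
```

Note that the final run of three 5s is labeled separately from the first run, so only *consecutive* repeats are counted.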

Slice to find dups:

col = 'SP'
dup_df = df[df[col + '_dup'] > 0][['Series_Date', col + '_dup']]

dup_df:

  Series_Date  SP_dup
3  2017-03-15       1

Here is a function version of the above (with the added feature of handling multiple columns):

import pandas as pd
import numpy as np

def find_repeats(df, col_list, date_col='Series_Date'):
    """Return rows where a value in any column of col_list repeats the previous row's value."""
    dummy_df = df[[date_col, *col_list]].copy()
    dates = dummy_df[date_col]
    date_series = []
    code_series = []
    if len(col_list) > 1:
        for col in col_list:
            these_repeats = df[col].groupby((df[col] != df[col].shift()).cumsum()).cumcount().values
            repeat_idx = list(np.where(these_repeats > 0)[0])
            date_arr = dates.iloc[repeat_idx]
            code_arr = [col] * len(date_arr)
            date_series.extend(list(date_arr))
            code_series.extend(code_arr)
        return pd.DataFrame({date_col: date_series, 'col_dup': code_series}).sort_values(date_col).reset_index(drop=True)
    else:
        col = col_list[0]
        dummy_df[col + '_dup'] = df[col].groupby((df[col] != df[col].shift()).cumsum()).cumcount()
        return dummy_df[dummy_df[col + '_dup'] > 0].reset_index(drop=True)

find_repeats(df, ['1M'])

  Series_Date    1M  1M_dup
0  2017-03-14  56.0       1

find_repeats(df, ['1M', 'SP'])

  Series_Date col_dup
0  2017-03-14      1M
1  2017-03-15      SP

And here is another way using pandas diff:

def find_repeats(df, col_list, date_col='Series_Date'):
    code_list = []
    dates = list()

    for col in col_list:
        these_dates = df[date_col].iloc[np.where(df[col].diff().values == 0)[0]].values
        code_arr = [col] * len(these_dates)
        dates.extend(list(these_dates))
        code_list.extend(code_arr)
    return pd.DataFrame({date_col: dates, 'val_repeat': code_list}).sort_values(date_col).reset_index(drop=True)
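For completeness, here is the `diff`-based version run end to end on the question's data (repeated here so the snippet is self-contained; note `df[col].diff() == 0` is safely False on the first row, since `NaN == 0` is False):

```python
import pandas as pd
import numpy as np

raw_data = {'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
            'SP': [35.6, 56.7, 41, 41],
            '1M': [-7.8, 56, 56, -3.4],
            '3M': [24, -31, 53, 5]}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'SP', '1M', '3M'])

def find_repeats(df, col_list, date_col='Series_Date'):
    code_list = []
    dates = []
    for col in col_list:
        # Rows whose value equals the previous row's value (diff of exactly 0)
        these_dates = df[date_col].iloc[np.where(df[col].diff().values == 0)[0]].values
        code_list.extend([col] * len(these_dates))
        dates.extend(list(these_dates))
    return (pd.DataFrame({date_col: dates, 'val_repeat': code_list})
            .sort_values(date_col).reset_index(drop=True))

result = find_repeats(df, ['1M', 'SP'])
print(result)
```

This matches the `cumcount`-based output: 2017-03-14 flagged for 1M and 2017-03-15 flagged for SP.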
b2002