
I need to go through each row in a dataframe and, for a column called 'source_IP_address', look at the previous 100 rows to find any rows with the same 'source_IP_address' where another column states 'authentication failure'.

I have written some code that does this, as I couldn't use pandas rolling over two columns. The problem is that it is not very fast, and I want to know if there is a better way to do it.

Here is a function that, for the previous window of n rows, counts the rows whose axis value matches the current row and whose attribute column equals the given attribute value:

import numpy as np
import pandas as pd

def check_window_for_match(df_w, window_size, axis_col, attr_col, attr_var):
    l = []
    n_rows = df_w.shape[0]

    for i in range(n_rows):
        # previous window_size rows including the current row; clamp the
        # start at 0 so the first rows don't wrap around to the end of
        # the frame via a negative iloc index
        start = max(0, i - (window_size - 1))
        temp_df = df_w.iloc[start:i + 1]

        # the current axis value to match against
        current_value = df_w[axis_col].iloc[i]

        # count rows in the window whose axis value matches the current
        # row and whose attribute column equals the target value
        l_temp = temp_df.loc[(temp_df[axis_col] == current_value) & (temp_df[attr_col] == attr_var)].shape[0]
        l.append(l_temp)

    return l

For example:

 df_test = pd.DataFrame({'B': [0, 1, 2, np.nan, 4, 6, 7, 8, 10, 8, 7], 'C': [2, 10, 'fail', np.nan, 6, 7, 8, 'fail', 8, 'fail', 9]})

df_test

matches_list = check_window_for_match(df_test, window_size=3, axis_col='B', attr_col='C', attr_var='fail')

output: [0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 0]

I want to know if my code is correct, and whether it is the best way to do it or there is a better alternative.
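As a sketch of one faster alternative (not code from the thread), the whole computation can be vectorized with NumPy's `sliding_window_view` (available in NumPy >= 1.20). `check_window_vectorized` is an illustrative name; the sketch assumes the axis column compares elementwise with `==`, as the float 'B' column in the example does:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def check_window_vectorized(df, window_size, axis_col, attr_col, attr_var):
    b = df[axis_col].to_numpy()
    fail = (df[attr_col] == attr_var).to_numpy()
    pad = window_size - 1
    # pad the front so every row sees a full window; NaN never compares
    # equal, so padded slots can never count as matches
    b_pad = np.concatenate([np.full(pad, np.nan), b])
    f_pad = np.concatenate([np.zeros(pad, dtype=bool), fail])
    # one (n_rows, window_size) strided view each, instead of slicing a
    # new DataFrame on every iteration
    b_win = sliding_window_view(b_pad, window_size)
    f_win = sliding_window_view(f_pad, window_size)
    # compare every window against its own row's axis value
    match = (b_win == b[:, None]) & f_win
    return match.sum(axis=1).tolist()
```

On the `df_test` example above this returns the same list, [0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 0], without any per-row DataFrame construction.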

javid
  • This looks like it could be solved by defining your own roll function as described here https://stackoverflow.com/questions/38878917/how-to-invoke-pandas-rolling-apply-with-parameters-from-multiple-column – Tom C Oct 15 '19 at 13:52
  • Thanks, using strides is extremely fast, as we operate at the byte level. However, the solution in the link above uses pd.concat(...) to create a new data frame with the window of values and then does a groupby, which is very slow. I found it much faster to perform the logic on the window using np.where(...) – javid Nov 21 '19 at 11:41
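A hedged sketch of the approach the last comment describes: keep the explicit loop, but slice plain NumPy arrays and count matches with np.where instead of constructing a DataFrame per row (`count_in_window_np` is an illustrative name, not from the thread):

```python
import numpy as np
import pandas as pd

def count_in_window_np(df, window_size, axis_col, attr_col, attr_var):
    # pull both columns out once as plain arrays
    b = df[axis_col].to_numpy()
    fail = (df[attr_col] == attr_var).to_numpy()
    out = []
    for i in range(len(b)):
        start = max(0, i - window_size + 1)
        # boolean logic on raw array slices is far cheaper than df.loc
        # on a sliced DataFrame
        hits = np.where((b[start:i + 1] == b[i]) & fail[start:i + 1])[0]
        out.append(len(hits))
    return out
```

This keeps the same window semantics as the original function while avoiding the per-iteration pandas overhead.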

0 Answers