
I've been working on a little project to sift through backup job reports and am using a couple of pandas DataFrames to try to accomplish this. I understand that manually iterating through a DataFrame isn't ideal, but I'm not sure there's a better method for what I'm trying to achieve. I'm also rather new to Python and pandas in general.

So I have a list of backup jobs, uniquely defined by their backup type, system name, and system type. This list is stored in objects_df.

Next, I have a list of every job run and its success or failure status. This is stored in report_df and is a few thousand rows.

I want to 'flag' a job in report_df once it has failed 3 consecutive times. I only care about 'flagging' the first time that happens.
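
To make that concrete, here's roughly what the data looks like (the values are made up, but the column names match the code below):

import pandas as pd

# Made-up sample data -- the real frames have the same columns but many more rows
objects_df = pd.DataFrame({
    'Task Type':   ['Full', 'Incremental'],
    'Object Name': ['SQL01', 'FILE02'],
    'Object Type': ['VM', 'Physical'],
})

report_df = pd.DataFrame({
    'Task Type':   ['Full'] * 4 + ['Incremental'],
    'Object Name': ['SQL01'] * 4 + ['FILE02'],
    'Object Type': ['VM'] * 4 + ['Physical'],
    'Task Status': ['Failed', 'Failed', 'Failed', 'Succeeded', 'Succeeded'],
})
# In this example, the third 'Failed' row for SQL01 is the one that should get flag = True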

And here's the ugly piece of code I'm trying to use to make this all happen:

for index, row in objects_df.iterrows():
    count = 0
    for index2, row in report_df.iterrows():
        if (objects_df['Task Type'][index].equals(report_df['Task Type'][index2])
                and objects_df['Object Name'][index].equals(report_df['Object Name'][index2])
                and objects_df['Object Type'][index].equals(report_df['Object Type'][index2])):
            if(row['Task Status'] == "Succeeded" and count < 3):
                count = 0
            if(row['Task Status'] == "Failed"):
                count += 1
            if(count == 3):
                report_df.at[index2, 'flag'] = True
                break

As you can see, it goes through my list of objects, and for every match it finds in the report it starts counting failures. Once it reaches 3 consecutive failures, it flags that row in report_df.

Unfortunately, this code has been giving me all sorts of problems and doesn't mark the expected rows with flag = True. Instead, I'm getting duplicate rows, and flags on successful runs are being set to True.

Is there a way to clean this code up or perhaps a better way to approach this project altogether? Even if I do get this working, I'm afraid it'll take forever to go through all the data.

Thanks!

Andy
    Welcome to stack overflow! It would help to see a sample of your input dataframes and your expected output so that we can better understand your task and how to help, see [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – G. Anderson Jul 08 '21 at 17:30
  • Why not combine "backup type, system name, and system type" into a single "id" and use it to count, with something like len(df[df["id"] == "bkX-sysX-systX"]) for each "id" (which you can get with df["id"].unique())? – braulio Jul 08 '21 at 17:49

1 Answer


Have you investigated DataFrame.compare? https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html

I think your task is most easily split into 2 steps:

  1. compare the two dataframes
  2. process the resulting dataframe for instances of 3 or more consecutive non-equal comparisons

The second step is really a separate problem; here's a similar question that uses the same consecutive-count trick: https://stackoverflow.com/a/41488980/9190640
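
Here's a rough, untested sketch of how the two steps could fit together. The column names are taken from the question, the rows in report_df are assumed to already be in run order, and I've matched the jobs with a merge rather than compare(), since the two frames don't have the same shape:

import pandas as pd

keys = ['Task Type', 'Object Name', 'Object Type']

# Step 1: keep only the report rows whose job appears in objects_df
matched = report_df.merge(objects_df[keys].drop_duplicates(), on=keys, how='inner')

# Step 2: within each job (rows in run order), find the first time
# 'Failed' occurs 3 times in a row
def flag_first_triple_failure(statuses):
    is_fail = statuses.eq('Failed')
    # label each run of identical statuses, then count within the run
    run_id = (is_fail != is_fail.shift()).cumsum()
    streak = is_fail.astype(int).groupby(run_id).cumsum()
    hit = is_fail & streak.eq(3)
    # keep only the very first such row for this job
    return hit & (hit.cumsum() == 1)

matched['flag'] = (
    matched.groupby(keys, group_keys=False)['Task Status']
           .apply(flag_first_triple_failure)
)

Note that the flag ends up on the merged copy rather than on report_df itself; you could merge it back on the key columns plus a run timestamp, or skip step 1 entirely if every row in report_df already belongs to a job in objects_df.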

jorf.brunning