0

I'm trying to iterate over a large DataFrame that has 32 fields, 1 million plus rows.

What i'm trying to do is iterate over each row, and check whether any of the rest of the rows have duplicate information in 30 of the fields, while the other two fields have different information.

I'd then like to store the the ID info. of the rows that meet these conditions.

So far i've been trying to figure out how to check two rows with the below code, it seems to work when comparing single columns but throws an error when I try more than one column, could anyone advise on how best to approach?

for index in range(len(df)):
    for row in range(index, len(df)):
        if df.iloc[index][1:30] == df.iloc[row][1:30]:
            print(df.iloc[index])

1 Answers1

0

As a general rule, you should always always try not to iterate over the rows of a DataFrame.

It seems that what you need is the pandas duplicated() method. If you have a list of the 30 columns you want to use to determine duplicates rows, the code looks something like this:

df.duplicated(subset=['col1', 'col2', 'col3']) # etc.

Full example:

# Set up test df
from io import StringIO
sub_df = pd.read_csv(
    StringIO("""ID;col1;col2;col3
One;23;451;42;31
Two;24;451;42;54
Three;25;513;31;31"""
    ),
    sep=";"
)

enter image description here

Find which rows are duplicates in col1 and col2. Note that the default is that the first instance is not marked as a duplicate, but later duplicates are. This behaviour can be changed as described in the documentation I linked to above.

mask = sub_df.duplicated(["col1", "col2"])

This looks like:

Now, filter using the mask.

sub_df["ID"][sub_df.duplicated(["col1", "col2"])]

enter image description here

Of course, you can do the last two steps in one line.

butterflyknife
  • 1,438
  • 8
  • 17
  • Would you have any advice on how to deal with the two other fields (namely ID & Time)? Basically, I need to test that for the duplicate rows, there are at least two different ID's, and a time difference of less than three minutes. – Ronnie_drew Feb 28 '22 at 14:19
  • 1
    Thanks for the accept. I think the answer to your follow up would be more than I can fit into a comment. Two tips that I can give are to make sure that the time stamps really are datetime objects (ie., not strings), and to look into relativedelta. – butterflyknife Feb 28 '22 at 19:34