
I have a DataFrame that has been put together from several different source files, and in some instances there are duplicated entries across the files, but the amounts differ slightly.

The table below is a representation of the data I am dealing with:

index  name       recipient    amount  file    year
0      company a  recipient a  1003    File A  2020
1      company a  recipient a  1000    File B  2020
2      company b  recipient b  500     File A  2018
3      company a  recipient a  760     File A  2017

In the above table, the rows at index 0 and 1 are duplicates, and I want to drop the row at index 1, so that the result looks like this:
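
index  name       recipient    amount  file    year
0      company a  recipient a  1003    File A  2020
2      company b  recipient b  500     File A  2018
3      company a  recipient a  760     File A  2017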

I haven't been able to get very far with writing any code, though I have an idea sketched out. It may not be very efficient, as the idea centers on looping through the rows.

At each row, if all of the following conditions are met (where Row 0 is the current row and Row 1 is the row directly below it):

  • Row 0 (recipient) = Row 1 (recipient)
  • Row 0 (year) = Row 1 (year)
  • Row 0 (file) != Row 1 (file)
  • (Row 0 (amount) - Row 1 (amount)) / Row 0 (amount) < 0.05

Then drop Row 1

The data will be sorted so that the check will always be done for the row below.
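
The closest I've come to expressing this without an explicit loop is the untested sketch below, which uses shift() to line each row up against the row above it. The DataFrame df and its column names are just taken from the table above:

import pandas as pd

# the sample data from the table above
df = pd.DataFrame({
    'name': ['company a', 'company a', 'company b', 'company a'],
    'recipient': ['recipient a', 'recipient a', 'recipient b', 'recipient a'],
    'amount': [1003, 1000, 500, 760],
    'file': ['File A', 'File B', 'File A', 'File A'],
    'year': [2020, 2020, 2018, 2017],
})

# line every row up against the row directly above it
prev = df.shift(1)

# a row is a near-duplicate if all four conditions hold against the row above
is_dupe = (
    (df['recipient'] == prev['recipient'])
    & (df['year'] == prev['year'])
    & (df['file'] != prev['file'])
    & ((prev['amount'] - df['amount']) / prev['amount'] < 0.05)
)

# keep only the rows that were not flagged as near-duplicates
new_df = df[~is_dupe]

I'm also not sure whether the amount check needs abs(), since as written a second amount that is much larger than the first would also pass the < 0.05 test.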

What is an effective way to apply the above logic?

**Update**

I've tried to replicate the answer from Pandas compare next row based on conditions:

recipient = df['recipient']
amount = df['amount']
file = df['file']
year = df['year']

for i in range(len(recipient)-1):
    print(i)
    if(recipient[i] == recipient[i+1] and year[i] == year[i+1] and file[i] != file[i+1] and ((amount[i] - amount[i+1])/amount[i]) < 0.05):
        del recipient[i]
        del amount[i]
        del file[i]
        del year[i]
        

new_df = pd.DataFrame({'recipient': recipient, 'amount': amount, 'file': file, 'year': year})

However, I am getting a KeyError: 10 error message.
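
My guess is that the KeyError happens because I'm deleting entries from the Series while still looping over the original positions, so later iterations look up labels that no longer exist. A version that only collects the labels to drop and drops them all at the end might avoid that (again an untested sketch, using the same df and conditions as above):

to_drop = []

for i in range(len(df) - 1):
    same_recipient = df['recipient'].iloc[i] == df['recipient'].iloc[i + 1]
    same_year = df['year'].iloc[i] == df['year'].iloc[i + 1]
    different_file = df['file'].iloc[i] != df['file'].iloc[i + 1]
    close_amount = (df['amount'].iloc[i] - df['amount'].iloc[i + 1]) / df['amount'].iloc[i] < 0.05

    if same_recipient and same_year and different_file and close_amount:
        # remember the label of the row below so it can be dropped afterwards
        to_drop.append(df.index[i + 1])

new_df = df.drop(to_drop)

Even if that works, it still loops over every row, which is what I was hoping to avoid.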
