I have a DataFrame that has been put together using different source files, and in some instances there are duplicated entries in the files, yet the amounts differ.
The below table is a representation of the data I am dealing with
index | name | recipient | amount | file | year |
---|---|---|---|---|---|
0 | company a | recipient a | 1003 | File A | 2020 |
1 | company a | recipient a | 1000 | File B | 2020 |
2 | company b | recipient b | 500 | File A | 2018 |
3 | company a | recipient a | 760 | File A | 2017 |
In the above table, rows at index 0 & 1 are duplicates, and I want to drop the row at index 1
I haven't been able to get very far with writing any code, though I have an idea sketched out. It may not be very efficient, as the idea centers on looping through the rows.
At each row, if the following conditions are met:
- Row 0 (recipient) = Row 1 (recipient)
- Row 0 (year) = Row 1 (year)
- Row 0 (file) != Row 1 (file)
- (Row 0 (Amount) - Row 1 (Amount)) / Row 0 (amount) < 0.05
Then drop Row 1
The data will be sorted so that the check will always be done for the row below.
What is an effective way to apply the above logic?
**Update
I've tried to replicate the answer from Pandas compare next row based on conditions
recipient = df['recipient']
amount = df['amount']
file = df['file']
year = df['year']
for i in range(len(recipient)-1):
print(i)
if(recipient[i] == recipient[i+1] and year[i] == year[i+1] and file[i] != file[i+1] and ((amount[i] - amount[i+1])/amount[i]) < 0.05):
del recipient[i]
del amount[i]
del file[i]
del year[i]
new_df = pd.DataFrame({'recipient': recipient, 'amount': amount, 'file': file, 'year': year})
Except I am getting a KeyError: 10 error message