I have a dataframe with a few thousand rows and multiple columns.
I want to reduce the size of this dataframe by removing rows which values of columns A, C and D are too similar, and column D is equal. In other words, where the difference of the values of each column is below a threshold. This threshold can be different for each column. Also, I want to keep the row with the highest value based on column E.
I have a code that populates a new dataframe and checks if each row of the old dataframe is too similar to anything already present in the new dataframe.
cols = [list-of-column-names]
df = pd.DataFrame(l, columns=cols) # l is a list of thousands of lists with values to populate the dataframe
df.sort_values(by='E', ascending=False, inplace=True) # Sort based on the column I want to keep the highest value
new_df = pd.DataFrame(columns=cols) # Create new dataframe
for i, line in df.iterrows(): # Iterate over old dataframe
if len(
new_df[
(THRESHOLD_A1 < abs(1e6 * (new_df['A'] - line['A']) / new_df['A'])) & (
abs(1e6 * (new_df['A'] - line['A']) / new_df['A']) < THRESHOLD_A2) &
(new_df['E'] == line['E']) &
(abs(new_df['C'] - line['C']) < THRESHOLD_C) &
((abs(new_df['D'] - line['D']) / new_df['D']) < THRESHOLD_D)
]
) == 0: # If no row in the new dataframe was found, then append this row to new dataframe
new_df = pd.concat([new_df, pd.DataFrame([line])])
However, this code is too slow. Is there a better way to write this?
Example:
d = {
'A': [1, 1.5, 1.4, 7, 8],
'B': [10, 11, 11.5, 13, 14],
'C': [50, 50.5, 50.6, 60, 70],
'D': [5, 4, 5, 3, 2],
'E': [100, 101, 102, 103, 104]
}
df = pd.DataFrame(d)
THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1
'''
# Values are too similar if absolute difference between values of same column is below threshold
# values in column D needs to be the same
# If two rows are too similar, preserve the one with highest value in E column
This would remove row 0
Rational
row 0 ['E'] == row 2 ['E']
abs(row 0 ['A'] - row 2 ['A']) == 0.4 <= THRESHOLD_A
abs(row 0 ['B'] - row 2 ['B']) == 1.5 <= THRESHOLD_B
abs(row 0 ['C'] - row 2 ['C']) == 0.6 <= THRESHOLD_C
row 2 has the highest value in column 'D' == 102.
'''
A B C D E
0 1.0 10.0 50.0 5 100
1 1.5 11.0 50.5 4 101
2 1.4 11.5 50.6 5 102
3 7.0 13.0 60.0 3 103
4 8.0 14.0 70.0 2 104
Output:
A B C D E
0 1.5 11.0 50.5 4 101
1 1.4 11.5 50.6 5 102
2 7.0 13.0 60.0 3 103
3 8.0 14.0 70.0 2 104