I have a dataframe with a few thousand rows and multiple columns.

I want to reduce the size of this dataframe by removing rows whose values in columns A, C and D are too similar and whose value in column E is equal. In other words, rows where the difference between the values of each column is below a threshold; this threshold can be different for each column. Among such similar rows, I want to keep the one with the highest value in column E.

I have code that populates a new dataframe, checking whether each row of the old dataframe is too similar to anything already present in the new dataframe.

import pandas as pd

cols = [list-of-column-names]
df = pd.DataFrame(l, columns=cols)  # l is a list of thousands of lists with values to populate the dataframe
df.sort_values(by='E', ascending=False, inplace=True)  # sort so the row with the highest E comes first

new_df = pd.DataFrame(columns=cols)  # create new dataframe
for i, line in df.iterrows():  # iterate over old dataframe
    rel_a = abs(1e6 * (new_df['A'] - line['A']) / new_df['A'])  # relative difference of A in ppm
    similar = new_df[
        (THRESHOLD_A1 < rel_a) & (rel_a < THRESHOLD_A2) &
        (new_df['E'] == line['E']) &
        (abs(new_df['C'] - line['C']) < THRESHOLD_C) &
        ((abs(new_df['D'] - line['D']) / new_df['D']) < THRESHOLD_D)
    ]
    if similar.empty:  # no similar row kept yet, so keep this one
        new_df = pd.concat([new_df, pd.DataFrame([line])])

However, this code is too slow. Is there a better way to write this?


Example:

d = {
  'A': [1, 1.5, 1.4, 7, 8],
  'B': [10, 11, 11.5, 13, 14],
  'C': [50, 50.5, 50.6, 60, 70],
  'D': [5, 4, 5, 3, 2],
  'E': [100, 101, 102, 103, 104]
}
df = pd.DataFrame(d)

THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1
'''
Values are too similar if the absolute difference between values of the same
column is below the threshold.
Values in column D need to be the same.
If two rows are too similar, preserve the one with the highest value in column E.

This would remove row 0.

Rationale:
row 0['D'] == row 2['D']
abs(row 0['A'] - row 2['A']) == 0.4 <= THRESHOLD_A
abs(row 0['B'] - row 2['B']) == 1.5 <= THRESHOLD_B
abs(row 0['C'] - row 2['C']) == 0.6 <= THRESHOLD_C

row 2 has the highest value in column 'E' == 102.
'''

     A     B     C  D    E
0  1.0  10.0  50.0  5  100
1  1.5  11.0  50.5  4  101
2  1.4  11.5  50.6  5  102
3  7.0  13.0  60.0  3  103
4  8.0  14.0  70.0  2  104

Output:

     A     B     C  D    E
0  1.5  11.0  50.5  4  101
1  1.4  11.5  50.6  5  102
2  7.0  13.0  60.0  3  103
3  8.0  14.0  70.0  2  104
  • Please provide something that we can test your code with. As is, it's pretty much impossible to debug arbitrary code without proper input. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – BeRT2me Jul 19 '22 at 04:43
  • I added an example to make things more clear. – Állan Ferrari Jul 19 '22 at 05:28
  • What would be the expected output of your example DataFrame? – BeRT2me Jul 19 '22 at 05:34
  • It would remove row 0. I provided the resulting dict. – Állan Ferrari Jul 19 '22 at 05:38

1 Answer

One approach is to round the numbers to a fixed number of decimal places and then group by the result. The problem is that the thresholds cannot be set arbitrarily this way, since the group width is dictated by the rounding precision. Demonstration:

import pandas as pd

df = pd.DataFrame({'A': [1.514, 1.54, 4.86], 'B': [1.51, 3.58, 4.01], 'C': [1.21, 8.52, 4.21], 'E': [5, 10, 20]})
es = df.groupby(df['A'].round(1))['E'].max()  # maximum E within each rounded-A group
df[df['E'].isin(es)]

Output:

      A     B     C   E
1  1.54  3.58  8.52  10
2  4.86  4.01  4.21  20
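
This also shows why the thresholds cannot be set arbitrarily: plain rounding splits values that sit just on either side of a rounding boundary into different groups, even when they differ by far less than the intended tolerance. A minimal illustration (the values here are made up for this note):

import pandas as pd

s = pd.Series([1.549, 1.551])
print(s.round(1))
# 0    1.5
# 1    1.6
# dtype: float64  -> nearly equal values land in different groups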

In the demonstration above, the A values of the first two rows are similar (both round to 1.5), so based on column E we keep the one with the maximum value. A better option is to use pd.cut:

t1 = int((df['A'].max() - df['A'].min()) / THRESHOLD_A)
pd.cut(df['A'], bins=t1)
t2 = int((df['B'].max() - df['B'].min()) / THRESHOLD_B)
pd.cut(df['B'], bins=t2)

which gives you the groups. Based on your sample data, pd.cut(df['A'], bins=t1) produces:

0    (0.993, 2.0]
1    (0.993, 2.0]
2    (0.993, 2.0]
3      (6.0, 7.0]
4      (7.0, 8.0]
Name: A, dtype: category

Demonstration on your sample:

import pandas as pd

THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1

d = {'A': [1, 1.5, 1.4, 7, 8],
     'B': [10, 11, 11.5, 13, 14],
     'C': [50, 50.5, 50.6, 60, 70],
     'D': [5, 4, 5, 3, 2],
     'E': [100, 101, 102, 103, 104]
}

df = pd.DataFrame(d)
cs = df.columns

# one equal-width bin per threshold-sized step of each column's range
t1 = int((df['A'].max() - df['A'].min()) / THRESHOLD_A)
c1 = pd.cut(df['A'], bins=t1)
t2 = int((df['B'].max() - df['B'].min()) / THRESHOLD_B)
c2 = pd.cut(df['B'], bins=t2)
t3 = int((df['C'].max() - df['C'].min()) / THRESHOLD_C)
c3 = pd.cut(df['C'], bins=t3)

df['c1'] = c1
df['c2'] = c2
df['c3'] = c3
# maximum E per (A-bin, B-bin, C-bin, D) group
t = df.groupby(['c1', 'c2', 'c3', 'D'])['E'].max().reset_index()['E']
es = t[t.notna()]  # grouping by categoricals yields every bin combination; drop the empty ones

df[df['E'].isin(es)][cs]

Output:

     A     B     C  D    E
1  1.5  11.0  50.5  4  101
2  1.4  11.5  50.6  5  102
3  7.0  13.0  60.0  3  103
4  8.0  14.0  70.0  2  104
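
One caveat with matching the kept rows back through df['E'].isin(es): it assumes the group maxima of E are unique, so rows could be kept or dropped incorrectly if E values repeat. A sketch that instead selects rows by index via idxmax, under the same binning idea (bins_for is a helper name invented for this sketch):

import pandas as pd

THRESHOLD_A = 1
THRESHOLD_B = 2
THRESHOLD_C = 1

d = {'A': [1, 1.5, 1.4, 7, 8],
     'B': [10, 11, 11.5, 13, 14],
     'C': [50, 50.5, 50.6, 60, 70],
     'D': [5, 4, 5, 3, 2],
     'E': [100, 101, 102, 103, 104]}
df = pd.DataFrame(d)

def bins_for(s, threshold):
    # one equal-width bin per threshold-sized step of the column's range
    return int((s.max() - s.min()) / threshold)

keys = [pd.cut(df[c], bins=bins_for(df[c], t))
        for c, t in [('A', THRESHOLD_A), ('B', THRESHOLD_B), ('C', THRESHOLD_C)]]
keys.append(df['D'])  # D must match exactly, so group on it directly

# idxmax returns the index label of the row with the highest E in each group;
# observed=True skips empty bin combinations (the notna() filter above)
idx = df.groupby(keys, observed=True)['E'].idxmax()
df.loc[sorted(idx)]

This returns the same four rows as above, but never confuses two rows that happen to share an E value.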
keramat