
Following up on my previous question:

I have a dataframe:

load,timestamp,timestr
0,1576147339.49,124219
0,1576147339.502,124219
2,1576147339.637,124219
1,1576147339.641,124219
9,1576147339.662,124219
8,1576147339.663,124219
7,1576147339.663,124219
6,1576147339.663,124219
5,1576147339.663,124219
4,1576147339.663,124219
3,1576147339.663,124219
2,1576147339.663,124219
1,1576147339.663,124219
0,1576147339.663,124219
0,1576147339.673,124219
3,1576147341.567,124221
2,1576147341.568,124221
1,1576147341.569,124221
0,1576147341.57,124221
4,1576147341.581,124221
3,1576147341.581,124221

I want to remove all rows whose 'timestamp' values are within some tolerance of one another, keeping only the row with the largest value in the 'load' column.

In the above example, if tolerance=0.01, that would leave us with

load,timestamp,timestr
0,1576147339.49,124219
0,1576147339.502,124219
2,1576147339.637,124219
9,1576147339.662,124219
0,1576147339.673,124219
3,1576147341.567,124221
4,1576147341.581,124221

The maximal value of 'load' doesn't have to be the 1st one!

Gulzar

2 Answers


The idea is to bin the timestamps: multiply by 1/tolerance, round to whole numbers, and pass the result to groupby to aggregate with max:

tolerance=0.01

df = df.groupby(df['timestamp'].mul(1/tolerance).round()).max().reset_index(drop=True)
print(df)
   load     timestamp  timestr
0     0  1.576147e+09   124219
1     0  1.576147e+09   124219
2     2  1.576147e+09   124219
3     9  1.576147e+09   124219
4     0  1.576147e+09   124219
5     3  1.576147e+09   124221
6     4  1.576147e+09   124221
jezrael
  • I ran `diff_before = np.diff(df[self._time_column_name].values)`, then `self._dfs[i] = df.groupby(df[self._time_column_name].mul(1/self._time_tolerance).round()).max().reset_index(drop=True)`, then `diff_after = np.diff(self._dfs[i][self._time_column_name].values)`; `diff_after.min()` is 0.006, which is smaller than the tolerance. The dataset looks like the above one, only larger. – Gulzar Dec 15 '19 at 13:12
  • @Gulzar Unfortunately it is not so easy. Is performance important? What is the size of the DataFrame in your real data? – jezrael Dec 16 '19 at 19:39
  • can be some 10000 samples, maybe some 100000 – Gulzar Dec 16 '19 at 19:48

Rounding is susceptible to a boundary problem: there can be two rows with fractional parts of e.g. 0.494 and 0.502. The first rounds to 0.49 and the second to 0.50, so they end up in different groups even though they are less than 0.01 apart.
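
A minimal illustration of this boundary effect, using the binning from the other answer with tolerance=0.01:

```python
import pandas as pd

# Two timestamps only 0.008 apart, i.e. within tolerance=0.01,
# yet scaling by 1/tolerance and rounding bins them separately:
ts = pd.Series([0.494, 0.502])
bins = ts.mul(1 / 0.01).round()
print(bins.tolist())  # [49.0, 50.0] -> two different groups
```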

So my proposition is to do the job (compute result DataFrame) by iteration:

rows = []
wrk = df.sort_values('timestamp')
threshold = 0.01
while wrk.index.size > 0:
    tMin = wrk['timestamp'].iloc[0]               # earliest remaining timestamp
    grp = wrk[wrk.timestamp <= tMin + threshold]  # rows within the tolerance window
    rows.append(grp.nlargest(1, 'load'))          # keep the row with max load
    wrk = wrk.drop(grp.index)
result = pd.concat(rows)
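
A self-contained version of this loop, runnable end to end on the question's data (the result rows are collected and combined with `pd.concat`, since `DataFrame.append` was removed in pandas 2.0):

```python
import io
import pandas as pd

csv = """load,timestamp,timestr
0,1576147339.49,124219
0,1576147339.502,124219
2,1576147339.637,124219
1,1576147339.641,124219
9,1576147339.662,124219
8,1576147339.663,124219
7,1576147339.663,124219
6,1576147339.663,124219
5,1576147339.663,124219
4,1576147339.663,124219
3,1576147339.663,124219
2,1576147339.663,124219
1,1576147339.663,124219
0,1576147339.663,124219
0,1576147339.673,124219
3,1576147341.567,124221
2,1576147341.568,124221
1,1576147341.569,124221
0,1576147341.57,124221
4,1576147341.581,124221
3,1576147341.581,124221
"""
df = pd.read_csv(io.StringIO(csv))

rows = []
wrk = df.sort_values('timestamp')
threshold = 0.01
while len(wrk):
    tMin = wrk['timestamp'].iloc[0]               # earliest remaining timestamp
    grp = wrk[wrk['timestamp'] <= tMin + threshold]
    rows.append(grp.nlargest(1, 'load'))          # keep the row with max load
    wrk = wrk.drop(grp.index)
result = pd.concat(rows)
print(result['load'].tolist())  # [0, 0, 2, 9, 0, 3, 4]
```

This reproduces the expected output from the question: seven rows, one per tolerance window, each carrying that window's maximum load.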

To confirm my initial remark, change the fractional part of timestamp in the first row to 0.494. For readability, I also "shortened" the integer part.

My solution returns:

   load  timestamp timestr
0     0   7339.494  124219
2     2   7339.637  124219
4     9   7339.662  124219
14    0   7339.673  124219
15    3   7341.567  124221
19    4   7341.581  124221

whereas the other solution returns:

   load  timestamp  timestr
0     0   7339.494   124219
1     0   7339.502   124219
2     2   7339.641   124219
3     9   7339.663   124219
4     0   7339.673   124219
5     3   7341.570   124221
6     4   7341.581   124221
Valdi_Bo
  • Thanks! Does pandas not have something built in for something like this? Also, what does `wrk` stand for? – Gulzar Dec 15 '19 at 17:04
  • Unfortunately, *Pandas* has no built-in function to perform such an operation. It has *groupby*, but here there is no way to express the grouping criteria up front. The only method is an "iterative" one: set the criteria at the beginning of each iteration and drop the "processed" rows. *wrk* is a work (auxiliary) DataFrame; I introduced it so as not to change the original DataFrame. – Valdi_Bo Dec 15 '19 at 17:13
  • what about some inplace function that "rounds" each number by the tolerance, then groupby? – Gulzar Dec 15 '19 at 20:32