Drop duplicates with less precision

Question

I have a pandas DataFrame with string columns and float columns, and I would like to use drop_duplicates to remove duplicates. Some of the duplicates are not exactly the same because they differ slightly in the low decimal places. How can I remove duplicates with less precision?

Example:

import pandas as pd
df = pd.DataFrame.from_dict({'text': ['aaa','aaa','aaa','bb'], 'result': [1.000001,1.000000,2,2]})
df
     result text
0  1.000001  aaa
1  1.000000  aaa
2  2.000000  aaa
3  2.000000   bb

I would like to get

df_out = pd.DataFrame.from_dict({'text': ['aaa','aaa','bb'], 'result': [1.000001,2,2]})
df_out
     result text
0  1.000001  aaa
1  2.000000  aaa
2  2.000000   bb

Binning is an overcomplicated solution for this problem, but I'll share a link anyway: https://chrisalbon.com/python/pandas_binning_data.html — 000, May 29 '17 at 14:51

score 3 · Answer 1 · answered May 29 '17 at 14:47

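Round the values, drop duplicates on the rounded frame, and use the surviving index to select rows from the original: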

df.loc[df.round().drop_duplicates().index]

     result text
0  1.000001  aaa
2  2.000000  aaa
3  2.000000   bb
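
A minimal sketch of the same idea with the precision made explicit, since `round()` with no argument rounds to whole numbers and would also merge values such as 1.4 and 0.6. `DECIMALS` is an assumed tolerance chosen for illustration, not something from the question. It also shows how the `keep` argument of `drop_duplicates` decides which of the near-identical rows survives:

```python
import pandas as pd

df = pd.DataFrame({'text': ['aaa', 'aaa', 'aaa', 'bb'],
                   'result': [1.000001, 1.000000, 2, 2]})

DECIMALS = 3  # assumed tolerance; pick whatever precision counts as "equal"

# Round a copy, drop duplicates there, then select the surviving rows
# from the original frame so the unrounded values are preserved.
first = df.loc[df.round(DECIMALS).drop_duplicates().index]
last = df.loc[df.round(DECIMALS).drop_duplicates(keep='last').index]

print(first.index.tolist())  # [0, 2, 3] -> keeps 1.000001
print(last.index.tolist())   # [1, 2, 3] -> keeps 1.000000
```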

Steven G

Thanks. Passing 1 or `PRECISION` or something similar as an argument to `round()` might make your answer more useful, or useful more quickly, for many readers. – CPBL Mar 15 '23 at 17:52
I think this works just because `keep='first'` is the default in drop_duplicates; the result would be different if rows 0 and 1 were swapped. – David Mar 24 '23 at 14:16

score 3 · Accepted Answer · answered May 29 '17 at 14:50

You can use the function round with a given precision in order to round your df.

DataFrame.round(decimals=0, *args, **kwargs)

Round a DataFrame to a variable number of decimal places.

For example you can apply the round with two decimals by this:

df = df.round(2)

Also you can apply it on specific columns, for example:

df = df.round({'result': 2})

After rounding, you can use the function drop_duplicates.
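
Putting the two steps together might look like the sketch below. Note that assigning `df = df.round(2)` overwrites the original values, whereas the output requested in the question keeps `1.000001`. One way around that is to round a temporary copy and use it only as a mask. The helper name `drop_near_duplicates` is hypothetical, introduced here for illustration:

```python
import pandas as pd

def drop_near_duplicates(frame, decimals, keep='first'):
    """Hypothetical helper: drop rows that are equal after rounding.

    `decimals` is passed straight to DataFrame.round, so it can be an int
    or a dict mapping column names to a number of decimal places.
    """
    rounded = frame.round(decimals)
    # duplicated() marks every repeat except the one selected by `keep`
    return frame[~rounded.duplicated(keep=keep)]

df = pd.DataFrame({'text': ['aaa', 'aaa', 'aaa', 'bb'],
                   'result': [1.000001, 1.000000, 2, 2]})

# Round only the 'result' column to two decimals when comparing rows
df_out = drop_near_duplicates(df, {'result': 2}).reset_index(drop=True)
print(df_out)
```

Here `df_out` has three rows, with `text` equal to `aaa`, `aaa`, `bb` and `result` equal to `1.000001`, `2.0`, `2.0`.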

score 0 · Answer 3 · answered May 29 '17 at 15:00

Use numpy.trunc to get at the precision you are looking for. Use pandas duplicated to find which ones to keep.

import numpy as np
df[~df.assign(result=np.trunc(df.result.values * 100)).duplicated()]
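
As a self-contained sketch of this approach on the question's data: multiplying by 100 and truncating compares values at two decimal places. The second frame, `edge`, is made-up data added here to show one way truncation and rounding can disagree, namely when two close values sit on either side of a truncation boundary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'text': ['aaa', 'aaa', 'aaa', 'bb'],
                   'result': [1.000001, 1.000000, 2, 2]})

# Build a comparison key truncated to two decimals, then mask the original
key = df.assign(result=np.trunc(df['result'].to_numpy() * 100))
df_trunc = df[~key.duplicated()]
print(df_trunc.index.tolist())  # [0, 2, 3]

# Made-up edge case: 0.999999 and 1.000001 straddle a truncation boundary
edge = pd.DataFrame({'text': ['aaa', 'aaa'], 'result': [0.999999, 1.000001]})
edge_key = edge.assign(result=np.trunc(edge['result'].to_numpy() * 100))
by_trunc = edge[~edge_key.duplicated()]            # keys 99 and 100: both rows kept
by_round = edge[~edge.round(2).duplicated()]       # both round to 1.0: one row kept
print(len(by_trunc), len(by_round))  # 2 1
```

Rounding has boundaries of its own (at the half-way points), so neither method guarantees that all values within a given distance of each other are merged.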

piRSquared
