2

I have a pandas DataFrame with string columns and float columns, and I would like to use drop_duplicates to remove duplicate rows. Some of the duplicates are not exactly identical, because the float values differ slightly in the low decimal places. How can I remove duplicates while comparing the floats at a lower precision?

Example:

import pandas as pd
df = pd.DataFrame.from_dict({'text': ['aaa','aaa','aaa','bb'], 'result': [1.000001,1.000000,2,2]})
df
     result text
0  1.000001  aaa
1  1.000000  aaa
2  2.000000  aaa
3  2.000000   bb

I would like to get

df_out = pd.DataFrame.from_dict({'text': ['aaa','aaa','bb'], 'result': [1.000001,2,2]})
df_out
     result text
0  1.000001  aaa
1  2.000000  aaa
2  2.000000   bb
Make42
  • Binning is an overcomplicated solution for this problem, but I'll share a link anyway: https://chrisalbon.com/python/pandas_binning_data.html – 000 May 29 '17 at 14:51

3 Answers

3

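Round the values, drop duplicates on the rounded copy, and use the surviving index to select the rows from the original DataFrame: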

df.loc[df.round().drop_duplicates().index]

     result text
0  1.000001  aaa
2  2.000000  aaa
3  2.000000   bb
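
As a minimal sketch of the same idea with an explicit precision, the snippet below rounds only the float column and uses a boolean mask instead of index labels, so it does not depend on the index being unique. The name `PRECISION` and its value are assumptions chosen for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'text': ['aaa', 'aaa', 'aaa', 'bb'],
                   'result': [1.000001, 1.000000, 2, 2]})

PRECISION = 3  # hypothetical: number of decimal places to compare on

# Detect duplicates on a rounded copy; the original values stay untouched.
rounded = df.round({'result': PRECISION})
mask = ~rounded.duplicated(keep='first')

# Keep the original (unrounded) rows that were not flagged as duplicates.
df_out = df[mask]
print(df_out)
```

With `keep='first'` the row holding 1.000001 survives; passing `keep='last'` to `duplicated` would retain the row holding 1.000000 instead.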
Steven G
  • Thanks. Passing 1 or `PRECISION` or something similar as an argument to `round()` might make your answer more useful to many readers. – CPBL Mar 15 '23 at 17:52
  • I think this works just because `keep='first'` is the default in drop_duplicates; the result would be different if rows 0 and 1 were swapped. – David Mar 24 '23 at 14:16
3

You can use DataFrame.round with a given precision to round your DataFrame.

DataFrame.round(decimals=0, *args, **kwargs)

Round a DataFrame to a variable number of decimal places.

For example, to round to two decimal places:

df = df.round(2)

You can also apply it to specific columns, for example:

df = df.round({'result': 2})

After rounding, you can call drop_duplicates on the result.
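
Putting the two steps together on the question's example data might look like the sketch below. Note that, under this approach, the values stored in the result are the rounded ones, so the first row ends up holding 1.0 rather than 1.000001.

```python
import pandas as pd

df = pd.DataFrame({'text': ['aaa', 'aaa', 'aaa', 'bb'],
                   'result': [1.000001, 1.000000, 2, 2]})

# Round only the float column, then drop rows that have become identical.
deduped = df.round({'result': 2}).drop_duplicates()
print(deduped)
```

If the original, unrounded values need to be preserved, the rounded frame can be used only to decide which rows to keep rather than as the final result.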

omri_saadon
0

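Use numpy.trunc to cut the values down to the precision you are looking for, then use pandas' duplicated to find which rows to keep. The line below compares at two decimal places by scaling by 100 before truncating.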

import numpy as np
df[~df.assign(result=np.trunc(df.result.values * 100)).duplicated()]
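
Truncating and rounding place values into different bins, so the two approaches can disagree for values that sit just on either side of a bin edge. The sketch below uses made-up sample values (0.999999 and 1.000001, not from the question) to illustrate this; it is not a claim that one approach is generally better, since rounding has its own bin edges at the half steps.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two values that straddle 1.0
df = pd.DataFrame({'text': ['aaa', 'aaa'],
                   'result': [0.999999, 1.000001]})

# Truncation at two decimals: the scaled values become 99.0 and 100.0,
# so the rows are NOT treated as duplicates.
trunc_key = np.trunc(df['result'].to_numpy() * 100)
kept_trunc = df[~df.assign(result=trunc_key).duplicated()]

# Rounding at two decimals: both values become 1.0,
# so the second row IS treated as a duplicate.
kept_round = df[~df.round({'result': 2}).duplicated()]

print(len(kept_trunc), len(kept_round))
```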
piRSquared