
Following an answer from here, I am trying to remove rows from one dataframe that are present in another dataframe.

It works well for this input:

csv1:

sale_date,price,latitude,longitude
Wed May 21 00:00:00 EDT 2008,141000,38.423251,-121.444489
Wed May 21 00:00:00 EDT 2008,146250,38.48742

csv2:

sale_date,price,latitude,longitude
Wed May 21 00:00:00 EDT 2008,146250,38.48742

Code:

>>> a = pd.read_csv('../test.csv', escapechar='\\')
>>> a
                      sale_date   price   latitude   longitude
0  Wed May 21 00:00:00 EDT 2008  141000  38.423251 -121.444489
1  Wed May 21 00:00:00 EDT 2008  146250  38.487420         NaN

>>> b = pd.read_csv('../test1.csv', escapechar='\\')
>>> b
                      sale_date   price  latitude  longitude
0  Wed May 21 00:00:00 EDT 2008  146250  38.48742        NaN

>>> pd.concat([a,b]).drop_duplicates(keep=False)
                      sale_date   price   latitude   longitude
0  Wed May 21 00:00:00 EDT 2008  141000  38.423251 -121.444489

This works as expected. But as soon as there are more rows in the first csv, it no longer works.


Scenario 2: an extra row in csv1

csv1:

sale_date,price,latitude,longitude
Wed May 21 00:00:00 EDT 2008,141000,38.423251,-121.444489
Wed May 21 00:00:00 EDT 2008,146250,38.48742
Wed May 21 00:00:00 EDT 2008,147308,38.658246a,-121.375469a

csv2:

sale_date,price,latitude,longitude
Wed May 21 00:00:00 EDT 2008,146250,38.48742

Code:

>>> a = pd.read_csv('../test.csv', escapechar='\\')
>>> a
                      sale_date   price    latitude     longitude
0  Wed May 21 00:00:00 EDT 2008  141000   38.423251   -121.444489
1  Wed May 21 00:00:00 EDT 2008  146250    38.48742           NaN
2  Wed May 21 00:00:00 EDT 2008  147308  38.658246a  -121.375469a

>>> b = pd.read_csv('../test1.csv', escapechar='\\')
>>> b
                      sale_date   price  latitude  longitude
0  Wed May 21 00:00:00 EDT 2008  146250  38.48742        NaN

>>> pd.concat([a,b]).drop_duplicates(keep=False)
                      sale_date   price    latitude     longitude
0  Wed May 21 00:00:00 EDT 2008  141000   38.423251   -121.444489
1  Wed May 21 00:00:00 EDT 2008  146250    38.48742           NaN
2  Wed May 21 00:00:00 EDT 2008  147308  38.658246a  -121.375469a
0  Wed May 21 00:00:00 EDT 2008  146250     38.4874           NaN

Notice that the latitude of the duplicated row coming from b also shows as 38.4874 instead of 38.48742 in the concatenated output.

Am I missing something here, or does pandas have a bug?

MohitC
  • I cannot run python now, but it is interesting that in the `concat` the `latitude` of `b` is truncated to `38.4874`. Therefore, it doesn't match the `38.48742` of the earlier row – Jondiedoop Apr 16 '19 at 08:35
  • @Jondiedoop yes even I noticed that and cross checked the value in csv. It is indeed truncating – MohitC Apr 16 '19 at 08:37
  • np.NaN == np.NaN is False, so you could probably replace the NaN values with some constant, and then drop_duplicates may work... – durjoy Apr 16 '19 at 08:39
  • I wonder whether the non-numeric value of index 2 could have anything to do with it. Could you add the `dtypes` of the `latitude` column in all cases? – Jondiedoop Apr 16 '19 at 08:39
  • Nice catch @Jondiedoop. That was the problem. Thanks – MohitC Apr 16 '19 at 08:50

1 Answer


As @ayhan commented, the problem is that the first DataFrame has strings mixed in with the numeric values in the latitude and longitude columns, so those columns are cast to strings (object dtype).

In the other DataFrame, the same columns are cast to floats by default.
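
A quick check of the dtypes confirms this; a hypothetical session (not part of the original answer) should show roughly:

>>> a.dtypes
sale_date    object
price         int64
latitude     object
longitude    object
dtype: object
>>> b.dtypes
sale_date     object
price          int64
latitude     float64
longitude    float64
dtype: object

Because the string '38.48742' and the float 38.48742 are different values, drop_duplicates never pairs them. The apparent truncation to 38.4874 in the question is just pandas' default float display precision of six significant digits; the underlying value is unchanged.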

One possible solution is to use the dtype parameter when reading the b DataFrame:

b = pd.read_csv('../test1.csv', escapechar='\\', dtype={'latitude':str, 'longitude':str})

df = pd.concat([a,b]).drop_duplicates(keep=False)
print (df)
                      sale_date   price    latitude     longitude
0  Wed May 21 00:00:00 EDT 2008  141000   38.423251   -121.444489
2  Wed May 21 00:00:00 EDT 2008  147308  38.658246a  -121.375469a
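
Reading those two columns as strings makes them object dtype in both frames, so identical text compares equal. drop_duplicates also treats NaN values as equal to one another, which is why the rows with the missing longitude pair up and get dropped.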

Or use to_numeric on the columns of a:

a['latitude'] = pd.to_numeric(a['latitude'], errors='ignore')
a['longitude'] = pd.to_numeric(a['longitude'], errors='ignore')
df = pd.concat([a,b]).drop_duplicates(keep=False)
print (df)
                      sale_date   price    latitude     longitude
0  Wed May 21 00:00:00 EDT 2008  141000   38.423251   -121.444489
2  Wed May 21 00:00:00 EDT 2008  147308  38.658246a  -121.375469a
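
One caveat worth flagging (an editor's note, not from the original answer): to_numeric with errors='ignore' returns its input unchanged as soon as any value fails to parse, so here a's latitude and longitude stay strings because of the 38.658246a row. The output above therefore assumes b still holds strings from the dtype=str read shown first. A minimal illustration:

>>> pd.to_numeric(pd.Series(['38.423251', '38.658246a']), errors='ignore')
0     38.423251
1    38.658246a
dtype: object

(In recent pandas versions errors='ignore' is deprecated; reading or converting both frames to str avoids the ambiguity.)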
jezrael
  • Thanks. Is there a way for me to make every column's dtype str? I really don't care about dtypes – MohitC Apr 16 '19 at 08:47
  • @MohitC - then use `a = pd.read_csv('../test.csv', escapechar='\\', dtype={'latitude':str, 'longitude':str})` – jezrael Apr 16 '19 at 08:49
  • That's what I am asking: is there a way to specify dtype as str for all columns with a single parameter? – MohitC Apr 16 '19 at 08:50
  • 1
    @MohitC - sure, use `a = pd.read_csv('../test.csv', escapechar='\\', dtype=str)` – jezrael Apr 16 '19 at 08:50
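
For completeness, a minimal end-to-end sketch of that last suggestion (same hypothetical file paths as in the question):

import pandas as pd

# Read every column as string so mixed numeric/text columns compare
# consistently across the two frames.
a = pd.read_csv('../test.csv', escapechar='\\', dtype=str)
b = pd.read_csv('../test1.csv', escapechar='\\', dtype=str)

# keep=False drops every copy of a duplicated row, so rows shared by
# both frames disappear entirely, leaving only rows unique to `a`.
df = pd.concat([a, b]).drop_duplicates(keep=False)
print(df)

If b may contain rows that are absent from a, concatenating b twice, i.e. pd.concat([a, b, b]), guarantees every row of b is duplicated and therefore dropped.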