I am using drop_duplicates on a DataFrame of Twitter users to delete duplicates.
I have two rows that represent the exact same user, but pandas does not seem to recognize them as duplicates, apparently because of the NaN values in the description field.
In other words, the test W.iloc[10,8]==W.iloc[11,8] returns False (column 8 is description, and rows 10 and 11 are the duplicated rows). At the same time, np.isnan(W.iloc[10,8]) returns True, as does np.isnan(W.iloc[11,8]).
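The comparison described above can be reproduced on a small scale. This is only a minimal sketch: `W_toy` is a hypothetical two-row stand-in for the real frame `W`, with the description column at position 1 instead of 8.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for W (not the real frame).
W_toy = pd.DataFrame({
    "screen_name": ["Mich00299495", "Mich00299495"],
    "description": [np.nan, np.nan],
})

a = W_toy.iloc[0, 1]
b = W_toy.iloc[1, 1]

# NaN never compares equal to anything, including another NaN.
same_by_eq = bool(a == b)

# Both values are nevertheless NaN.
both_nan = bool(np.isnan(a) and np.isnan(b))

# A NaN-aware comparison: equal values, or both missing.
same_nan_aware = bool(a == b or (pd.isna(a) and pd.isna(b)))

print(same_by_eq, both_nan, same_nan_aware)
```

So a False result from == on two NaN cells is expected from floating-point rules alone, and does not by itself show that the two rows differ.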
As such, the function drop_duplicates does not work for these two rows.
Any idea of what is happening?
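One way to narrow this down might be to list the columns in which the two rows actually differ, treating NaN as equal to NaN. The sketch below is an assumption-laden illustration: `differing_columns` is a hypothetical helper, and `toy` is made-up data, not the real frame.

```python
import numpy as np
import pandas as pd

def differing_columns(df, i, j):
    """Return labels of columns where the rows at positions i and j differ,
    treating two missing values as equal (hypothetical helper)."""
    r1, r2 = df.iloc[i], df.iloc[j]
    both_missing = r1.isna() & r2.isna()
    differs = (r1 != r2) & ~both_missing
    return list(df.columns[differs.to_numpy()])

# Hypothetical toy data with NaN in the description column.
toy = pd.DataFrame({
    "screen_name": ["Mich00299495", "Mich00299495"],
    "description": [np.nan, np.nan],
    "followers_count": [2.0, 2.0],
})

diff_cols = differing_columns(toy, 0, 1)

# drop_duplicates returns a new frame; the result has to be assigned.
deduped = toy.drop_duplicates()

print(diff_cols, len(deduped))
```

On this toy frame the helper reports no differing columns and drop_duplicates keeps a single row, so on the real data it may be worth checking with something like differing_columns(W, 10, 11) whether another column differs, and whether the result of drop_duplicates is being assigned back.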
Here are the two rows:
thanks for helping
Mauro
id created_at lang screen_name name \
226080 710412633443332096 2016-03-17 10:29:05 en Mich00299495 Mich
226081 710412633443332096 2016-03-17 10:29:05 en Mich00299495 Mich
location default_profile default_profile_image description \
226080 Grenoble, France True True NaN
226081 Grenoble, France True True NaN
followers_count ... geo_enabled \
226080 2.0 ... False
226081 2.0 ... False
profile_image_url_https protected time_zone \
226080 https://abs.twimg.com/sticky/default_profile_i... False NaN
226081 https://abs.twimg.com/sticky/default_profile_i... False NaN
verified favourites_count statuses_count sex name_F name_M
226080 False 0.0 0.0 none none none
226081 False 0.0 0.0 none none none