
I am using drop_duplicates on a df with Twitter users.

I want to delete duplicates. I have 2 rows which represent the exact same user, but pandas does not seem to recognize them as duplicates, apparently because of the NaN values in the description field.

In other words, the test W.iloc[10,8]==W.iloc[11,8] returns False (column 8 is description, and rows 10 and 11 are the duplicated rows). At the same time, np.isnan(W.iloc[10,8]) returns True, as does np.isnan(W.iloc[11,8]).
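For reference, here is a minimal sketch of that comparison on made-up data (the two-row frame and its values are hypothetical, not my real data). NaN never compares equal to itself with `==`, so the False result is expected on its own. As far as I can tell, `duplicated` and `drop_duplicates` nevertheless treat NaN values in the same column as matching:

```python
import numpy as np
import pandas as pd

# NaN is never equal to itself under ==, so this comparison is False
a = np.nan
b = np.nan
nan_equal = (a == b)

# A NaN-aware check: both values are missing
both_missing = pd.isna(a) and pd.isna(b)

# Hypothetical two-row frame with NaN in the description column
df = pd.DataFrame({"id": [1, 1], "description": [np.nan, np.nan]})

# duplicated() marks the second row as a repeat of the first,
# so drop_duplicates() keeps a single row here
dup_mask = df.duplicated()
deduped = df.drop_duplicates()

print(nan_equal, both_missing)
print(dup_mask.tolist(), len(deduped))
```

If that holds, the NaN in description alone would not explain why the two rows survive, and some other column may differ.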

As a result, drop_duplicates does not remove either of these 2 rows.
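One way to narrow this down is to compare the two rows column by column while treating NaN as equal to NaN. This is only a sketch: `differing_columns` is a helper name made up for illustration, and the sample frame (including the trailing space in screen_name) is hypothetical, not taken from my data.

```python
import numpy as np
import pandas as pd

def differing_columns(df, pos_a, pos_b):
    """Return the columns whose values differ between two rows (by position).

    Two NaN values in the same column are counted as equal.
    """
    a = df.iloc[pos_a]
    b = df.iloc[pos_b]
    # Equal values, or both missing, count as "same"
    same = (a == b) | (a.isna() & b.isna())
    return [col for col, ok in same.items() if not ok]

# Hypothetical sample: the second screen_name has a trailing space
sample = pd.DataFrame({
    "id": [710412633443332096, 710412633443332096],
    "screen_name": ["Mich00299495", "Mich00299495 "],
    "description": [np.nan, np.nan],
    "followers_count": [2.0, 2.0],
})

diff = differing_columns(sample, 0, 1)
print(diff)
```

On the real frame this would be called as differing_columns(W, 10, 11). An empty list would mean the rows match in every column under this NaN-aware comparison.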

Any idea of what is happening?

Here are the 2 rows (shown below).

Thanks for helping,

Mauro

                        id           created_at lang   screen_name  name  \
226080  710412633443332096  2016-03-17 10:29:05   en  Mich00299495  Mich   
226081  710412633443332096  2016-03-17 10:29:05   en  Mich00299495  Mich   

                location default_profile default_profile_image description  \
226080  Grenoble, France            True                  True         NaN   
226081  Grenoble, France            True                  True         NaN   

        followers_count  ...    geo_enabled  \
226080              2.0  ...          False   
226081              2.0  ...          False   

                                  profile_image_url_https protected time_zone  \
226080  https://abs.twimg.com/sticky/default_profile_i...     False       NaN   
226081  https://abs.twimg.com/sticky/default_profile_i...     False       NaN   

       verified favourites_count  statuses_count   sex name_F name_M  
226080    False              0.0             0.0  none   none   none  
226081    False              0.0             0.0  none   none   none  
Mauro Gentile