Drop duplicates with NaN

Question

I am using drop_duplicates on a df with Twitter users.

I want to delete duplicates. I have 2 rows that represent the exact same user, but pandas does not recognize them as duplicates, apparently because of the NaN values in the description field.

In other words, the test W.iloc[10,8]==W.iloc[11,8] returns False (column 8 is description, and rows 10 and 11 are the duplicated rows). At the same time, np.isnan(W.iloc[10,8]) returns True, as does np.isnan(W.iloc[11,8]).

As a result, drop_duplicates does not work for these 2 rows.

Any idea of what is happening?

Here are the 2 rows:

Thanks for helping.

Mauro

                        id           created_at lang   screen_name  name  \
226080  710412633443332096  2016-03-17 10:29:05   en  Mich00299495  Mich   
226081  710412633443332096  2016-03-17 10:29:05   en  Mich00299495  Mich   

                location default_profile default_profile_image description  \
226080  Grenoble, France            True                  True         NaN   
226081  Grenoble, France            True                  True         NaN   

        followers_count  ...    geo_enabled  \
226080              2.0  ...          False   
226081              2.0  ...          False   

                                  profile_image_url_https protected time_zone  \
226080  https://abs.twimg.com/sticky/default_profile_i...     False       NaN   
226081  https://abs.twimg.com/sticky/default_profile_i...     False       NaN   

       verified favourites_count  statuses_count   sex name_F name_M  
226080    False              0.0             0.0  none   none   none  
226081    False              0.0             0.0  none   none   none

Check this http://stackoverflow.com/questions/19322506/pandas-dataframes-with-nans-equality-comparison — Piyush S. Wanare, Nov 23 '16 at 11:43
`NaN` values cannot be compared using equality operator, you need to consider whether to fill those with some sensible value, after which `drop_duplicates` will then work — EdChum, Nov 23 '16 at 12:17
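A minimal sketch of the fill-then-deduplicate idea from the comment above, assuming pandas and NumPy are installed. The `other_user` row, the `hello` description, and the empty-string placeholder are made up for illustration; pick whatever fill value is sensible for your data.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to itself with ==
nan_equal = (np.nan == np.nan)
print(nan_equal)  # False

# Toy frame (hypothetical data): two identical users with a missing
# description, plus one unrelated user
users = pd.DataFrame({
    "screen_name": ["Mich00299495", "Mich00299495", "other_user"],
    "description": [np.nan, np.nan, "hello"],
})

# Replace missing descriptions with a placeholder, then deduplicate
filled = users.fillna({"description": ""})
deduped = filled.drop_duplicates()
print(len(deduped))  # 2
```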
For me, drop_duplicates works perfectly with `NaN`. What does `print (W.iloc[10]==W.iloc[11])` return? Are the False values only in the NaN columns, or is there at least one other `False`? You can also inspect the two rows with `print (W.iloc[10:12].T)`. — jezrael, Nov 23 '16 at 12:32
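One possible way to run the check suggested above while ignoring positions where both rows are NaN. This is only a sketch: the small frame `W` here is a made-up stand-in for the real data, and `followers_count` is deliberately set to differ so that the comparison has something to report.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real frame W
W = pd.DataFrame({
    "id": [1, 1],
    "description": [np.nan, np.nan],
    "followers_count": [2.0, 3.0],
})

a, b = W.iloc[0], W.iloc[1]

# Treat a position as "same" if the values are equal OR both are missing
same = (a == b) | (a.isna() & b.isna())

# Columns that genuinely differ between the two rows
differing = same[~same].index.tolist()
print(differing)  # ['followers_count']
```

If the list comes back empty for the two suspect rows, the rows only differ in NaN positions; otherwise the listed columns are worth inspecting.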
I must tell you that I am not proficient with Python. I am learning — Mauro Gentile, Nov 23 '16 at 12:43
`print (W.iloc[0]==W.iloc[1])` gives: id True, created_at True, lang True, screen_name True, name True, location True, default_profile True, default_profile_image True, description False, followers_count True, friends_count True, geo_enabled True, profile_image_url_https True, protected True, time_zone False, verified True — Mauro Gentile, Nov 23 '16 at 12:46
Reformulating, it says that time_zone and description of the 2 lines are not the same — Mauro Gentile, Nov 23 '16 at 12:46
All the other fields are the same (name, id, screen name etc...) as it must be as it is actually the same user — Mauro Gentile, Nov 23 '16 at 12:47
By the way, I am working with python 3. Could this explain something? — Mauro Gentile, Nov 23 '16 at 12:52
Now I am even more confused. In the console, if I try `data = pd.DataFrame({'k1': ['one'] * 5 + ['two'] * 6, 'k2': [1, np.nan,1,np.nan, 2, 3, 3, 4,np.nan, 4,np.nan]})` and then `data.drop_duplicates()`, it works properly and drops the duplicated rows — Mauro Gentile, Nov 23 '16 at 13:17
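The console experiment above can be written out as a small script. It suggests that `duplicated` and `drop_duplicates` treat NaN values in the same column as matching, unlike the element-wise `==` test. This is a sketch that reuses the data from the comment; only the variable names `mask` and `deduped` are added here.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "k1": ["one"] * 5 + ["two"] * 6,
    "k2": [1, np.nan, 1, np.nan, 2, 3, 3, 4, np.nan, 4, np.nan],
})

# True for every row that repeats an earlier row, NaN included
mask = data.duplicated()

# Keep only the first occurrence of each (k1, k2) pair
deduped = data.drop_duplicates()

print(mask.sum())               # 5
print(deduped.index.tolist())   # [0, 1, 4, 5, 7, 8]
```

So if two rows of the real frame survive `drop_duplicates`, some column other than the NaN ones probably differs between them.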

0 Answers