
I have a DataFrame with more than 200 features; here is a part of the dataset to show the problem:

   index  ID   X1   X2       Date1      Y1     
      0   2   324  634  2016-01-01     NaN    
      1   2   324  634  2016-01-01   1224.0    
      3   4   543  843  2017-02-01     654    
      4   4   543  843  2017-02-01     NaN    
      5   5   523  843  2015-09-01     NaN    
      6   5   523  843  2015-09-01    1121.0
      7   6   500  897  2015-11-01     NaN  

As you can see, some rows are duplicated (in ID, X1, X2 and Date1). For each pair of rows that match on ID, X1, X2 and Date1, I want to remove the one whose Y1 is NaN. So my desired DataFrame should be:

   index  ID   X1   X2       Date1      Y1     
      1   2   324  634  2016-01-01   1224.0    
      3   4   543  843  2017-02-01     654    
      6   5   523  843  2015-09-01    1121.0
      7   6   500  897  2015-11-01     NaN 

Does anyone know how I can handle this?

Spedo
  • Isn't this just `df = df.dropna(['Y1'])`? – cs95 Dec 27 '18 at 16:20
  • Actually no, because there may be other rows which are not duplicated and contain NaN as well. I have edited my question. – Spedo Dec 27 '18 at 16:27
  • In that case, what about the row corresponding to X1 = 224 and X2 = 634? Shouldn't that also be present? – cs95 Dec 27 '18 at 16:35
  • Ahem... so, can you please tell me what happened to `0 2 324 634 2016-01-01 NaN ` and where it went? If you can explain where it went, that would help me give you an answer that actually works. – cs95 Dec 27 '18 at 16:43
  • Row 0 should be removed, since it is similar to row 1. The only difference between row 0 and row 1 is the last feature 'Y1'. In other words, I am interested in removing those redundant rows which have NaN in 'Y1'. – Spedo Dec 27 '18 at 16:50
  • But row 0 has X1 = 324 and row 1 has X1 = 334. Is that a typo and they were actually supposed to be the same thing? – cs95 Dec 27 '18 at 16:51
  • Yes, true, it is my mistake. – Spedo Dec 27 '18 at 16:56

2 Answers


Use sort_values on "Y1" to move the NaNs to the bottom of the DataFrame, then drop_duplicates to keep the first (non-NaN) row of each group:

df2 = (df.sort_values('Y1', na_position='last')
         .drop_duplicates(['ID', 'X1', 'X2', 'Date1'], keep='first')
         .sort_index())

df2
       ID   X1   X2       Date1      Y1
index                                  
1       2  324  634  2016-01-01  1224.0
3       4  543  843  2017-02-01   654.0
6       5  523  843  2015-09-01  1121.0
7       6  500  897  2015-11-01     NaN
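For reference, here is a minimal runnable sketch of this approach; the DataFrame construction is reconstructed from the sample in the question:

```python
import numpy as np
import pandas as pd

# Sample data from the question (index values mirror the original frame)
df = pd.DataFrame({
    'ID':    [2, 2, 4, 4, 5, 5, 6],
    'X1':    [324, 324, 543, 543, 523, 523, 500],
    'X2':    [634, 634, 843, 843, 843, 843, 897],
    'Date1': ['2016-01-01', '2016-01-01', '2017-02-01', '2017-02-01',
              '2015-09-01', '2015-09-01', '2015-11-01'],
    'Y1':    [np.nan, 1224.0, 654.0, np.nan, np.nan, 1121.0, np.nan],
}, index=[0, 1, 3, 4, 5, 6, 7])

# Non-NaN Y1 values sort ahead of NaNs, so keep='first' retains the
# non-NaN row of each duplicate group; lone rows survive untouched.
df2 = (df.sort_values('Y1', na_position='last')
         .drop_duplicates(['ID', 'X1', 'X2', 'Date1'], keep='first')
         .sort_index())

print(df2)
```

Row 7 is kept even though its Y1 is NaN, because it has no duplicate.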
cs95
  • Perfect, it works, tnx. Since my original DataFrame contains more than 100 features, is there any way to avoid spelling out `drop_duplicates(['ID', 'X1', 'X2', 'Date1'])`? – Spedo Dec 27 '18 at 17:05
  • 1
    @Spedo Yes, you can do something like `drop_duplicates(df.columns.difference(['Y1']).tolist())` – cs95 Dec 27 '18 at 17:06
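A sketch of the `columns.difference` variant from the comment above, which builds the key list dynamically as every column except 'Y1' (note that `columns.difference` returns the names sorted alphabetically, which does not matter for duplicate detection); the DataFrame here is reconstructed from the sample in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':    [2, 2, 4, 4, 5, 5, 6],
    'X1':    [324, 324, 543, 543, 523, 523, 500],
    'X2':    [634, 634, 843, 843, 843, 843, 897],
    'Date1': ['2016-01-01', '2016-01-01', '2017-02-01', '2017-02-01',
              '2015-09-01', '2015-09-01', '2015-11-01'],
    'Y1':    [np.nan, 1224.0, 654.0, np.nan, np.nan, 1121.0, np.nan],
}, index=[0, 1, 3, 4, 5, 6, 7])

# All columns except the value column 'Y1' act as the duplicate key,
# so this scales to a frame with 100+ feature columns unchanged.
key_cols = df.columns.difference(['Y1']).tolist()

df2 = (df.sort_values('Y1', na_position='last')
         .drop_duplicates(subset=key_cols, keep='first')
         .sort_index())
```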

Just use the drop_duplicates function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

df \
    .sort_values('Y1', ascending=False) \
    .drop_duplicates(subset='ID')
frankegoesdown