
I have a DataFrame with more than 200 features; here is a part of the dataset to show the problem:

   index  ID   X1   X2       Date1      Y1     
      0   2   324  634  2016-01-01     NaN    
      1   2   324  634  2016-01-01   1224.0    
      3   4   543  843  2017-02-01     654    
      4   4   543  843  2017-02-01     NaN    
      5   5   523  843  2015-09-01     NaN    
      6   5   523  843  2015-09-01    1121.0
      7   6   500  897  2015-11-01     NaN  

As you can see, some rows are duplicated (in ID, X1, X2 and Date1). For each pair of rows that match on ID, X1, X2 and Date1, I want to remove the one whose Y1 is NaN. So my desired DataFrame should be:

   index  ID   X1   X2       Date1      Y1     
      1   2   324  634  2016-01-01   1224.0    
      3   4   543  843  2017-02-01     654    
      6   5   523  843  2015-09-01    1121.0
      7   6   500  897  2015-11-01     NaN 

Does anyone know how I can handle this?

Spedo
  • Isn't this just `df = df.dropna(['Y1'])`? – cs95 Dec 27 '18 at 16:20
  • Actually no, because there may be other rows which are not duplicated and contain NaN as well. I have edited my question. – Spedo Dec 27 '18 at 16:27
  • In that case, what about the row corresponding to X1 = 224 and X2 = 634? Shouldn't that also be present? – cs95 Dec 27 '18 at 16:35
  • Ahem... so, can you please tell me what happened to `0 2 324 634 2016-01-01 NaN ` and where it went? If you can explain where it went, that would help me give you an answer that actually works. – cs95 Dec 27 '18 at 16:43
  • Row 0 should be removed, since it is similar to row 1. The only difference between row 0 and row 1 is the last feature 'Y1'. In other words, I am interested in removing those redundant rows which have NaN in 'Y1'. – Spedo Dec 27 '18 at 16:50
  • But row 0 has X1 = 324 and row 1 has X1 = 334. Is that a typo and they were actually supposed to be the same thing? – cs95 Dec 27 '18 at 16:51
  • Yes, true, it is my mistake. – Spedo Dec 27 '18 at 16:56

2 Answers


Use sort_values on "Y1" to move the NaNs to the bottom of the DataFrame, then drop_duplicates to keep the first (non-NaN) row of each group:

df2 = (df.sort_values('Y1', na_position='last')
         .drop_duplicates(['ID', 'X1', 'X2', 'Date1'], keep='first')
         .sort_index())

df2
       ID   X1   X2       Date1      Y1
index                                  
1       2  324  634  2016-01-01  1224.0
3       4  543  843  2017-02-01   654.0
6       5  523  843  2015-09-01  1121.0
7       6  500  897  2015-11-01     NaN
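For reference, here is a minimal runnable sketch of this approach; the DataFrame construction is reconstructed from the sample in the question:

```python
import numpy as np
import pandas as pd

# Sample data from the question (index values mirror the original frame)
df = pd.DataFrame({
    'ID':    [2, 2, 4, 4, 5, 5, 6],
    'X1':    [324, 324, 543, 543, 523, 523, 500],
    'X2':    [634, 634, 843, 843, 843, 843, 897],
    'Date1': ['2016-01-01', '2016-01-01', '2017-02-01', '2017-02-01',
              '2015-09-01', '2015-09-01', '2015-11-01'],
    'Y1':    [np.nan, 1224.0, 654.0, np.nan, np.nan, 1121.0, np.nan],
}, index=[0, 1, 3, 4, 5, 6, 7])

# Non-NaN Y1 values sort ahead of NaNs, so keep='first' retains the
# non-NaN row of each duplicate group; lone rows survive untouched.
df2 = (df.sort_values('Y1', na_position='last')
         .drop_duplicates(['ID', 'X1', 'X2', 'Date1'], keep='first')
         .sort_index())

print(df2)
```

Row 7 is kept even though its Y1 is NaN, because it has no duplicate.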
cs95
  • Perfect, it works, tnx. Since my original DataFrame contains more than 100 features, is there any way to avoid spelling out `drop_duplicates(['ID', 'X1', 'X2', 'Date1'])`? – Spedo Dec 27 '18 at 17:05
  • 1
    @Spedo Yes, you can do something like `drop_duplicates(df.columns.difference(['Y1']).tolist())` – cs95 Dec 27 '18 at 17:06
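A sketch of the `columns.difference` variant from the comment above, which builds the key list dynamically as every column except 'Y1' (note that `columns.difference` returns the names sorted alphabetically, which does not matter for duplicate detection); the DataFrame here is reconstructed from the sample in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':    [2, 2, 4, 4, 5, 5, 6],
    'X1':    [324, 324, 543, 543, 523, 523, 500],
    'X2':    [634, 634, 843, 843, 843, 843, 897],
    'Date1': ['2016-01-01', '2016-01-01', '2017-02-01', '2017-02-01',
              '2015-09-01', '2015-09-01', '2015-11-01'],
    'Y1':    [np.nan, 1224.0, 654.0, np.nan, np.nan, 1121.0, np.nan],
}, index=[0, 1, 3, 4, 5, 6, 7])

# All columns except the value column 'Y1' act as the duplicate key,
# so this scales to a frame with 100+ feature columns unchanged.
key_cols = df.columns.difference(['Y1']).tolist()

df2 = (df.sort_values('Y1', na_position='last')
         .drop_duplicates(subset=key_cols, keep='first')
         .sort_index())
```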

Just use the drop_duplicates function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

df \
    .sort_values('Y1', ascending=False) \
    .drop_duplicates(subset='ID')
frankegoesdown