1

I have a large pandas dataframe with several NaN values in different columns. Each NaN value have an associated ID, I would like to impute those NaN values with the associated id value. For example, consider:

ID  COL
1   23
1   NaN
1   NaN
1   NaN
1   NaN
2   21
2   NaN
2   NaN
2   NaN
3   25
3   NaN
3   NaN

As you can see 1 is associated to 23, therefore all the ids that have 1 must be imputed with 23 and so one for the other cases. For example, the expected output would be:

ID  COL
1   23
1   23
1   23
1   23
1   23
2   21
2   21
2   21
2   21
3   25
3   25
3   25

How can I do such operation with pandas?, my problem is that I do not know how to handle the previous value and replace it with the its id.

UPDATE

After reading the answers from this question and other associated questions I tried to:

df.sort_values(['ID','COL']).ffill()

However is not working. It is not replacing the values with those associated to the IDs, the reason is that maybe my COL values are strings. Any idea of how to deal with this?

john doe
  • 2,233
  • 7
  • 37
  • 58
  • 1
    `df.COL.ffill() ` – BENY Apr 23 '18 at 17:51
  • 3
    @Wen Almost buddy, it needs a groupby too – cs95 Apr 23 '18 at 17:51
  • 1
    @ayhan don't let it be said I'm not a man of my word. As soon as the duplicate is found, the answer is deleted, thanks – cs95 Apr 23 '18 at 17:59
  • @ayhan 1. It was their choice to leave. 2. They'd been answering duplicates for a while until I spoke to them. 3. I, um, searched for a duplicate, couldn't find one, and decided I may as well answer before someone else. – cs95 Apr 23 '18 at 18:08
  • I also look for the solution before posting this question. Maybe the title of the other question do not help. – john doe Apr 23 '18 at 18:10
  • @ayhan I'm not saying I'm right, or you're wrong. I acknowledge everything you just said to me, which is why I deleted the answer. Sure, I need to be consistent, and deleting my answer was my way of showing that. I'm sorry if Ami Tavory took offence to whatever I said to him, but he's not the first person I've criticised for answering dupes, nor is he going to be the last. – cs95 Apr 23 '18 at 18:11
  • Actually I just tried the solution and is not working. – john doe Apr 23 '18 at 18:29

1 Answers1

4

sort_values with ffill

df.COL=df.sort_values(['ID','COL']).COL.ffill()
Out[381]: 
    ID   COL
0    1  23.0
1    1  23.0
2    1  23.0
3    1  23.0
4    1  23.0
5    2  21.0
6    2  21.0
7    2  21.0
8    2  21.0
9    3  25.0
10   3  25.0
11   3  25.0
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Thanks for the help, but why is it transforming the values into floats? – john doe Apr 23 '18 at 17:53
  • 1
    @johndoe because nans are floats, so the rest of your numbers become floats. Try `df.sort_values(['ID','COL']).ffill().astype(int)` – cs95 Apr 23 '18 at 18:00
  • Thanks for the help again... I just tried this solution and is not working, note that my pandas dataframe have more columns (actually there are 20000 columns), but I just want to do this operation for two columns of the dataframe. – john doe Apr 23 '18 at 18:31
  • 1
    @johndoe check the update – BENY Apr 23 '18 at 18:40