1

I want to fill NA in my dataset based on the value from the other row in the same group.

The data looks like this.

 group_id, start_time, end_time  
    1, NA, 20000
    1, 40000, 20000
    1, 30000, NA
    2, NA, 35000
    2, 45000, 22000
    2, 50000,21000
    2, 45000, NA

I want to get this result:

 group_id, start_time, end_time  
        1, 30000, 20000
        1, 40000, 20000
        1, 30000, 20000
        2, 45000, 35000
        2, 45000, 22000
        2, 50000,21000
        2, 45000, 35000

So the first and last value on each group are same on both start_time, and end_time .

Dan
  • 45,079
  • 17
  • 88
  • 157
nne
  • 37
  • 3
  • So you want to fill it in with the group minimum for start_time and the group maximum for end_time? – Dan May 28 '20 at 10:57
  • So the value of start_time and end_time in both first and last row of group are the same. As can be seen that the first NA in group _id 1 is 30000 , which is based on the third row. – nne May 28 '20 at 11:01
  • Hi, welcome to SO! Please, take a look at [how-to-ask](https://stackoverflow.com/help/how-to-ask). It will be useful for you and the community. Tell us what you have tried so far. – Francesco May 28 '20 at 11:11
  • @FrancescoLucianò. Thanks for your reminder. Noted! – nne May 28 '20 at 11:13
  • Does this answer your question? [Pandas: filling missing values by mean in each group](https://stackoverflow.com/questions/19966018/pandas-filling-missing-values-by-mean-in-each-group) – Dan May 28 '20 at 11:22

3 Answers3

1

You can do this using fillna, groupby, tranform and the first or last aggregation functions, as explained in this answer

df['start_time'] = df['start_time'].fillna(df.groupby('group_id')['start_time'].transform('last'))
df['end_time'] = df['end_time'].fillna(df.groupby('group_id')['end_time'].transform('first'))
Dan
  • 45,079
  • 17
  • 88
  • 157
0

You can use numpy.where to select a value from one column (of the default values) if the condition for that row (is.na) matches, else select values from some other column (in this case, the same column).

import pandas as pd
import numpy as np
from io import StringIO

TESTDATA = """
group_id,start_time,end_time
1,NA,20000
1,40000,20000
1,30000,NA
2,NA,35000
2,45000,22000
2,50000,21000
2,45000,NA
"""
df = pd.read_csv(StringIO(TESTDATA), sep=",") # parse your sample data

In this case, the default values you asked for are the first/last of each group, so we will need to construct columns for these:

start_time_last = df.groupby("group_id")["start_time"].last()
end_time_first = df.groupby("group_id")["end_time"].first()
merged = df.join(start_time_last, on='group_id', how='left', rsuffix='_last').join(end_time_first, on='group_id', how='left', rsuffix='_first')

Now we can use np.where to fill in the na values:

merged["start_time"] = np.where(pd.isna(merged["start_time"]), merged["start_time_last"], merged["start_time"])
merged["end_time"] = np.where(pd.isna(merged["end_time"]), merged["end_time_first"], merged["end_time"])
print(merged)

Gives:

   group_id  start_time  end_time  start_time_last  end_time_first
0         1     30000.0   20000.0          30000.0         20000.0
1         1     40000.0   20000.0          30000.0         20000.0
2         1     30000.0   20000.0          30000.0         20000.0
3         2     45000.0   35000.0          45000.0         35000.0
4         2     45000.0   22000.0          45000.0         35000.0
5         2     50000.0   21000.0          45000.0         35000.0
6         2     45000.0   35000.0          45000.0         35000.0
anjsimmo
  • 704
  • 5
  • 18
  • 1
    You can use `transform` to avoid needing to do all that joining. Once you have that, you can just use fillna to fill in the values. – Dan May 28 '20 at 11:21
  • @Dan Didn't notice your answer until after I posted, but your solution using `transform` is much more elegant! – anjsimmo May 28 '20 at 11:25
-1

You could make a loop to iterate over that and if the value is NA change it

Tiago Oliveira
  • 47
  • 1
  • 12