You can use numpy.where to select the value from one column (holding the default values) when the condition for that row (here, isna) is true, and otherwise keep the value from another column (in this case, the same column).
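For anyone unfamiliar with np.where: it evaluates a boolean condition elementwise and picks from the first alternative where the condition holds and from the second where it does not. A tiny standalone sketch with made-up arrays, unrelated to your data:

import numpy as np

values = np.array([np.nan, 2.0, np.nan])
defaults = np.array([10.0, 20.0, 30.0])

# take the default wherever values is NaN, otherwise keep the original value
print(np.where(np.isnan(values), defaults, values))  # [10.  2. 30.]

Applied to your sample data: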
import pandas as pd
import numpy as np
from io import StringIO
TESTDATA = """
group_id,start_time,end_time
1,NA,20000
1,40000,20000
1,30000,NA
2,NA,35000
2,45000,22000
2,50000,21000
2,45000,NA
"""
df = pd.read_csv(StringIO(TESTDATA), sep=",")  # parse your sample data; the NA entries are read in as NaN
In this case, the defaults you asked for are the last start_time and the first end_time of each group, so we first need to build columns holding those per-group values:
start_time_last = df.groupby("group_id")["start_time"].last()   # last non-null start_time per group
end_time_first = df.groupby("group_id")["end_time"].first()     # first non-null end_time per group

merged = (
    df.join(start_time_last, on="group_id", how="left", rsuffix="_last")
      .join(end_time_first, on="group_id", how="left", rsuffix="_first")
)
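As an aside, groupby(...).transform can broadcast those per-group values straight back onto every row, which avoids the join entirely. This is just an equivalent sketch (the *_col variable names are my own), not something the np.where step below requires:

# alternative: broadcast the per-group defaults to each row without a join
start_time_last_col = df.groupby("group_id")["start_time"].transform("last")
end_time_first_col = df.groupby("group_id")["end_time"].transform("first")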
Now we can use np.where to fill in the NaN values:
merged["start_time"] = np.where(pd.isna(merged["start_time"]), merged["start_time_last"], merged["start_time"])
merged["end_time"] = np.where(pd.isna(merged["end_time"]), merged["end_time_first"], merged["end_time"])
print(merged)
Gives:
   group_id  start_time  end_time  start_time_last  end_time_first
0         1     30000.0   20000.0          30000.0         20000.0
1         1     40000.0   20000.0          30000.0         20000.0
2         1     30000.0   20000.0          30000.0         20000.0
3         2     45000.0   35000.0          45000.0         35000.0
4         2     45000.0   22000.0          45000.0         35000.0
5         2     50000.0   21000.0          45000.0         35000.0
6         2     45000.0   35000.0          45000.0         35000.0
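Since the condition here is simply missing-ness, the same filling can also be written with Series.fillna, which aligns on the index. A sketch equivalent to the two np.where lines above:

# equivalent to the np.where calls: fill NaN from the per-group default columns
merged["start_time"] = merged["start_time"].fillna(merged["start_time_last"])
merged["end_time"] = merged["end_time"].fillna(merged["end_time_first"])

Both give the same result here; np.where is the more general tool when the condition is something other than NaN.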