My current problem goes as follows. Consider the dataframe:
name score date
0 Alice 22.0 2020-12-31
15 Alice 4.0 2005-12-31
5 Alice 1.0 2003-12-31
10 Alice NaN 2000-12-31
6 Bob 8.0 2001-11-02
16 Bob 1.0 2004-11-02
11 Bob 5.0 2003-11-02
1 Bob 45.0 1980-11-02
12 Chuck 9.0 2003-12-03
2 Chuck 4.0 2015-12-03
7 Chuck 9.0 2001-12-03
17 Chuck 2.0 2004-12-03
18 Daren 3.0 2004-03-13
21 Daren 89.0 2015-08-13
3 Daren NaN 2015-03-13
20 Daren 12.0 2015-05-13
8 Daren 5.0 2015-03-17
13 Daren 93.0 2003-03-13
14 Elisa 11.0 2003-01-24
9 Elisa 19.0 2001-01-24
19 Elisa 23.0 2004-01-24
4 Elisa 16.0 2010-01-24
Some score
values are nan
. I would like to replace these values with the person's closest score (w.r.t. time
), GIVEN that the time difference is within 1 year, i.e. 365 days. If there is no such time, then keep the value as nan
.
In the dataframe above, Daren's nan
score value would be replaced to 5, as there is a score which was recorded within a year of the nan
date and it is the smallest time difference from the other possible scores. However for Alice, there is no such score which was recorded within a year of the nan
score, thus her nan
score will be kept as nan
.
To reproduce the dataframe, use this code:
import numpy as np
import pandas as pd
import datetime
data = {"name": ['Alice', 'Bob', 'Chuck', 'Daren', 'Elisa', 'Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Daren','Daren'],
"score": [22,45,4,np.nan,16,1,8,9,5,19,np.nan,5,9,93,11,4,1,2,3,23,12,89],
"date": ['31/12/2020','11/02/1980','12/03/2015','13/03/2015','24/01/2010','31/12/2003','11/02/2001','12/03/2001','17/03/2015','24/01/2001','31/12/2000','11/02/2003','12/03/2003','13/03/2003','24/01/2003','31/12/2005','11/02/2004','12/03/2004','13/03/2004','24/01/2004','13/05/2015','13/08/2015']}
df = pd.DataFrame(data = data)
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by='name')