Replace nan values with the value of the closest date in a dataframe

Question

My current problem goes as follows. Consider the dataframe:

       name     score   date
0      Alice    22.0    2020-12-31
15     Alice    4.0     2005-12-31
5      Alice    1.0     2003-12-31
10     Alice    NaN     2000-12-31
6      Bob      8.0     2001-11-02
16     Bob      1.0     2004-11-02
11     Bob      5.0     2003-11-02
1      Bob      45.0    1980-11-02
12     Chuck    9.0     2003-12-03
2      Chuck    4.0     2015-12-03
7      Chuck    9.0     2001-12-03
17     Chuck    2.0     2004-12-03
18     Daren    3.0     2004-03-13
21     Daren    89.0    2015-08-13
3      Daren    NaN     2015-03-13
20     Daren    12.0    2015-05-13
8      Daren    5.0     2015-03-17
13     Daren    93.0    2003-03-13
14     Elisa    11.0    2003-01-24
9      Elisa    19.0    2001-01-24
19     Elisa    23.0    2004-01-24
4      Elisa    16.0    2010-01-24

Some score values are nan. I would like to replace each of these values with the same person's closest score (w.r.t. time), provided that the time difference is within 1 year, i.e. 365 days. If there is no such score, the value should stay nan.

In the dataframe above, Daren's nan score would be replaced with 5, because that score was recorded within a year of the nan date and has the smallest time difference of all his other scores. For Alice, however, no score was recorded within a year of her nan score, so it stays nan.

To reproduce the dataframe, use this code:

import numpy as np
import pandas as pd
import datetime

data = {"name": ['Alice', 'Bob', 'Chuck', 'Daren', 'Elisa', 'Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Daren','Daren'],
        "score": [22,45,4,np.nan,16,1,8,9,5,19,np.nan,5,9,93,11,4,1,2,3,23,12,89],
        "date": ['31/12/2020','11/02/1980','12/03/2015','13/03/2015','24/01/2010','31/12/2003','11/02/2001','12/03/2001','17/03/2015','24/01/2001','31/12/2000','11/02/2003','12/03/2003','13/03/2003','24/01/2003','31/12/2005','11/02/2004','12/03/2004','13/03/2004','24/01/2004','13/05/2015','13/08/2015']}
df = pd.DataFrame(data = data)
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by='name')

It'd be a lot easier to read the df if you sorted by name and date. You could probably get rid of the irrelevant names too, for the sake of example, or at least some of them. — wjandrea, Dec 08 '22 at 18:48
It'd help to provide a [reproducible pandas example](/q/20109391/4518341). I had to do a bit of work to get this into a usable state: `pd.read_csv(..., sep=' ', skipinitialspace=True, dtype={'score': 'Int64'}, parse_dates=['date'], dayfirst=True)` — wjandrea, Dec 08 '22 at 18:50
Yes @wjandrea, it's hard to reproduce the table while testing. I've been messing around with the dates and names for the past 20 minutes. — Bhargav - Retarded Skills, Dec 08 '22 at 18:51

score 1 · Answer 1 · answered Dec 08 '22 at 18:47

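Assuming df is the same dataframe as given in the question above. The rows are sorted by name and date so that each person's closest dates sit in the adjacent rows; only those two adjacent rows are checked, so a nan whose neighbours are also nan stays nan.

df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y') # Create a datetime out of date column
df = df.sort_values(['name', 'date']).reset_index(drop=True) # Sort by name and date so each person's rows are adjacent and in time order
scores = df['score'].copy() # Original scores, so a filled value is never reused as a source
nan_index = df.index[df['score'].isna()].to_list() # Get all index values where value is NaN
for idx in nan_index: # Loop over all NaN values and check condition
    best_gap = pd.Timedelta(days=365) # Largest time difference allowed
    for nb in (idx-1, idx+1): # Check the previous and the next row
        if nb < 0 or nb >= len(df) or df.at[nb, 'name'] != df.at[idx, 'name']: # Skip rows that do not exist or belong to someone else
            continue
        gap = abs(df.at[idx, 'date'] - df.at[nb, 'date'])
        if pd.notna(scores[nb]) and gap <= best_gap: # Keep the closest non-NaN score within 365 days
            best_gap = gap
            df.at[idx, 'score'] = scores[nb]
print(df)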

score 0 · Answer 2 · answered Dec 08 '22 at 19:06

First sort the values by name and date and extract the year from the date column. Then group by year and back-fill the missing scores.

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['name'])
df = df.sort_values(['date'])
df['year'] = pd.DatetimeIndex(df['date']).year
print(df.groupby('year').bfill())
df.sort_index(inplace=True)

This gives:

     name  score       date
0   Alice   22.0 2020-12-31
1     Bob   45.0 1980-11-02
2   Chuck    4.0 2015-12-03
3   Daren    5.0 2015-03-13
4   Elisa   16.0 2010-01-24
5   Alice    1.0 2003-12-31
6     Bob    8.0 2001-11-02
7   Chuck    9.0 2001-12-03
8   Daren    5.0 2015-03-17
9   Elisa   19.0 2001-01-24
10  Alice    NaN 2000-12-31
11    Bob    5.0 2003-11-02
12  Chuck    9.0 2003-12-03
13  Daren   93.0 2003-03-13
14  Elisa   11.0 2003-01-24
15  Alice    4.0 2005-12-31
16    Bob    1.0 2004-11-02
17  Chuck    2.0 2004-12-03
18  Daren    3.0 2004-03-13
19  Elisa   23.0 2004-01-24
20  Daren   12.0 2015-05-13
21  Daren   89.0 2015-08-13

Bhargav - Retarded Skills

Thanks for the great answer. However, doesn't this just look at the nearest date within the year? Rather than looking for the nearest date _within_ +/- 365 days. – You_Donut Dec 08 '22 at 19:25
Ah, I only checked whether the output matches the case above. If you want +/- 365 days, group by the entire date instead. – Bhargav - Retarded Skills Dec 08 '22 at 19:28
Would you be able to modify the code above? Sorry I still don't understand – You_Donut Dec 08 '22 at 19:31
I guess we are sorting by date first and then extracting the year in my code. It should satisfy looking for the nearest date within +/- 365 days. The logic I've given fits the case you mentioned above. – Bhargav - Retarded Skills Dec 08 '22 at 19:34
Thank you for your response. Would you be able to modify the code so that myself and others viewing the thread could see? – You_Donut Dec 08 '22 at 19:43
Have you tried the code above? Does it have any issue? I guess there is no need to edit anything. The logic should work for the case mentioned above. – Bhargav - Retarded Skills Dec 08 '22 at 19:45
Did it work? Does it require any modifications? – Bhargav - Retarded Skills Dec 12 '22 at 15:51
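
To illustrate the distinction raised in the comments above, here is a minimal sketch with two made-up dates (they are not taken from the question's data). Grouping by calendar year tests whether two dates share a year, which is not the same as testing whether they lie within 365 days of each other:

```python
import pandas as pd

# Two hypothetical dates, three days apart, on either side of New Year.
a = pd.Timestamp("2014-12-30")
b = pd.Timestamp("2015-01-02")

# What grouping on a 'year' column effectively compares.
same_calendar_year = a.year == b.year

# What the question asks for: an absolute gap of at most 365 days.
within_365_days = abs(b - a) <= pd.Timedelta(days=365)

print(same_calendar_year, within_365_days)  # False True
```

So a score recorded a few days away but in the neighbouring year would be missed by a per-year grouping, even though it satisfies the 365-day condition.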

score 0 · Answer 3 · answered Dec 08 '22 at 20:20

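data = df.sort_values(by=['name', 'date'])

# Attach each person's next and previous score/date to every row
grouped = data.groupby('name')[['score', 'date']]
data = pd.concat([data,
           grouped.shift(-1).rename(columns={'score': 'score_next', 'date': 'date_next'}),
           grouped.shift(1).rename(columns={'score': 'score_prev', 'date': 'date_prev'})
           ],
          axis=1)

def setter(x, max_days=365):
    # Days to the previous and next record; inf when there is none or its score is NaN
    r = x['date'] - x['date_prev']
    r = float('inf') if pd.isna(r) or pd.isna(x['score_prev']) else abs(r.days)
    v = x['date_next'] - x['date']
    v = float('inf') if pd.isna(v) or pd.isna(x['score_next']) else abs(v.days)
    if min(r, v) > max_days:
        return np.nan  # No usable score within the allowed window
    return x['score_prev'] if r <= v else x['score_next']

# Assign back instead of using inplace=True on an attribute access
df['score'] = df['score'].fillna(data[data['score'].isna()].apply(setter, axis=1))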
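
For comparison, the rule from the question (closest score of the same person, at most 365 days away) can also be sketched with `pd.merge_asof`, using its `by`, `direction="nearest"` and `tolerance` parameters. This is only a sketch under some assumptions: the `date` column is already a datetime column, the index is unique, and the function name `fill_from_closest`, its `max_gap` parameter, the helper column `row` and the small `demo` frame are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

def fill_from_closest(df, max_gap="365D"):
    """Fill NaN scores with the same person's closest-in-time score within max_gap."""
    out = df.copy()
    # merge_asof requires both sides to be sorted by the 'on' key.
    known = out.dropna(subset=["score"]).sort_values("date")
    missing = out[out["score"].isna()].sort_values("date")
    if known.empty or missing.empty:
        return out
    # Carry the original index label along so matches can be written back.
    left = missing[["name", "date"]].assign(row=missing.index)
    right = known[["name", "date", "score"]]
    matched = pd.merge_asof(
        left, right, on="date", by="name",
        direction="nearest", tolerance=pd.Timedelta(max_gap),
    )
    # Rows without a match inside the tolerance keep NaN.
    out.loc[matched["row"].to_numpy(), "score"] = matched["score"].to_numpy()
    return out

# Hypothetical subset of the question's data (dates typed in ISO format).
demo = pd.DataFrame({
    "name": ["Daren", "Daren", "Daren", "Alice", "Alice"],
    "score": [np.nan, 5.0, 12.0, np.nan, 1.0],
    "date": pd.to_datetime(["2015-03-13", "2015-03-17", "2015-05-13",
                            "2000-12-31", "2003-12-31"]),
})
filled = fill_from_closest(demo)
print(filled)
```

In this demo, Daren's missing score is filled with 5.0 (four days away), while Alice's stays NaN because her only other score is three years away. Unlike checking only adjacent rows, this matches against every known score of the same person.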
