0

My current problem goes as follows. Consider the dataframe:

       name     score   date
0      Alice    22.0    2020-12-31
15     Alice    4.0     2005-12-31
5      Alice    1.0     2003-12-31
10     Alice    NaN     2000-12-31
6      Bob      8.0     2001-11-02
16     Bob      1.0     2004-11-02
11     Bob      5.0     2003-11-02
1      Bob      45.0    1980-11-02
12     Chuck    9.0     2003-12-03
2      Chuck    4.0     2015-12-03
7      Chuck    9.0     2001-12-03
17     Chuck    2.0     2004-12-03
18     Daren    3.0     2004-03-13
21     Daren    89.0    2015-08-13
3      Daren    NaN     2015-03-13
20     Daren    12.0    2015-05-13
8      Daren    5.0     2015-03-17
13     Daren    93.0    2003-03-13
14     Elisa    11.0    2003-01-24
9      Elisa    19.0    2001-01-24
19     Elisa    23.0    2004-01-24
4      Elisa    16.0    2010-01-24

Some score values are NaN. I would like to replace each of them with the same person's closest score in time, provided that the time difference is at most 1 year, i.e. 365 days. If there is no such score, the value should stay NaN.

In the dataframe above, Daren's NaN score would be replaced with 5, because that score was recorded within a year of the NaN date and has the smallest time difference of all his scores. For Alice, however, no score was recorded within a year of her NaN score, so it stays NaN.
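As a quick check of the two cases described above, the gaps can be computed directly. This is only a sketch: the dates are copied from the table, and the variable names are made up for illustration.

```python
import pandas as pd

# Dates copied from the table above (Daren's and Alice's rows)
daren_nan = pd.Timestamp("2015-03-13")
daren_known = pd.to_datetime(
    ["2004-03-13", "2015-08-13", "2015-05-13", "2015-03-17", "2003-03-13"]
)
alice_nan = pd.Timestamp("2000-12-31")
alice_known = pd.to_datetime(["2020-12-31", "2005-12-31", "2003-12-31"])

# Smallest absolute distance, in days, from the NaN date to any known score
daren_gap = pd.Series(daren_known - daren_nan).abs().min().days
alice_gap = pd.Series(alice_known - alice_nan).abs().min().days

print(daren_gap, daren_gap <= 365)  # Daren: closest score is within a year
print(alice_gap, alice_gap <= 365)  # Alice: closest score is too far away
```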

To reproduce the dataframe, use this code:

import numpy as np
import pandas as pd
import datetime

data = {"name": ['Alice', 'Bob', 'Chuck', 'Daren', 'Elisa', 'Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Alice', 'Bob', 'Chuck', 'Daren', 'Elisa','Daren','Daren'],
        "score": [22,45,4,np.nan,16,1,8,9,5,19,np.nan,5,9,93,11,4,1,2,3,23,12,89],
        "date": ['31/12/2020','11/02/1980','12/03/2015','13/03/2015','24/01/2010','31/12/2003','11/02/2001','12/03/2001','17/03/2015','24/01/2001','31/12/2000','11/02/2003','12/03/2003','13/03/2003','24/01/2003','31/12/2005','11/02/2004','12/03/2004','13/03/2004','24/01/2004','13/05/2015','13/08/2015']}
df = pd.DataFrame(data = data)
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by='name')

You_Donut
  • 2
    It'd be a lot easier to read the df if you sorted by name and date. You could probably get rid of the irrelevant names too, for the sake of example, or at least some of them. – wjandrea Dec 08 '22 at 18:48
  • 1
    It'd help to provide a [reproducible pandas example](/q/20109391/4518341). I had to do a bit of work to get this into a usable state: `pd.read_csv(..., sep=' ', skipinitialspace=True, dtype={'score': 'Int64'}, parse_dates=['date'], dayfirst=True)` – wjandrea Dec 08 '22 at 18:50
  • Yes @wjandrea, it's hard to reproduce the table while testing. I've been messing around with dates and names for the past 20 minutes. – Bhargav - Retarded Skills Dec 08 '22 at 18:51
  • Noted. Please see the code attached. Hope this helps. – You_Donut Dec 08 '22 at 18:53

3 Answers

1

Assuming df is the same dataframe as given in the question above:

df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y') # Create a datetime out of the date column
df = df.sort_values(['name', 'date']).reset_index(drop=True) # Order each person's rows by date
known = df.dropna(subset=['score']) # Rows whose score can be used as a replacement
for idx in df.index[df['score'].isna()]: # Loop over all NaN rows and check the condition
    same = known[known['name'] == df.at[idx, 'name']] # Known scores of the same person
    if same.empty: # No score available for this person, so keep the NaN
        continue
    gaps = (same['date'] - df.at[idx, 'date']).abs() # Time difference to every known score
    if gaps.min() <= pd.Timedelta(days=365): # Check if the closest score is within 365 days
        df.at[idx, 'score'] = same.at[gaps.idxmin(), 'score'] # Replace NaN with the closest score
print(df)
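The same rule can also be sketched without an explicit loop, using `pd.merge_asof` with `direction="nearest"` and a `tolerance`. This is an alternative to the loop, not part of it. The function name `fill_nearest_within`, the helper column `row`, and the small demo frame are made up for illustration, and the sketch assumes the index labels are unique.

```python
import numpy as np
import pandas as pd

def fill_nearest_within(df, max_gap=pd.Timedelta(days=365)):
    """Fill NaN scores with the same person's nearest-in-time score, if within max_gap."""
    out = df.copy()
    # Rows to fill, keeping their original index label in a helper column "row"
    missing = (out.loc[out["score"].isna(), ["name", "date"]]
               .rename_axis("row").reset_index().sort_values("date"))
    # Rows that can donate a score; merge_asof needs both sides sorted by the "on" key
    known = out.dropna(subset=["score"])[["name", "date", "score"]].sort_values("date")
    if missing.empty or known.empty:
        return out
    matched = pd.merge_asof(
        missing, known,
        on="date", by="name",          # match on date, only within the same name
        direction="nearest",           # look both backwards and forwards in time
        tolerance=max_gap,             # no match beyond this gap, leaving NaN
    )
    out.loc[matched["row"], "score"] = matched["score"].to_numpy()
    return out

# Small demo frame using a few rows from the question
demo = pd.DataFrame({
    "name": ["Alice", "Alice", "Daren", "Daren", "Daren"],
    "score": [np.nan, 1.0, np.nan, 5.0, 12.0],
    "date": pd.to_datetime(["2000-12-31", "2003-12-31", "2015-03-13",
                            "2015-03-17", "2015-05-13"]),
})
result = fill_nearest_within(demo)
print(result)
```

On this demo frame, Daren's missing score picks up the value recorded four days later, while Alice's stays NaN because her nearest score is three years away.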
Pranav Shetty
0

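Sort the rows by name and date, extract the year from the date column, then back-fill the scores within each name and year group. Note that grouping by calendar year only approximates the 365-day window.

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['name', 'date'])  # order each person's rows by date
year = df['date'].dt.year.rename('year')  # calendar year of each row
df['score'] = df.groupby(['name', year])['score'].bfill()  # back-fill within each (name, year) group
df.sort_index(inplace=True)
print(df)

This gives: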

     name  score       date
0   Alice   22.0 2020-12-31
1     Bob   45.0 1980-11-02
2   Chuck    4.0 2015-12-03
3   Daren    5.0 2015-03-13
4   Elisa   16.0 2010-01-24
5   Alice    1.0 2003-12-31
6     Bob    8.0 2001-11-02
7   Chuck    9.0 2001-12-03
8   Daren    5.0 2015-03-17
9   Elisa   19.0 2001-01-24
10  Alice    NaN 2000-12-31
11    Bob    5.0 2003-11-02
12  Chuck    9.0 2003-12-03
13  Daren   93.0 2003-03-13
14  Elisa   11.0 2003-01-24
15  Alice    4.0 2005-12-31
16    Bob    1.0 2004-11-02
17  Chuck    2.0 2004-12-03
18  Daren    3.0 2004-03-13
19  Elisa   23.0 2004-01-24
20  Daren   12.0 2015-05-13
21  Daren   89.0 2015-08-13
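Back-filling within calendar-year groups and picking the nearest score within 365 days can disagree. The sketch below shows two such cases on made-up data: the name `X`, the dates and the scores are hypothetical and are not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for one person, sorted by date
demo = pd.DataFrame({
    "name": ["X", "X", "X", "X"],
    "score": [7.0, np.nan, 9.0, np.nan],
    "date": pd.to_datetime(["2015-03-01", "2015-03-02", "2015-12-01", "2014-12-30"]),
}).sort_values("date")

# Back-fill within each (name, calendar year) group
year = demo["date"].dt.year.rename("year")
by_year = demo.groupby(["name", year])["score"].bfill()

# Row 3 (2014-12-30) is alone in its year, so it stays NaN even though
# a score exists about two months later, in early 2015.
print(by_year.loc[3])

# Row 1 (2015-03-02) receives the next score in the year (9.0 from December),
# although the nearest score in time is 7.0 from the day before.
print(by_year.loc[1])
```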
0
data = df.sort_values(by=['name', 'date'])
grouped = data.groupby('name')[['score', 'date']]  # shift within each person only

data = pd.concat([data,
           grouped.shift(-1).rename(columns={'score': 'score_next', 'date': 'date_next'}),
           grouped.shift(1).rename(columns={'score': 'score_prev', 'date': 'date_prev'})
           ],
          axis=1)

def setter(x):
    # Gap in days to the previous and next rows; infinite when that neighbour has no score
    r = abs((x['date'] - x['date_prev']).days) if pd.notna(x['score_prev']) else float('inf')
    v = abs((x['date_next'] - x['date']).days) if pd.notna(x['score_next']) else float('inf')
    if min(r, v) > 365:
        return np.nan  # no neighbouring score within one year, keep NaN
    return x['score_prev'] if r <= v else x['score_next']

# Only the rows directly before and after each NaN row are considered
df['score'] = df['score'].fillna(data[data.score.isna()].apply(setter, axis=1))
MoRe