pandas dataframe interpolate for Nans with groupby using window of discrete days of the year

Question

The small reproducible example below sets up a dataframe that is 100 yrs in length containing some randomly generated values. It then inserts 3 100-day stretches of missing values. Using this small example, I am attempting to sort out the pandas commands that will fill in the missing days using average values for that day of the year (hence the use of .groupby) with a condition. For example, if April 12th is missing, how can the last line of code be altered such that only the 10 nearest April 12th's are used to fill in the missing value? In other words, a missing April 12th value in 1920 would be filled in using the mean April 12th values between 1915 to 1925; a missing April 12th value in 2000 would be filled in with the mean April 12th values between 1995 to 2005, etc. I tried playing around with adding a .rolling() to the lambda function in last line of script, but was unsuccessful in my attempt.

Bonus question: The example below extends from 1918 to 2018. If a value is missing on April 12th 1919, for example, it would still be nice if ten April 12ths were used to fill in the missing value even though the window couldn't be 'centered' on the missing day because of its proximity to the beginning of the time series. Is there a solution to the first question above that would be flexible enough to still use a minimum of 10 values when missing values are close to the beginning and ending of the time series?

import pandas as pd
import numpy as np
import random

# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31").strftime("%Y-%m-%d")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100

# Create dataframe
df = pd.DataFrame(dict(                                                   
    list(                                                                 
        zip(["Date", "vals"],                                              
            [dates, vals])                                 
        )                                                                 
))  

# confirm missing vals
df.iloc[95:105]
df.iloc[35890:35900]

# set a date index (for use by groupby)
df.index = pd.DatetimeIndex(df['Date'])
df['Date'] = df.index

# Need help restricting the mean to the 10 nearest same-days-of-the-year:
df['vals'] = df.groupby([df.index.month, df.index.day])['vals'].transform(lambda x: x.fillna(x.mean()))

if your NA is not too many, ie. not worrying speed, try this https://stackoverflow.com/questions/27842613/pandas-groupby-sort-within-groups basically loop thru all NAN and sort (by distance of year) and get head 10 — Bing Wang, Feb 17 '21 at 04:13

score 1 · Accepted Answer · answered Feb 17 '21 at 11:05

This answers both parts

build a DF dfr that is the calculation you want
lambda function returns a dict {year:val, ...}
make sure indexes are named in reasonable way
expand out dict with apply(pd.Series)
reshape by putting year columns back into index
merge() built DF with original DF. vals column contains NaN 0 column is value to fill
finally fillna()

# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100

# Create dataframe - simplified from question...
df = pd.DataFrame({"Date":dates,"vals":vals})  

df[df.isna().any(axis=1)]

ystart = df.Date.dt.year.min()
# generate rolling means for month/day.  bfill for when it's start of series
dfr = (df.groupby([df.Date.dt.month, df.Date.dt.day])["vals"]
 .agg(lambda s: {y+ystart:v for y,v in enumerate(s.dropna().rolling(5).mean().bfill())})
 .to_frame().rename_axis(["month","day"])
)

# expand dict into columns and reshape to by indexed by month,day,year
dfr = dfr.join(dfr.vals.apply(pd.Series)).drop(columns="vals").rename_axis("year",axis=1).stack().to_frame()

# get df index back, plus vals & fillna (column 0) can be seen alongside each other
dfm = df.merge(dfr, left_on=[df.Date.dt.month,df.Date.dt.day,df.Date.dt.year], right_index=True)

# finally what we really want to do - fill tha NaNs
df.fillna(dfm[0])

analysis

taking NaN for 11-Apr-1918, default is 22 as it's backfilled from 1921
(12+2+47+47+2)/5 == 22

dfm.query("key_0==4 & key_1==11").head(7)

	key_0	key_1	key_2	Date	vals	0
100	4	11	1918	1918-04-11 00:00:00	nan	22
465	4	11	1919	1919-04-11 00:00:00	12	22
831	4	11	1920	1920-04-11 00:00:00	2	22
1196	4	11	1921	1921-04-11 00:00:00	47	27
1561	4	11	1922	1922-04-11 00:00:00	47	36
1926	4	11	1923	1923-04-11 00:00:00	2	34.6
2292	4	11	1924	1924-04-11 00:00:00	37	29.4

score 0 · Answer 2 · answered Feb 17 '21 at 08:56

I'm not sure how far I've gotten with the intent of your question. The approach I've taken is to satisfy two requirements

Need an arbitrary number of averages
Use those averages to fill in the NA

I have addressed the Simply put, instead of filling in the NA with before and after dates, I fill in the NA with averages extracted from any number of years in a row.

import pandas as pd
import numpy as np
import random

# create 100 yr time series
dates = pd.date_range(start="1918-01-01", end="2018-12-31").strftime("%Y-%m-%d")
vals = [random.randrange(1, 50, 1) for i in range(len(dates))]
# Create some arbitrary gaps
vals[100:200] = vals[9962:10062] = vals[35895:35995] = [np.nan] * 100

# Create dataframe
df = pd.DataFrame(dict(                                                   
    list(                                                                 
        zip(["Date", "vals"],                                              
            [dates, vals])                                 
        )                                                                 
))  

df['Date'] = pd.to_datetime(df['Date'])
df['mm-dd'] = df['Date'].apply(lambda x:'{:02}-{:02}'.format(x.month, x.day))
df['yyyy'] = df['Date'].apply(lambda x:'{:04}'.format(x.year))
df = df.iloc[:,1:].pivot(index='mm-dd', columns='yyyy')
df.columns = df.columns.droplevel(0)
df['nans'] = df.isnull().sum(axis=1)
df['10n_mean'] = df.iloc[:,:-1].sample(n=10, axis=1).mean(axis=1)
df['10n_mean'] = df['10n_mean'].round(1)

df.loc[df['nans'] >= 1]
yyyy    1918    1919    1920    1921    1922    1923    1924    1925    1926    1927    ...     2011    2012    2013    2014    2015    2016    2017    2018    nans    10n_mean
mm-dd                                                                                   
02-29   NaN     NaN     34.0    NaN     NaN     NaN     2.0     NaN     NaN     NaN     ...     NaN     49.0    NaN     NaN     NaN     32.0    NaN     NaN     76  21.6
04-11   NaN     43.0    12.0    28.0    29.0    28.0    1.0     38.0    11.0    3.0     ...     17.0    35.0    8.0     17.0    34.0    NaN     5.0     33.0    3   29.7
04-12   NaN     19.0    38.0    34.0    48.0    46.0    28.0    29.0    29.0    14.0    ...     41.0    16.0    9.0     39.0    8.0     NaN     1.0     12.0    3   21.3
04-13   NaN     33.0    26.0    47.0    21.0    26.0    20.0    16.0    11.0    7.0     ...     5.0     11.0    34.0    28.0    27.0    NaN     2.0     46.0    3   21.3
04-14   NaN     36.0    19.0    6.0     45.0    41.0    24.0    39.0    1.0     11.0    ...     30.0    47.0    45.0    14.0    48.0    NaN     16.0    8.0     3   24.7

df_mean = df.T.fillna(df['10n_mean'], downcast='infer').T
df_mean.loc[df_mean['nans'] >= 1]
yyyy    1918    1919    1920    1921    1922    1923    1924    1925    1926    1927    ...     2011    2012    2013    2014    2015    2016    2017    2018    nans    10n_mean
mm-dd                                                                                   
02-29   21.6    21.6    34.0    21.6    21.6    21.6    2.0     21.6    21.6    21.6    ...     21.6    49.0    21.6    21.6    21.6    32.0    21.6    21.6    76.0    21.6
04-11   29.7    43.0    12.0    28.0    29.0    28.0    1.0     38.0    11.0    3.0     ...     17.0    35.0    8.0     17.0    34.0    29.7    5.0     33.0    3.0     29.7
04-12   21.3    19.0    38.0    34.0    48.0    46.0    28.0    29.0    29.0    14.0    ...     41.0    16.0    9.0     39.0    8.0     21.3    1.0     12.0    3.0     21.3
04-13   21.3    33.0    26.0    47.0    21.0    26.0    20.0    16.0    11.0    7.0     ...     5.0     11.0    34.0    28.0    27.0    21.3    2.0     46.0    3.0     21.3
04-14   24.7    36.0    19.0    6.0     45.0    41.0    24.0    39.0    1.0     11.0    ...     30.0    47.0    45.0    14.0    48.0    24.7    16.0    8.0     3.0     24.7

pandas dataframe interpolate for Nans with groupby using window of discrete days of the year

2 Answers2

analysis