How to extrapolate missing values with groupby - Python?

Question

I have the following dataset:

import pandas as pd

data = {
  'date': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019'],
  'account_id': [1, 1, 1, 1, 2, 2, 2, 2],
  'value_1': [1, 2, 3, 4, 5, 6, 7, 8],
  'value_2': [1, 3, 6, 9, 10, 12, 14, 16]
}
# drop(columns=...) replaces the positional axis argument, which newer pandas no longer accepts
df = pd.DataFrame(data, index=data['date']).drop(columns='date')
df

What I need is to extrapolate value_1 and value_2 forward by 30 days for each account_id.

I came across Extrapolate Pandas DataFrame. It would work beautifully if there were no duplicated entries in the date column.

I thought of using something like the code below, but I don't understand how to pass v (each group) to the function:

def extrapolation(df):
    extend = 1
    y = pd.DataFrame(
        data=df,
        index=pd.date_range(
            start=df.index[0],
            periods=len(df.index) + extend
        )
    )
    #then, the extrapolation piece


df_out=df.head(0).copy()
for k,v in df.groupby('account_id'):
    df_out=pd.concat([df_out,extrapolation(df)])

Please please, include the sample data as **text** not picture, see [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). — Quang Hoang, Jun 11 '19 at 16:45
One more thing, how would you want to extrapolate each value for each id? Linear? quadratic, cubic, etc... — Quang Hoang, Jun 11 '19 at 17:06
Linear. I know that it is a bit of a naive approach but I want to start with linear. — eponkratova, Jun 11 '19 at 17:07

Quang Hoang · Accepted Answer · 2019-06-11T18:32:16.060

You can modify the linked answer as follows:

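import numpy as np
import pandas as pd

def extrapolate(df):
    # extend the group's daily index 30 days past its last date
    new_max = df.index.max() + pd.to_timedelta('30D')
    dates = pd.date_range(df.index.min(), new_max, freq='D')
    ret_df = df.reindex(dates)

    # x positions of the observed rows (assumes one row per day, no gaps)
    x = np.arange(len(df))

    # new x values
    new_x = pd.Series(np.arange(len(ret_df)), index=dates)

    for col in df.columns:
        fit = np.polyfit(x, df[col], 1)

        # transform and fill; assign back instead of a chained inplace fillna,
        # which may not update ret_df in newer pandas
        ret_df[col] = ret_df[col].fillna(fit[0]*new_x + fit[1])

    return ret_df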

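and then apply it per account. The function does date arithmetic on the index, so convert the index to datetimes first (this assumes the dates are month/day/year):

df.index = pd.to_datetime(df.index)
ext_cols = ['value_1', 'value_2']

df.groupby('account_id')[ext_cols].apply(extrapolate)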
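
Putting the pieces together, here is a self-contained sketch of the linear case on the question's sample data. The helper name `extrapolate_linear` and its `days` parameter are illustrative and not from the original answer, and the dates are assumed to be month/day/year with one row per day in each group.

```python
import numpy as np
import pandas as pd

data = {
    "date": ["1/1/2019", "1/2/2019", "1/3/2019", "1/4/2019"] * 2,
    "account_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "value_1": [1, 2, 3, 4, 5, 6, 7, 8],
    "value_2": [1, 3, 6, 9, 10, 12, 14, 16],
}
df = pd.DataFrame(data)
# Assumption: dates are month/day/year; a DatetimeIndex is needed for date arithmetic.
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")


def extrapolate_linear(group, days=30):
    """Extend one account's rows by `days` days using a per-column straight-line fit."""
    last = group.index.max() + pd.Timedelta(days=days)
    dates = pd.date_range(group.index.min(), last, freq="D")
    out = group.reindex(dates)  # new rows start out as NaN

    x = np.arange(len(group))   # positions of the observed rows
    new_x = np.arange(len(out)) # positions of observed plus future rows

    for col in group.columns:
        slope, intercept = np.polyfit(x, group[col], 1)
        fitted = pd.Series(slope * new_x + intercept, index=dates)
        out[col] = out[col].fillna(fitted)  # observed values are kept as-is
    return out


result = df.groupby("account_id")[["value_1", "value_2"]].apply(extrapolate_linear)
print(result.loc[1].head(6))
```

The result is indexed by account_id and date, with 34 rows per account (4 observed plus 30 extrapolated). Only the missing rows are filled, so the observed values are not replaced by the fitted line.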

You can also specify a polynomial order for each column:

poly_orders = [1, 2]
ext_cols = ['value_1', 'value_2']

def extrapolate(df):
    new_max = df.index.max() + pd.to_timedelta('30D')
    dates = pd.date_range(df.index.min(), new_max, freq='D')
    ret_df = df.reindex(dates)

    x = np.arange(len(df))

    # new x values
    new_x = pd.Series(np.arange(len(ret_df)), index=dates)

    for col, o in zip(ext_cols, poly_orders):
        fit = np.polyfit(x, df[col], o)

        # transform and fill: np.polyfit returns the highest power first,
        # so the coefficient of x**i is fit[o-i]
        new_vals = pd.Series(0.0, index=dates)

        # sum every term, including the constant (i = 0)
        for i in range(o+1):
            new_vals += new_x**i * fit[o-i]

        ret_df[col] = ret_df[col].fillna(new_vals)

    return ret_df

For more control over the inputs and outputs, you could use sklearn.linear_model.LinearRegression instead of numpy.polyfit.
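
A minimal sketch of what the scikit-learn variant might look like, assuming scikit-learn is installed. The helper name `extrapolate_sklearn`, its `days` parameter, and the small sample frame are illustrative and not part of the original answer. `LinearRegression` accepts a 2-D target, so all columns can be fitted in one call.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def extrapolate_sklearn(group, days=30):
    """Fit one straight line per column and fill the extended dates with predictions."""
    last = group.index.max() + pd.Timedelta(days=days)
    dates = pd.date_range(group.index.min(), last, freq="D")

    # scikit-learn expects a 2-D feature matrix: one column holding the row position
    x = np.arange(len(group)).reshape(-1, 1)
    new_x = np.arange(len(dates)).reshape(-1, 1)

    # a 2-D target fits every column at once (multi-output regression)
    model = LinearRegression().fit(x, group.to_numpy())
    predicted = pd.DataFrame(model.predict(new_x), index=dates, columns=group.columns)

    # keep observed values, fill only the new (NaN) rows
    return group.reindex(dates).fillna(predicted)


# Illustrative data: account 1 from the question, with a DatetimeIndex
idx = pd.date_range("2019-01-01", periods=4, freq="D")
group = pd.DataFrame({"value_1": [1, 2, 3, 4], "value_2": [1, 3, 6, 9]}, index=idx)
out = extrapolate_sklearn(group)
print(out.head(6))
```

It can be applied per account in the same way as the polyfit version, via groupby('account_id')[ext_cols].apply(extrapolate_sklearn).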

You are the God! Should new_vals = new_x**i * fit[o-i] be new_vals = new_x**i + fit[o-i]? One more question: if I need to cap the values, i.e. they cannot grow indefinitely, and let's assume I have two more columns, value_1.1 and value_2.1, with constant values, do I need to put ret_df[col].fillna(fit[0]*new_x + fit[1], inplace=True) into an if statement? — eponkratova, Jun 12 '19 at 00:38
If it doesn’t overflow, let it grow and use `clip` afterward. — Quang Hoang, Jun 12 '19 at 00:40
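
A minimal sketch of the `clip` suggestion, applied after extrapolating. The cap values and the small `extrapolated` frame are hypothetical and only illustrate the call.

```python
import pandas as pd

# Hypothetical extrapolated output for one account (last observed row plus three new days)
extrapolated = pd.DataFrame(
    {"value_1": [4.0, 5.0, 6.0, 7.0], "value_2": [9.0, 11.5, 14.2, 16.9]},
    index=pd.date_range("2019-01-04", periods=4, freq="D"),
)

# Hypothetical per-column upper limits
caps = {"value_1": 6.0, "value_2": 15.0}

capped = extrapolated.copy()
for col, cap in caps.items():
    # values above the cap are replaced by the cap; everything else is unchanged
    capped[col] = capped[col].clip(upper=cap)

print(capped)
```

Because the cap is applied to the finished result, the fitting code needs no if statement.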
