I have a very large DataFrame (10M+ records) and I am trying to perform a transformation on a datetime column for each Sku/Store combination.
Here's my current working (but not scalable) version:
for sku in sales_inv.Sku.unique():
    for store in sales_inv[sales_inv.Sku == sku].Location.unique():
        mask = (sales_inv.Location == store) & (sales_inv.Sku == sku)
        # Overwrite the group's dates with a continuous range ending at its last date
        sales_inv.loc[mask, 'dt'] = pd.date_range(
            end=sales_inv.loc[mask, 'dt'].max(), periods=mask.sum())
The reason I need this transformation is that there are missing dates, and I want to fill them by simply replacing each group's entire dt Series with a continuous datetime array ending at the last observed date for that Sku/Store group. The validity of the data is not important, i.e. I don't need the dates to match the actual observation dates.
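To make the intended transformation concrete, here is a toy illustration (made-up dates) of what happens to a single group:

import pandas as pd

# Hypothetical group with a gap: Jan 3 and Jan 4 are missing
observed = pd.Series(pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05']))

# Replacement: same number of rows, continuous, ending at the last observed date.
# Result: 2023-01-03, 2023-01-04, 2023-01-05 (earlier dates shift, which is fine here)
dense = pd.date_range(end=observed.max(), periods=observed.size)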
I think pd.DataFrame.groupby().apply() could be used here, but I have not been successful yet. I tried using the approach from:
Apply multiple functions to multiple groupby columns
I tried two approaches:
pad_dates = lambda x: pd.date_range(end=x.max(), periods=x.size)
sales_inv.groupby(all_cols_but_dt).apply(pad_dates)
as well as
f = {'dt': pad_dates}
sales_inv.groupby(all_cols_but_dt).apply(f)
No luck so far. I'm looking for the fastest way to achieve the same result as the for loop. Any help is really appreciated.
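From reading the groupby docs, I suspect the right tool is transform rather than apply. A sketch of what I have in mind (untested at 10M rows; it assumes rows within each Sku/Location group are already sorted by dt):

import pandas as pd

# Option 1: transform with a per-group lambda; the returned date range has the
# same length as the group, so transform aligns it back to the original rows.
sales_inv['dt'] = sales_inv.groupby(['Sku', 'Location'])['dt'].transform(
    lambda x: pd.date_range(end=x.max(), periods=x.size))

# Option 2: fully vectorized, no per-group lambda. Count backwards from each
# group's last row and subtract that many days from the group's max date.
last = sales_inv.groupby(['Sku', 'Location'])['dt'].transform('max')
offset = sales_inv.groupby(['Sku', 'Location']).cumcount(ascending=False)
sales_inv['dt'] = last - pd.to_timedelta(offset, unit='D')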
EDIT:
Example
import datetime as dt
import pandas as pd

n = 5
d1 = {'Sku': ['one'] * n,
      'Location': ['loc1'] * n,
      'dt': pd.date_range(end=dt.datetime.now().date(), periods=n),
      'on_hand': [1] * n,
      'sales': [2] * n}
d2 = {'Sku': ['two'] * n,
      'Location': ['loc2'] * n,
      'dt': pd.date_range(end=dt.datetime.now().date(), periods=n),
      'on_hand': [2] * n,
      'sales': [4] * n}
df = pd.concat([pd.DataFrame(d1).drop(3, axis=0),
                pd.DataFrame(d2).drop(4, axis=0)])
The correct result should look like:
n = 4
# rebuild d1 and d2 as above using the new 'n'
df = pd.concat([pd.DataFrame(d1), pd.DataFrame(d2)])
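A quick way to verify a candidate solution against this expectation:

# After transforming the example df, each group should have 4 consecutive
# days ending at that group's original last date.
print(df.groupby(['Sku', 'Location'])['dt'].agg(['min', 'max', 'size']))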
Thanks