
I have a very large DataFrame (10M+ records) and I am trying to perform a transformation on a datetime column for each Sku/Store combination.

Here's my current working (but not scalable) version:

# For each Sku/Store combination, replace the group's dt column with a
# continuous daily range ending at the group's last observed date.
for sku in sales_inv.Sku.unique():
    for store in sales_inv[sales_inv.Sku == sku].Location.unique():
        temp = sales_inv.loc[(sales_inv.Location == store) & (sales_inv.Sku == sku)]
        temp.loc[:, 'dt'] = pd.date_range(end=temp.dt.max(), periods=temp.shape[0])

The reason I need this transformation is that there are missing dates, and I simply want to fill them by replacing the entire dt Series with a continuous datetime array ending at the last observed date for each Sku/Store group. The validity of the dates is not important - i.e. I don't need each row to keep its actual date.
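
To make the intent concrete, here is a minimal sketch for a single hypothetical Sku/Store group (the column name dt matches the question; the dates and the gap are made up for illustration):

import pandas as pd

# one hypothetical Sku/Store group: 4 observations, with 2017-01-30 missing
grp = pd.DataFrame({'dt': pd.to_datetime(['2017-01-28', '2017-01-29',
                                          '2017-01-31', '2017-02-01'])})

# replace the whole column with a continuous daily range that ends at the
# group's last observed date and has one entry per existing row
grp['dt'] = pd.date_range(end=grp['dt'].max(), periods=len(grp))
# grp['dt'] is now 2017-01-29 .. 2017-02-01, i.e. consecutive days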

I think pd.DataFrame.groupby().apply() could be used here but I have not been successful yet. I tried using the approach from:

Apply multiple functions to multiple groupby columns

I tried two approaches:

pad_dates = lambda x: pd.date_range(end=x.max(), periods=x.size)
sales_inv.groupby(all_cols_but_dt).apply(pad_dates)

as well as

f = {'dt': pad_dates}
sales_inv.groupby(all_cols_but_dt).apply(f)

Without any luck. I'm looking for the fastest way to achieve the same result as the for loop. Any help is really appreciated.

EDIT:

Example

import datetime as dt
import pandas as pd

n = 5
d1 = {'Sku': ['one'] * n,
      'Location': ['loc1'] * n,
      'dt': pd.date_range(end=dt.datetime.now().date(), periods=n),
      'on_hand': [1] * n,
      'sales': [2] * n}
d2 = {'Sku': ['two'] * n,
      'Location': ['loc2'] * n,
      'dt': pd.date_range(end=dt.datetime.now().date(), periods=n),
      'on_hand': [2] * n,
      'sales': [4] * n}
# drop one row from each group to create the missing dates
df = pd.DataFrame(d1).drop(3, axis=0).append(pd.DataFrame(d2).drop(4, axis=0))

The correct result should look like:

n = 4
# assign d1 and d2 using new 'n'
df = pd.DataFrame(d1).append(pd.DataFrame(d2))

Thanks

– Alek Liskov

2 Answers


Is that what you want?

In [62]: dt_rng = pd.date_range(df['dt'].min(), df['dt'].max())

In [63]: df.groupby('Sku') \
           .apply(lambda x: x.set_index('dt').reindex(dt_rng).ffill()) \
           .reset_index('Sku', drop=True)
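
For reference, a rough usage sketch of the lines above, applied to the example df from the question's EDIT (note that this adds rows for the missing dates and forward-fills the other columns, rather than relabelling the existing rows):

dt_rng = pd.date_range(df['dt'].min(), df['dt'].max())
filled = (df.groupby('Sku')
            .apply(lambda x: x.set_index('dt').reindex(dt_rng).ffill())
            .reset_index('Sku', drop=True))
# 'filled' has one row per date in dt_rng for every Sku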

EDIT:

Correct answer:

Warning: kind of a hacky workaround, but it uses apply and thus runs in under 30 seconds for a DataFrame of this size.

cols = df.columns
# Give each Sku/Location group a continuous date index ending at the group's
# last observed date, then restore the original column layout.
df = df.groupby(['Sku','Location']) \
       .apply(lambda x: x.set_index(pd.date_range(end=x.dt.max(), periods=x.shape[0]))) \
       .drop(['Sku','Location','dt'], axis=1)
df = df.reset_index()
df.columns = cols

Result:

df

Out[59]: 
  Location   Sku         dt  on_hand  sales
0      one  loc1 2017-01-30        1      2
1      one  loc1 2017-01-31        1      2
2      one  loc1 2017-02-01        1      2
3      one  loc1 2017-02-02        1      2
4      two  loc2 2017-01-29        2      4
5      two  loc2 2017-01-30        2      4
6      two  loc2 2017-01-31        2      4
7      two  loc2 2017-02-01        2      4
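
As a quick sanity check (a rough sketch, not part of the original run), each Sku/Location group in the result should now carry strictly consecutive dates:

# every within-group step in dt should be exactly one day
ok = (df.groupby(['Sku', 'Location'])['dt']
        .apply(lambda s: (s.diff().dropna() == pd.Timedelta(days=1)).all())
        .all())
print(ok)  # expected: True
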
– MaxU - stand with Ukraine

If all you want to do is fill the index with missing dates, it is straightforward with reindex:

import pandas as pd

# build an index with a gap: 2017-01-04 and 2017-01-05 are missing
idx = pd.date_range('01.01.2017', '01.10.2017')
idx_missing = idx[0:3].union(idx[5:])
vals = range(len(idx_missing))
df = pd.DataFrame(index=idx_missing, data=vals)
df
>>>
            0
2017-01-01  0
2017-01-02  1
2017-01-03  2
2017-01-06  3
2017-01-07  4
2017-01-08  5
2017-01-09  6
2017-01-10  7

df = df.reindex(idx, fill_value=999)
df
>>>
              0
2017-01-01    0
2017-01-02    1
2017-01-03    2
2017-01-04  999
2017-01-05  999
2017-01-06    3
2017-01-07    4
2017-01-08    5
2017-01-09    6
2017-01-10    7
– andrew
    I believe OP needs *Sku* and *Store* mapped as well with missing dates. – Parfait Feb 02 '17 at 01:18
  • Thank you for the suggestion but it doesn't achieve what's needed. The missing dates are information I want to preserve. If the **timedelta** between the max and min dates is, say, 10 days but there are only 7 rows (thus 3 missing), I would like a **datetime** array with a starting date of t=4 and an ending date of t=10. I hope this helps. Thanks again – Alek Liskov Feb 02 '17 at 02:42