
Similar to this post: Excel VLOOKUP equivalent in pandas.

I don't need the first value it comes across, but the n-th value.

Here is an example data set, with the desired output.

import pandas as pd
from pandas import DataFrame
from datetime import timedelta
data = {'date': ['2018-01-01','2018-01-01','2018-01-01','2018-01-02','2018-01-02',
                 '2018-01-03','2018-01-03','2018-01-03','2018-01-04','2018-01-04',
                 '2018-01-04','2018-01-05','2018-01-05','2018-01-05','2018-01-06',
                 '2018-01-06'], 
        'product': ['123a','123b','123c', '123a', '123b', '123a', '123b', '123c',
                    '123a', '123b', '123c', '123a', '123b','123c', '123a', '123c'],
        'orders': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],
        'desired_output': [0,0,0,0,0,0,0,0,1,2,3,4,5,0,6,8]}

df = DataFrame(data, columns = ['date', 'product', 'orders', 'desired_output'])
df.date = pd.to_datetime(df.date)
df['lag_date'] = df.date - timedelta(days=3)

Example for index 14: product 123a has lag_date 2018-01-03, so look in the date column for product 123a with date 2018-01-03 and take the matching orders value, hence 6; if there is no match, return 0.
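To make the rule concrete, this is what that single lookup looks like written out by hand (a throwaway illustration of the logic only, not something I want to run over 500k rows):

# one-off lookup for index 14: find the row whose date equals this row's
# lag_date for the same product, and take its orders value (0 if no match)
row = df.loc[14]
match = df[(df.date == row['lag_date']) & (df['product'] == row['product'])]
lag_orders = match['orders'].iloc[0] if len(match) else 0
print(lag_orders)  # 6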

Currently, I lag the date by 3 days, but I want this to be 'n'. I could use the original dates as the index, but then I would need to reindex the data set later (which is fine).

Is there a handy way to do this, instead of looping through all rows with a counter 'n' and taking the value once 'n' matches have been found? Since my data set has over 500k rows, that seems computationally too expensive for a pretty simple task.
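Parameterizing the lag itself is trivial, for what it's worth; a minimal sketch, where n stands for whatever number of days I need:

n = 3  # number of days to look back; meant to become a parameter
df['lag_date'] = df.date - timedelta(days=n)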

2 Answers


It's probably not the nicest solution, but it works:

Depending on how I set lag_date, I can pull in the data for that specific earlier date.

# first create new unique identifiers, based on the date + the product code
df.date = df.date.dt.strftime('%Y-%m-%d')  # convert to string for concatenating
df['vlook_date'] = df.date + df['product'].astype(str)
df.lag_date = df.lag_date.dt.strftime('%Y-%m-%d')
df['vlook_lagdate'] = df.lag_date + df['product'].astype(str)

# create new data frames to map them together
df1 = df.loc[:, df.columns != 'vlook_date']
df2 = df.loc[:, df.columns != 'vlook_lagdate']

# use map to look up the lagged orders in df2; unmatched keys become 0
df1['lag_orders'] = pd.to_numeric(
    df1.vlook_lagdate.map(df2.set_index('vlook_date')['orders']).fillna(0),
    downcast='integer')
df1 = df1.drop(['lag_date', 'vlook_lagdate'], axis=1)

If there are suggestions to clean this up, let me know ;)
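One way to avoid the concatenated string keys altogether (just a sketch, not benchmarked, and it assumes df still has the lag_date column built in the question) would be to merge on the (lag_date, product) pair directly:

# build a lookup table keyed by (date, product), renamed so the date column
# lines up with df's lag_date, then left-merge and fill missing matches with 0
lookup = df[['date', 'product', 'orders']].rename(
    columns={'date': 'lag_date', 'orders': 'lag_orders'})
result = df.merge(lookup, on=['lag_date', 'product'], how='left')
result['lag_orders'] = result['lag_orders'].fillna(0).astype(int)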


This is probably slower than your answer, since it loops, but possibly simpler to understand:

import numpy as np

# group by (date, product) so each lagged key can be looked up directly
grp = df.groupby(['date', 'product'])
desired_output = []
n, lag = 0, 3  # n = which match to take (0 = the first), lag = days to look back

for _, row in df.iterrows():
    try:
        # the n-th 'orders' value for this product, lag days earlier
        group = grp.get_group((row['date'] - timedelta(days=lag), row['product']))
        desired_output.append(group['orders'].iloc[n])
    except KeyError:
        # no row exists for that (date, product) combination
        desired_output.append(np.nan)

df['desired_output'] = desired_output

Output:

    date        product  orders  desired_output  lag_date
0   2018-01-01  123a     1       NaN             2017-12-29
1   2018-01-01  123b     2       NaN             2017-12-29
2   2018-01-01  123c     3       NaN             2017-12-29
3   2018-01-02  123a     4       NaN             2017-12-30
4   2018-01-02  123b     5       NaN             2017-12-30
5   2018-01-03  123a     6       NaN             2017-12-31
6   2018-01-03  123b     7       NaN             2017-12-31
7   2018-01-03  123c     8       NaN             2017-12-31
8   2018-01-04  123a     9       1.0             2018-01-01
9   2018-01-04  123b     10      2.0             2018-01-01
10  2018-01-04  123c     11      3.0             2018-01-01
11  2018-01-05  123a     12      4.0             2018-01-02
12  2018-01-05  123b     13      5.0             2018-01-02
13  2018-01-05  123c     14      NaN             2018-01-02
14  2018-01-06  123a     15      6.0             2018-01-03
15  2018-01-06  123c     16      8.0             2018-01-03
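If you want the 0s from the question's desired_output rather than NaN, a one-line follow-up (assuming integer output is what you're after) would be:

# replace the unmatched NaNs with 0 and cast back to integers
df['desired_output'] = df['desired_output'].fillna(0).astype(int)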