
I have a very large dataframe with multiple years of sales data and tens of thousands of skew_ids, e.g.:

     date      skew_id  units_sold
0 2001-01-01      123       1
1 2001-01-02      123       2
2 2001-01-03      123       3
3 2001-01-01      456       4
4 2001-01-02      456       5
...

I have another dataframe that maps each skew_id to its skew_price, e.g.:

  skew_id  skew_price
0   123       100.00
1   456       10.00
...

My first dataframe is so large that I cannot merge without hitting my memory limit.

I'd like to calculate the daily revenues, e.g.:

    date       revenue
0 2001-01-01      140      
1 2001-01-02      250       
2 2001-01-03      300       
...

2 Answers

I think it depends on the number of rows, the number of unique skew_id values, and the size of your RAM.

One possible solution with map:

# look up each row's price via the skew_id index -- no full merge needed
df1['revenue'] = df1['skew_id'].map(df2.set_index('skew_id')['skew_price']) * df1['units_sold']

# aggregate to daily totals (stored in a new variable so df2, the price table, is not overwritten)
daily = df1.groupby('date', as_index=False)['revenue'].sum()
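
A minimal, self-contained run of this approach on the sample data from the question (same frame names as the answer; the daily result is stored in a new variable so the price table is not clobbered):

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': pd.to_datetime(['2001-01-01', '2001-01-02', '2001-01-03',
                            '2001-01-01', '2001-01-02']),
    'skew_id': [123, 123, 123, 456, 456],
    'units_sold': [1, 2, 3, 4, 5],
})
df2 = pd.DataFrame({'skew_id': [123, 456], 'skew_price': [100.0, 10.0]})

# map() does a per-row lookup against the skew_id-indexed price Series
df1['revenue'] = df1['skew_id'].map(df2.set_index('skew_id')['skew_price']) * df1['units_sold']
daily = df1.groupby('date', as_index=False)['revenue'].sum()
print(daily)
#         date  revenue
# 0 2001-01-01    140.0
# 1 2001-01-02    250.0
# 2 2001-01-03    300.0
```

Because `map` builds only one extra column instead of materializing a merged copy of the large frame, peak memory stays close to the size of `df1` itself.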
jezrael
  • Is it less memory-consuming than `merge`? – Aryerez Nov 22 '19 at 13:42
  • @Aryerez - I think yes – jezrael Nov 22 '19 at 13:43
  • I hope the OP would comment if it did the job. I may use it one day myself. – Aryerez Nov 22 '19 at 13:46
  • How does this work for multiple columns as mapping objective? Say df1['id2'] = ['a', 'a', 'b', 'b', 'a'] and df2 would have two additional rows for each combination of skew_id and id2? Or should I post that as a separate question? [This question](https://stackoverflow.com/questions/51744786/pandas-dataframe-object-has-no-attribute-map) came close – Adriaan May 03 '21 at 13:28

You could achieve this with a groupby:

price = df2.set_index('skew_id')['skew_price']
# per-row revenue = units_sold * skew_price, summed within each date
df.groupby('date').apply(lambda gr: (gr['units_sold'] * gr['skew_id'].map(price)).sum())

Or, if you run into memory problems, you could loop over the dates yourself. This is slower but may need less memory, since only one day's rows are touched at a time.

price = df2.set_index('skew_id')['skew_price']
revenue = []
for d in df['date'].unique():
    day = df.loc[df['date'] == d]
    # per-row revenue = units_sold * skew_price, summed for this day
    revenue.append({'date': d, 'revenue': (day['units_sold'] * day['skew_id'].map(price)).sum()})
pd.DataFrame(revenue)
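
For reference, a sketch of the per-date loop on the question's sample data, assuming per-row revenue is `units_sold * skew_price` (which the question's expected output implies):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2001-01-01', '2001-01-02', '2001-01-03',
                            '2001-01-01', '2001-01-02']),
    'skew_id': [123, 123, 123, 456, 456],
    'units_sold': [1, 2, 3, 4, 5],
})
df2 = pd.DataFrame({'skew_id': [123, 456], 'skew_price': [100.0, 10.0]})

# price lookup Series, indexed by skew_id
price = df2.set_index('skew_id')['skew_price']

revenue = []
for d in df['date'].unique():          # unique() preserves first-appearance order
    day = df.loc[df['date'] == d]      # only this day's rows are in play
    revenue.append({'date': d,
                    'revenue': (day['units_sold'] * day['skew_id'].map(price)).sum()})
result = pd.DataFrame(revenue)
print(result)
#         date  revenue
# 0 2001-01-01    140.0
# 1 2001-01-02    250.0
# 2 2001-01-03    300.0
```

Each iteration allocates only one day's slice, so the extra memory is bounded by the busiest single day rather than the whole history.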
mjspier