
I have a very large dataframe with multiple years of sales data and tens of thousands of skew_ids, e.g.:

     date      skew_id  units_sold
0 2001-01-01      123       1
1 2001-01-02      123       2
2 2001-01-03      123       3
3 2001-01-01      456       4
4 2001-01-02      456       5
...

I have another dataframe that maps each skew_id to its skew_price, e.g.:

  skew_id  skew_price
0   123       100.00
1   456       10.00
...

My first dataframe is so large that I cannot merge without hitting my memory limit.

I'd like to calculate the daily revenues, e.g.:

    date       revenue
0 2001-01-01      140      
1 2001-01-02      250       
2 2001-01-03      300       
...

2 Answers

I think it depends on the number of rows, the number of unique skew_id values, and the size of your RAM.

One possible solution with map:

# look up each row's price via the skew_id index -- no full merge needed
df1['revenue'] = df1['skew_id'].map(df2.set_index('skew_id')['skew_price']) * df1['units_sold']

# aggregate to daily totals (stored in a new variable so df2, the price table, is not overwritten)
daily = df1.groupby('date', as_index=False)['revenue'].sum()
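
A minimal, self-contained run of this approach on the sample data from the question (same frame names as the answer; the daily result is stored in a new variable so the price table is not clobbered):

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': pd.to_datetime(['2001-01-01', '2001-01-02', '2001-01-03',
                            '2001-01-01', '2001-01-02']),
    'skew_id': [123, 123, 123, 456, 456],
    'units_sold': [1, 2, 3, 4, 5],
})
df2 = pd.DataFrame({'skew_id': [123, 456], 'skew_price': [100.0, 10.0]})

# map() does a per-row lookup against the skew_id-indexed price Series
df1['revenue'] = df1['skew_id'].map(df2.set_index('skew_id')['skew_price']) * df1['units_sold']
daily = df1.groupby('date', as_index=False)['revenue'].sum()
print(daily)
#         date  revenue
# 0 2001-01-01    140.0
# 1 2001-01-02    250.0
# 2 2001-01-03    300.0
```

Because `map` builds only one extra column instead of materializing a merged copy of the large frame, peak memory stays close to the size of `df1` itself.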
jezrael
  • Is it less memory-consuming than `merge`? – Aryerez Nov 22 '19 at 13:42
  • @Aryerez - I think yes – jezrael Nov 22 '19 at 13:43
  • I hope the OP would comment if it did the job. I may use it one day myself. – Aryerez Nov 22 '19 at 13:46
  • How does this work for multiple columns as mapping objective? Say df1['id2'] = ['a', 'a', 'b', 'b', 'a'] and df2 would have two additional rows for each combination of skew_id and id2? Or should I post that as a separate question? [This question](https://stackoverflow.com/questions/51744786/pandas-dataframe-object-has-no-attribute-map) came close – Adriaan May 03 '21 at 13:28

You could achieve this with a groupby:

price = df2.set_index('skew_id')['skew_price']
# per-row revenue = units_sold * skew_price, summed within each date
df.groupby('date').apply(lambda gr: (gr['units_sold'] * gr['skew_id'].map(price)).sum())

Or, if you run into memory problems, you could loop over the dates yourself. This is slower but may need less memory, since only one day's rows are touched at a time.

price = df2.set_index('skew_id')['skew_price']
revenue = []
for d in df['date'].unique():
    day = df.loc[df['date'] == d]
    # per-row revenue = units_sold * skew_price, summed for this day
    revenue.append({'date': d, 'revenue': (day['units_sold'] * day['skew_id'].map(price)).sum()})
pd.DataFrame(revenue)
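
For reference, a sketch of the per-date loop on the question's sample data, assuming per-row revenue is `units_sold * skew_price` (which the question's expected output implies):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2001-01-01', '2001-01-02', '2001-01-03',
                            '2001-01-01', '2001-01-02']),
    'skew_id': [123, 123, 123, 456, 456],
    'units_sold': [1, 2, 3, 4, 5],
})
df2 = pd.DataFrame({'skew_id': [123, 456], 'skew_price': [100.0, 10.0]})

# price lookup Series, indexed by skew_id
price = df2.set_index('skew_id')['skew_price']

revenue = []
for d in df['date'].unique():          # unique() preserves first-appearance order
    day = df.loc[df['date'] == d]      # only this day's rows are in play
    revenue.append({'date': d,
                    'revenue': (day['units_sold'] * day['skew_id'].map(price)).sum()})
result = pd.DataFrame(revenue)
print(result)
#         date  revenue
# 0 2001-01-01    140.0
# 1 2001-01-02    250.0
# 2 2001-01-03    300.0
```

Each iteration allocates only one day's slice, so the extra memory is bounded by the busiest single day rather than the whole history.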
mjspier