I have two dataframes in python using pandas:
- df1: [233,500 x 6] ledger of prices by date
- df2: [1,665,997 x 5] ledger of customer usage by date
On a smaller data set in excel I can simply do the following to create a column on df2 containing each products price on a given date, by row.
=SUMIFS(df1[Rate],df1[Date],[@Date],df1[Jurisdiction],[@Jurisdiction],df1[Product],[@Product])
Ultimately I want to add a column to df2 (or make a new dataframe with the result) that provides price by matching some criteria in each row (Jurisdiction, Product Type) with the same criteria in the price ledger df1.
What's the most appropriate way to associate this data using python and dataframes? Some sort of dictionary and a join?
The examples I've found are mainly dealing with summing given some conditions:
Bonus: The dates in df1 and df2 will not always match identically. Matching the date in df2 with the most recent price ledger date will need to happen.
Edit: I've include simplified data below to demonstrate how I'm trying to do the most-recent date matching. The date matching is just a middle step towards finding the active price on each date in df2. The direct merge on the two date columns doesn't work because prices are not provided every day.
df1:
Date Price
1/11/2016 5.00
1/12/2016 5.50
1/13/2016 6.00
1/14/2016 7.00
1/16/2016 8.00
1/20/2016 9.00
1/21/2016 10.00
1/22/2016 11.00
df2:
Date Volume
1/11/2016 100
1/15/2016 100
1/17/2016 200
1/18/2016 300
1/20/2016 200
df3: (df2 with date matching. Cost = Volume*Price)
Date Volume MatchedDate Price Cost
1/11/2016 100 1/11/2016 5.00 500
1/15/2016 100 1/14/2016 7.00 700
1/17/2016 200 1/16/2016 8.00 1600
1/18/2016 300 1/16/2016 8.00 2400
1/20/2016 200 1/20/2016 9.00 1800
Edit 2: The formula provided in the first section in the following works, in excel, with some extra logic to use the first date for df2 entries preceding the first df1 entry. https://www.extendoffice.com/documents/excel/2601-excel-find-closest-date.html
{=MAX((df1[Date]<[@Date])*df1[Date])}