
I have two dataframes, one with some purchasing data, and one with a weekly calendar, e.g.

df1:
purchased_at  product_id  cost
01-01-2017    1           £10
01-01-2017    2           £8
09-01-2017    1           £10
18-01-2017    3           £12

df2:
week_no  week_start  week_end
1        31-12-2016  06-01-2017
2        07-01-2017  13-01-2017
3        14-01-2017  20-01-2017

I want to use data from the two to add a 'week_no' column to df1, which is selected from df2 based on where the 'purchased_at' date in df1 falls between the 'week_start' and 'week_end' dates in df2, i.e.

df1:
purchased_at  product_id  cost  week_no
01-01-2017    1           £10   1
01-01-2017    2           £8    1
09-01-2017    1           £10   2
18-01-2017    3           £12   3

I've searched but I've not been able to find an example where the data is being pulled from a second dataframe using comparisons between the two, and I've been unable to correctly apply any examples I've found, e.g.

df1.loc[(df1['purchased_at'] < df2['week_end']) & 
        (df1['purchased_at'] > df2['week_start']), df2['week_no']]

was unsuccessful, failing with ValueError: can only compare identically-labeled Series objects

Could anyone help with this problem? I'm also open to suggestions if there is a better way to achieve the same outcome.
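For reference, one way to express this kind of between-dates lookup directly is with pd.IntervalIndex: build intervals from week_start/week_end and look each purchase date up in them. A minimal sketch using the small frames above (not the method from any answer below, just an illustration):

```python
import pandas as pd

df1 = pd.DataFrame({
    'purchased_at': pd.to_datetime(
        ['01-01-2017', '01-01-2017', '09-01-2017', '18-01-2017'], dayfirst=True),
    'product_id': [1, 2, 1, 3],
    'cost': ['£10', '£8', '£10', '£12'],
})
df2 = pd.DataFrame({
    'week_no': [1, 2, 3],
    'week_start': pd.to_datetime(
        ['31-12-2016', '07-01-2017', '14-01-2017'], dayfirst=True),
    'week_end': pd.to_datetime(
        ['06-01-2017', '13-01-2017', '20-01-2017'], dayfirst=True),
})

# one closed interval per week; the weeks must not overlap
intervals = pd.IntervalIndex.from_arrays(
    df2['week_start'], df2['week_end'], closed='both')

# get_indexer returns, for each purchase date, the position of the
# interval (week) that contains it
df1['week_no'] = df2['week_no'].values[intervals.get_indexer(df1['purchased_at'])]
print(df1)
```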

Edit: adding further detail of df1

df1 full dataframe headers

purchased_at  purchase_id  product_id  product_name  transaction_id  account_number  cost
01-01-2017    1            1           A             1               AA001           £10
01-01-2017    2            2           B             1               AA001           £8
02-01-2017    3            1           A             2               AA008           £10
03-01-2017    4            3           C             3               AB040           £12
...  
09-01-2017   12            1           A             10              AB102           £10
09-01-2017   13            2           B             11              AB102           £8
...
18-01-2017   20            3           C             15              AA001           £12

So the purchase_id increases incrementally with each row, the product_id and product_name have a 1:1 relationship, the transaction_id also increases incrementally, but there can be multiple purchases within a transaction.

Sarah
  • parsing the date of `df1` should be enough since the definition you are using for the weeks seems to be the *standard* one. So, take a look at [this](https://stackoverflow.com/questions/2600775/how-to-get-week-number-in-python) instead and forget about `df2` for the moment. – Ma0 Jul 06 '17 at 08:06
  • In your query, the dataframes will (in general) have completely different shapes. You need to construct in each data frame a key on which you can [join on](https://pandas.pydata.org/pandas-docs/stable/merging.html) (try start day of the week in df1). Together with the upper comment this should hint to a solution – Quickbeam2k1 Jul 06 '17 at 08:42
  • It's not quite the standard definition as the count will continue in future years, so next year will be weeks 53-104 and so on, which is why I'd wanted to join it separately rather than calculate it from an inbuilt formula. – Sarah Jul 06 '17 at 12:01

2 Answers


If your dataframes are not too big, you can use this trick.

Do a full Cartesian product join of all records against all records:

df_out = pd.merge(df1.assign(key=1),df2.assign(key=1),on='key')

Next, filter out the records that do not match the criteria, in this case where purchased_at is not between week_start and week_end:

(df_out.query('week_start < purchased_at < week_end')
       .drop(['key','week_start','week_end'], axis=1))

Output:

   purchased_at  product_id cost  week_no 
0    2017-01-01           1  £10        1 
3    2017-01-01           2   £8        1 
7    2017-01-09           1  £10        2 
11   2017-01-18           3  £12        3 

If you do have large dataframes, then you can use this NumPy method as proposed by piRSquared.

import numpy as np

a = df1.purchased_at.values
bh = df2.week_end.values
bl = df2.week_start.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(
    np.column_stack([df1.values[i], df2.values[j]]),
    columns=df1.columns.append(df2.columns)
).drop(['week_start','week_end'],axis=1)

Output:

          purchased_at product_id cost week_no
0  2017-01-01 00:00:00          1  £10       1
1  2017-01-01 00:00:00          2   £8       1
2  2017-01-09 00:00:00          1  £10       2
3  2017-01-18 00:00:00          3  £12       3
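As a side note, since the weekly intervals here are contiguous and sorted, pd.merge_asof can do the same lookup without building a Cartesian product: it matches each purchase to the most recent week_start at or before it. A minimal sketch, assuming parsed dates and the small frames from the question (note that a date after the final week_end would still be assigned to the last week):

```python
import pandas as pd

df1 = pd.DataFrame({
    'purchased_at': pd.to_datetime(
        ['01-01-2017', '01-01-2017', '09-01-2017', '18-01-2017'], dayfirst=True),
    'product_id': [1, 2, 1, 3],
    'cost': ['£10', '£8', '£10', '£12'],
})
df2 = pd.DataFrame({
    'week_no': [1, 2, 3],
    'week_start': pd.to_datetime(
        ['31-12-2016', '07-01-2017', '14-01-2017'], dayfirst=True),
})

# both frames must be sorted on their join keys;
# direction='backward' picks the latest week_start <= purchased_at
out = pd.merge_asof(
    df1.sort_values('purchased_at'),
    df2.sort_values('week_start'),
    left_on='purchased_at', right_on='week_start',
    direction='backward',
).drop(columns='week_start')
print(out)
```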
Scott Boston
  • The numpy method looks really useful, however the full 'week_no' output is replicated for each of the 'purchased at' date, i.e. instead of the output above, I have 16 lines: purchased_at product_id ... week_no 2017-01-01 1 1 2017-01-01 1 1 2017-01-01 1 2 2017-01-01 1 3 2017-01-01 2 1 2017-01-01 2 1 ... My code looks to match your example, do you have any thoughts on where this has gone wrong? – Sarah Jul 10 '17 at 12:07
  • apologies, that's not come out clearly. To summarise, the 'purchased_at', 'product_id' and 'cost' rows are each replicated four times, with the week_no output as [1,1,2,3,1,1,2,3,1,1,2,3,1,1,2,3] – Sarah Jul 10 '17 at 12:09
  • Sarah... You are going to have to give me the data and expected outputs for me to troubleshoot. My only guess is that maybe we are creating a Cartesian product with our join and there needs to be an additional constraint, such as product_id, in the code. – Scott Boston Jul 10 '17 at 12:38
  • Scott, thanks for your comment and help so far. I've edited my original post to add in all headers for the dataframe df1. df2 is as originally specified. The expected output is still to add a column to df1 with the corresponding week_no based on the purchase_date. If there's anything else you need please let me know. – Sarah Jul 10 '17 at 15:03

You could just use strftime() to extract the week number from the date. If you want to keep counting the weeks upwards across years, you need to define a "zero year" as the start of your time series and offset the week_no accordingly:

import pandas as pd

data = {'purchased_at': ['01-01-2017', '01-01-2017', '09-01-2017', '18-01-2017'], 'product_id': [1,2,1,3], 'cost':['£10', '£8', '£10', '£12']}

df = pd.DataFrame(data, columns=['purchased_at', 'product_id', 'cost'])

def getWeekNo(date, year0):
    # parse the date string, day first (e.g. '09-01-2017' -> 9 Jan 2017)
    datetime = pd.to_datetime(date, dayfirst=True)
    year = int(datetime.strftime('%Y'))
    weekNo = int(datetime.strftime('%U'))  # week of year, Sunday as first day
    return weekNo + 52*(year-year0)        # keep counting past year boundaries

df['week_no'] = df.purchased_at.apply(lambda x: getWeekNo(x, 2017))

Here, I use pd.to_datetime() to convert the date string from df into a datetime object. strftime('%Y') returns the year and strftime('%U') the week (with the first week of a year starting on its first Sunday; if weeks should start on Monday, use '%W' instead).
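As a quick sanity check of the %U behaviour, using the dates from the question (1 Jan 2017 was a Sunday, so it opens week 1):

```python
import pandas as pd

dates = pd.to_datetime(['01-01-2017', '09-01-2017', '18-01-2017'], dayfirst=True)
# %U counts Sunday-first weeks, zero-padded, week 0 before the first Sunday
print([ts.strftime('%U') for ts in dates])  # ['01', '02', '03']
```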

This way, you don't need to maintain a separate DataFrame only for week numbers.

Toterich