1

I have 2 dataframes, df1 and df2.

df1 (the lookup)    
group_id       date     value
0    105716  1/30/2019    Soccer
1    105717  1/30/2019  Football
2    105718  1/30/2019      Rest
3    105719  1/30/2019    Soccer
4    105716  1/31/2019      Rest
5    105717  1/31/2019      Rest
6    105718  02/01/2019  Football
7    105719  02/01/2019    Soccer
8    105719  02/02/2019    Tennis
9    105722  02/03/2019    Tennis

df2 (the parent)
    GROUP_ID  STARTDATE    ENDDATE
0     105716  1/30/2019  1/30/2019
1     105717  1/30/2019  1/30/2019
2     105718  1/30/2019  1/30/2019
3     105719  1/30/2019  1/30/2019
4     105716  1/30/2019  1/31/2019
5     105717  1/31/2019  1/31/2019
6     105718  1/31/2019  1/31/2019
7     105719  1/31/2019  1/31/2019
8     105716  1/31/2019  1/31/2019
9     105717  1/31/2019  1/31/2019
10    105718  1/31/2019  1/31/2019
11    105719  1/31/2019   2/1/2019
12    105716   2/1/2019   2/1/2019
13    105717   2/1/2019   2/1/2019
14    105718   2/1/2019   2/1/2019
15    105719   2/1/2019   2/1/2019
16    105716   2/1/2019   2/1/2019
17    105717   2/1/2019   2/1/2019
18    105718   2/1/2019   2/1/2019
19    105719   2/1/2019   2/1/2019
20    105716   2/1/2019   2/2/2019
21    105717   2/2/2019   2/2/2019
22    105718   2/2/2019   2/2/2019
23    105719   2/2/2019   2/2/2019
24    105716   2/2/2019   2/2/2019
25    105717   2/2/2019   2/2/2019
26    105718   2/2/2019   2/2/2019
27    105719   2/2/2019   2/3/2019
28    105716   2/3/2019   2/3/2019
29    105722   2/3/2019   2/3/2019

df2 (the expected output)
    GROUP_ID  STARTDATE    ENDDATE    VALUE
0     105716  1/30/2019  1/30/2019    Soccer
1     105717  1/30/2019  1/30/2019    Football
2     105718  1/30/2019  1/30/2019    Rest
3     105719  1/30/2019  1/30/2019    Soccer
4     105716  1/30/2019  1/31/2019    Rest
5     105717  1/31/2019  1/31/2019    Rest
6     105718  1/31/2019  1/31/2019    None
7     105719  1/31/2019  1/31/2019    None
8     105716  1/31/2019  1/31/2019    Rest
9     105717  1/31/2019  1/31/2019    Rest
10    105718  1/31/2019  1/31/2019    None
11    105719  1/31/2019   2/1/2019    None
12    105716   2/1/2019   2/1/2019    None
13    105717   2/1/2019   2/1/2019    None
14    105718   2/1/2019   2/1/2019    Football
15    105719   2/1/2019   2/1/2019    Soccer
16    105716   2/1/2019   2/1/2019    None
17    105717   2/1/2019   2/1/2019    None
18    105718   2/1/2019   2/1/2019    Football
19    105719   2/1/2019   2/1/2019    Soccer
20    105716   2/1/2019   2/2/2019    None
21    105717   2/2/2019   2/2/2019    None
22    105718   2/2/2019   2/2/2019    None
23    105719   2/2/2019   2/2/2019    Tennis
24    105716   2/2/2019   2/2/2019    None
25    105717   2/2/2019   2/2/2019    None
26    105718   2/2/2019   2/2/2019    None
27    105719   2/2/2019   2/3/2019    None
28    105716   2/3/2019   2/3/2019    None
29    105722   2/3/2019   2/3/2019    Tennis

I am trying to add a RESULT field to df2 and populate it with the value from df1 where GROUP_ID = group_id and date is between STARTDATE and ENDDATE, while keeping all the rows in df2. Any NaN/null values should be set to 'None'. I can do this with a loop, but it takes a while to go through everything. What I tried was numpy.where():

df2['RESULT'] = 'None'
df2['RESULT'] = np.where((df1.group_id == df2.GROUP_ID) & (df1.date >= df2.STARTDATE) & (df1.date <= df2.ENDDATE), df1.value, 'None')

and vectorized approach

df2['RESULT'] = df1.value[(df1.group_id == df2.GROUP_ID) & (df1.date >= df2.STARTDATE) & (df1.date <= df2.ENDDATE)]

and a merge approach

merged = pd.merge(df2, df1, left_on='GROUP_ID', right_on='group_id')
df_activity = merged[(merged['STARTDATE'] <= merged['date']) & (merged['ENDDATE'] >= merged['date'])]

The first two give me an error:

ValueError: Can only compare identically-labeled DataFrame objects

The merge works, but it also drops all the rows that don't match. I think I can work around this with another merge (roughly as sketched below), but if there is a faster, more streamlined way I would like to use it.
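For reference, the "another merge" workaround would look roughly like this (just a sketch, assuming pandas is imported as pd and df1/df2 are the frames above): parse the string dates, inner-merge on the group, filter by the date window, then map the first match back onto df2's original index so unmatched rows come through as 'None'.

d1 = df1.assign(date=pd.to_datetime(df1['date']))
d2 = df2.assign(STARTDATE=pd.to_datetime(df2['STARTDATE']),
                ENDDATE=pd.to_datetime(df2['ENDDATE']))

# Keep df2's original row labels so the matches can be mapped back afterwards.
m = d2.reset_index().merge(d1, left_on='GROUP_ID', right_on='group_id')
m = m[m['date'].between(m['STARTDATE'], m['ENDDATE'])]

# First match per df2 row (a window can span two lookup dates; adjust the tie-break if needed).
first = m.drop_duplicates('index').set_index('index')['value']
df2['RESULT'] = first.reindex(df2.index).fillna('None')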

MrKingsley
    Instead of going through the trouble of describing all of your data, it would be much easier and more useful to post samples of each data frame – Chris Jan 17 '23 at 14:30

2 Answers

1

Merge first on group_id, then query to filter your rows by date:

out = (df2.merge(df1, on='group_id')
          .query("(startdate <= date) & (date <= enddate)"))
print(out)

# Output:
   group_id  startdate    enddate       date  value
0         1 2023-01-15 2023-01-20 2023-01-17    100

Input dataframes:

>>> df1
   group_id       date  value
0         1 2023-01-17    100
1         1 2023-01-30    200

>>> df2
   group_id  startdate    enddate
0         1 2023-01-15 2023-01-20
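
One caveat when applying this to the question's frames (a side note, not part of the answer above): the dates there are strings in mixed formats, so the comparisons in query would be lexicographic; converting to datetime first avoids that, and the column names need the question's casing, e.g.:

df1['date'] = pd.to_datetime(df1['date'])
df2['STARTDATE'] = pd.to_datetime(df2['STARTDATE'])
df2['ENDDATE'] = pd.to_datetime(df2['ENDDATE'])

out = (df2.merge(df1, left_on='GROUP_ID', right_on='group_id')
          .query("(STARTDATE <= date) & (date <= ENDDATE)"))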
Corralien
    I had to change the join because the field names in one df are caps and in the other lower, but the approach worked. jdf2 = pd.merge(df2, df1, left_on='GROUP_ID', right_on='group_id').query("(STARTDATE <= date) & (date <= ENDDATE)") – MrKingsley Jan 18 '23 at 13:20
  • Sorry, it actually doesn't work; it drops anything that doesn't match from the original dataframe, and I need to keep those. Although I suppose I could join it back to the original df and populate the column that way... – MrKingsley Jan 18 '23 at 15:48
  • Can you provide the expected output for your two dataframes please? – Corralien Jan 18 '23 at 17:36
  • I added an expected result, but I had to do it manually; I also made a minor change to the lookup so that there would be more cross-references. The real data has tens of thousands of rows in df2 and hundreds of rows in df1. – MrKingsley Jan 19 '23 at 14:29
  • I ended up working around this by doing the join to the source data then building df2. df2 is made by comparing pairs of rows from the source to get the start and end dates. I figured if there was a mismatch on the result field I just need to take the result from the row with the oldest date. – MrKingsley Jan 19 '23 at 17:09
0

It seems this question is a duplicate of, or at least closely related to, this one here. From what I gained engaging with the member there, I implemented this solution.

import pandas as pd

def lookup_value(groupid, date1, date2, lookup):
    # Rows of the lookup whose group matches and whose date falls within [date1, date2].
    match = ((lookup['group_id'].astype(int) == int(groupid))
             & (pd.to_datetime(lookup['date']) >= pd.to_datetime(date1))
             & (pd.to_datetime(lookup['date']) <= pd.to_datetime(date2)))
    if not match.any():
        return ['None']
    return lookup['value'][match].values

if __name__ == '__main__':
    lookup_cols = ['group_id','date','value']
    lookup_data = [['105716','1/30/2019','Soccer'],
    ['105717','1/30/2019','Football'],
    ['105718','1/30/2019','Rest'],
    ['105719','1/30/2019','Soccer'],
    ['105716','1/31/2019','Rest'],
    ['105717','1/31/2019','Rest'],
    ['105718','02/01/2019','Football'],
    ['105719','02/01/2019','Soccer'],
    ['105719','02/02/2019','Tennis'],
    ['105722','02/03/2019','Tennis']]

    parent_cols = ['GROUP_ID','STARTDATE','ENDDATE']
    parent_data = [['105716','1/30/2019','1/30/2019'],
    ['105717','1/30/2019','1/30/2019'],
    ['105718','1/30/2019','1/30/2019'],
    ['105719','1/30/2019','1/30/2019'],
    ['105716','1/30/2019','1/31/2019'],
    ['105717','1/31/2019','1/31/2019'],
    ['105718','1/31/2019','1/31/2019'],
    ['105719','1/31/2019','1/31/2019'],
    ['105716','1/31/2019','1/31/2019'],
    ['105717','1/31/2019','1/31/2019'],
    ['105718','1/31/2019','1/31/2019'],
    ['105719','1/31/2019','2/1/2019'],
    ['105716','2/1/2019','2/1/2019'],
    ['105717','2/1/2019','2/1/2019'],
    ['105718','2/1/2019','2/1/2019'],
    ['105719','2/1/2019','2/1/2019'],
    ['105716','2/1/2019','2/1/2019'],
    ['105717','2/1/2019','2/1/2019'],
    ['105718','2/1/2019','2/1/2019'],
    ['105719','2/1/2019','2/1/2019'],
    ['105716','2/1/2019','2/2/2019'],
    ['105717','2/2/2019','2/2/2019'],
    ['105718','2/2/2019','2/2/2019'],
    ['105719','2/2/2019','2/2/2019'],
    ['105716','2/2/2019','2/2/2019'],
    ['105717','2/2/2019','2/2/2019'],
    ['105718','2/2/2019','2/2/2019'],
    ['105719','2/2/2019','2/3/2019'],
    ['105716','2/3/2019','2/3/2019'],
    ['105722','2/3/2019','2/3/2019']]

    parent_df = pd.DataFrame(data=parent_data, columns=parent_cols)
    lookup_df = pd.DataFrame(data=lookup_data, columns=lookup_cols)

    parent_df['VALUE'] = parent_df.apply(lambda x: lookup_value(x['GROUP_ID'], x['STARTDATE'], x['ENDDATE'], lookup_df)[0], axis=1)
    print(parent_df)

Output:

GROUP_ID  STARTDATE    ENDDATE     VALUE
0    105716  1/30/2019  1/30/2019    Soccer
1    105717  1/30/2019  1/30/2019  Football
2    105718  1/30/2019  1/30/2019      Rest
3    105719  1/30/2019  1/30/2019    Soccer
4    105716  1/30/2019  1/31/2019    Soccer
5    105717  1/31/2019  1/31/2019      Rest
6    105718  1/31/2019  1/31/2019      None
7    105719  1/31/2019  1/31/2019      None
8    105716  1/31/2019  1/31/2019      Rest
9    105717  1/31/2019  1/31/2019      Rest
10   105718  1/31/2019  1/31/2019      None
11   105719  1/31/2019   2/1/2019    Soccer
12   105716   2/1/2019   2/1/2019      None
13   105717   2/1/2019   2/1/2019      None
14   105718   2/1/2019   2/1/2019  Football
15   105719   2/1/2019   2/1/2019    Soccer
16   105716   2/1/2019   2/1/2019      None
17   105717   2/1/2019   2/1/2019      None
18   105718   2/1/2019   2/1/2019  Football
19   105719   2/1/2019   2/1/2019    Soccer
20   105716   2/1/2019   2/2/2019      None
21   105717   2/2/2019   2/2/2019      None
22   105718   2/2/2019   2/2/2019      None
23   105719   2/2/2019   2/2/2019    Tennis
24   105716   2/2/2019   2/2/2019      None
25   105717   2/2/2019   2/2/2019      None
26   105718   2/2/2019   2/2/2019      None
27   105719   2/2/2019   2/3/2019    Tennis
28   105716   2/3/2019   2/3/2019      None
29   105722   2/3/2019   2/3/2019    Tennis

Process finished with exit code 0
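
A note on speed (an aside, not from the answer above): lookup_value re-parses the lookup's dates and re-casts group_id on every apply call. Converting those columns once up front keeps the same row-wise approach but should be noticeably cheaper on the tens of thousands of rows mentioned in the comments; a minimal sketch (lookup_value_fast is a hypothetical rename, everything else is from the script above):

# Convert once instead of on every call.
lookup_df['date'] = pd.to_datetime(lookup_df['date'])
lookup_df['group_id'] = lookup_df['group_id'].astype(int)

def lookup_value_fast(groupid, date1, date2, lookup):
    # Same matching rule as lookup_value, without repeated conversions of the lookup columns.
    match = ((lookup['group_id'] == int(groupid))
             & (lookup['date'] >= pd.to_datetime(date1))
             & (lookup['date'] <= pd.to_datetime(date2)))
    return lookup.loc[match, 'value'].values if match.any() else ['None']

parent_df['VALUE'] = parent_df.apply(
    lambda x: lookup_value_fast(x['GROUP_ID'], x['STARTDATE'], x['ENDDATE'], lookup_df)[0], axis=1)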
MrKingsley