How to merge or combine 1 pandas dataframe to another one based on multiple conditions

Question

I have 2 dataframes:

df1 and df2 ,df1 is used to be a reference or lookups file for df2.It means we need use each row of df1 to match with each row of df2 and then merge df1 into df2 and then out put new df2.

df1:

    RB  BeginDate   EndDate    Valindex0
0   00  19000100    19811231    45
1   00  19820100    19841299    47
2   00  19850100    20010699    50
3   00  20010700    99999999    39

df2:

    RB  IssueDate   gs
0   L3  19990201    8
1   00  19820101    G
2   48  19820101    G
3   50  19820101    G
4   50  19820101    G
5   00  19860101    G
6   52  19820101    G
7   53  19820101    G
8   00  19500201    G

how to merge this 2 dataframes base on condition:

if df1['BeginDate'] <= df2['IssueDate'] <= df1['EndDate'] and df1['RB']==df2['RB']:
    merge the value of df1['Valindex0'] to df2

Notice the final out put is merge df1 into df2,because df1 is just like a reference or lookup file for df2.It means we need use each row of df1 to match with each row of df2,then output new df2

The output should looks like:

df2:

    RB  IssueDate   gs  Valindex0
0   L3  19990201    8   None
1   00  19820101    G   47    # df2['RB']==df1['RB'] and df2['IssueDate'] between df1['BeginDate'] and df1['EndDate'] of this row
2   48  19820101    G   None
3   50  19820101    G   None
4   50  19820101    G   None
5   00  19860101    G   50
6   52  19820101    G   None
7   53  19820101    G   None
8   00  19500201    G   45

I know one method to do this,but it is extremally slow, especially when the length of d1 is big:

conditions = []

for index, row in df1.iterrows():
    conditions.append((df2['IssueDate']>= df1['BeginDate']) &
                      (df2['IssueDate']<= df1['BeginDate'])&
                      (df2['RB']==df1['RB']))

df2['Valindex0'] = np.select(conditions, df1['Valindex0'], default=None)

Any faster solution?

score 2 · Answer 1 · answered Aug 16 '21 at 17:06

Use IntervalIndex -

idx = pd.IntervalIndex.from_arrays(df1['BeginDate'],df1['EndDate'],closed='both')
for x in df1['RB'].unique():
    mask = df2['RB']==x
    df2.loc[mask, 'Valindex0'] = df1.loc[idx.get_indexer(df2.loc[mask, 'IssueDate']), 'Valindex0'].values

Output

   RB  IssueDate gs  Valindex0
0  L3   19990201  8        NaN
1  00   19820101  G       47.0
2  48   19820101  G        NaN
3  50   19820101  G        NaN
4  50   19820101  G        NaN
5  00   19860101  G       50.0
6  52   19820101  G        NaN
7  53   19820101  G        NaN
8  00   19500201  G       45.0

Thank you for your reply,may I ask what if I have more columns need to check for example a column named 'type',are in both dataframes and just like column RB ,df1['type']=df2['type'],can I use for x in df1[['RB','type']].unique():...? — William, Aug 17 '21 at 05:13

fthomson · Answer 2 · 2021-08-16T16:49:29.360

1

If you dont care about the ordering of the records or index you could do this

df1 = pd.read_clipboard()
df2 = pd.read_clipboard()

# pandas wants to cast this column as int
df1['RB'] = '00'

new_df = df2.merge(df1, how='outer', on='RB')

mask = ((new_df['BeginDate'] <= new_df['IssueDate']) & (new_df['IssueDate'] <= new_df['EndDate'])
       )| new_df['Valindex0'].isnull()

new_df[['RB','IssueDate', 'gs', 'Valindex0']][mask]

The idea is to do a full outer join first and then filter the dataset after.

Results:

    RB  IssueDate   gs  Valindex0
0   L3  19990201    8   NaN
2   00  19820101    G   47.0
7   00  19860101    G   50.0
9   00  19500201    G   45.0
13  48  19820101    G   NaN
14  50  19820101    G   NaN
15  50  19820101    G   NaN
16  52  19820101    G   NaN
17  53  19820101    G   NaN

edited Aug 16 '21 at 16:49

answered Aug 16 '21 at 16:43

fthomson

773
3
9

Thank you for your reply,I have a question,can I don't defin df1['RB'] = '00' because RB is a dynamic value，I can't define,it not always equals to 00 ,it also may equals to 48 or 52 or anyother numbers – William Aug 16 '21 at 16:58
Hi @William you do not need to define ```df1['RB'] = '00'```. When I copied your dataset, pandas converted that column to int which changed all of the values from '00' to '0' . That was simply to replicate your dataset – fthomson Aug 16 '21 at 17:14

How to merge or combine 1 pandas dataframe to another one based on multiple conditions

2 Answers2

Linked