I have 2 dataframes:

df1:

    RB  BeginDate   EndDate    Valindex0
0   00  19000100    19811231    45
1   00  19820100    19841299    47
2   00  19850100    20010699    50
3   00  20010700    99999999    39

df2:

    RB  IssueDate   gs
0   L3  19990201    8
1   00  19820101    G
2   48  19820101    G
3   50  19820101    G
4   50  19820101    G

How can I merge these 2 dataframes under the condition:

if df1['BeginDate'] <= df2['IssueDate'] <= df1['EndDate'] and df1['RB'] == df2['RB']:
    merge the value of df1['Valindex0'] into df2

The output should be:

df2:

    RB  IssueDate   gs  Valindex0
0   L3  19990201    8   None
1   00  19820101    G   47    # df2['RB']==df1['RB'] and df2['IssueDate'] between df1['BeginDate'] and df1['EndDate'] of this row
2   48  19820101    G   None
3   50  19820101    G   None
4   50  19820101    G   None

I know one method to do this, but it is very slow:

import numpy as np

conditions = []

for index, row in df1.iterrows():
    conditions.append((df2['IssueDate'] >= row['BeginDate']) &
                      (df2['IssueDate'] <= row['EndDate']) &
                      (df2['RB'] == row['RB']))

df2['Valindex0'] = np.select(conditions, df1['Valindex0'], default=None)

Any faster solution?

William

3 Answers


Try this:

df2 = df2.merge(df1, left_on='RB', right_on='RB', how='inner')
df2 = df2[(df2['BeginDate'] <= df2['IssueDate']) & (df2['IssueDate'] <= df2['EndDate'])]
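
Note that the inner merge plus filter drops the df2 rows that have no matching interval, whereas the expected output keeps them with None. For reference, a minimal left-merge sketch that preserves every df2 row, starting from the original df1 and df2 (column names as in the question; non-matches come back as NaN rather than None):

import pandas as pd

# left merge keeps every row of df2 (RBs with no counterpart in df1 get NaN columns)
merged = df2.reset_index().merge(df1, on='RB', how='left')

# keep only the merge rows where IssueDate falls inside [BeginDate, EndDate];
# drop_duplicates keeps the first matching interval per original df2 row
in_range = merged['IssueDate'].between(merged['BeginDate'], merged['EndDate'])
matched = merged[in_range].drop_duplicates('index')

# map Valindex0 back onto df2; rows with no match end up as NaN
result = df2.join(matched.set_index('index')['Valindex0'])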
Da Song

You can try to use SQL, because in pandas it is more complicated:

import pandas as pd
import sqlite3

conn = sqlite3.connect(':memory:')

df1.to_sql('A', conn, index=False)
df2.to_sql('B', conn, index=False)

qry = '''
    select  
        B.RB, B.IssueDate, B.gs, A.Valindex0
    from
        B left join A on
        (B.IssueDate between A.BeginDate and A.EndDate and B.RB = A.RB)
    '''
df = pd.read_sql_query(qry, conn)

#    RB  IssueDate gs  Valindex0
# 0  L3   19990201  8        NaN
# 1  00   19820101  G       47.0
# 2  48   19820101  G        NaN
# 3  50   19820101  G        NaN
# 4  50   19820101  G        NaN
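
Because the unmatched rows come back as NaN, the Valindex0 column is returned as float (47.0 above). If that matters, a small optional follow-up, assuming the df produced by the query above, is to cast it to pandas' nullable integer dtype and close the in-memory connection when done:

# cast to pandas' nullable integer dtype so missing matches show as <NA>
# instead of forcing the whole column to float
df['Valindex0'] = df['Valindex0'].astype('Int64')

conn.close()  # release the in-memory SQLite database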
Andreas

Solution

Uses: comparison with pd.Series.between + method chaining with pd.DataFrame.pipe

You can try this.

Note: I have used a slightly more generic dataset (df1, df2) to see that it works for all RB values.

What does this solution offer you?

  • Merging (inner-join) dataframes df1 and df2
  • A convenience function update_column used with pandas.DataFrame.pipe:
    • This evaluates the condition BeginDate <= IssueDate <= EndDate
    • And assigns None to any row where the condition evaluates to False.
    • If you check the output dataframe at this point, you will be able to verify whether the logic was properly implemented, as the columns BeginDate and EndDate are still available.
  • Drop unnecessary columns (BeginDate and EndDate) to get your final result.

Code

import pandas as pd

def update_column(df: pd.DataFrame, target_column: str="Valindex0"):
    cond = df["IssueDate"].between(df["BeginDate"], df["EndDate"])
    df.loc[~cond, target_column] = None
    return df

# evaluate result
result = (df2
    .merge(df1, how='inner', on="RB")                ## merge dataframes on column "RB"
    .pipe(update_column, target_column="Valindex0")  ## using piping for custom logic
    .drop(columns=["BeginDate", "EndDate"])          ## drop unnecessary columns
)

## Output: result
#    RB  IssueDate gs  Valindex0
# 0  L3   19990201  8       51.0
# 1  L3   19990201  8       50.0
# 2  00   19820101  G        NaN
# 3  00   19820101  G        NaN
# 4  00   19820101  G        NaN
# 5  00   19820101  G        NaN
# 6  48   19820101  G       58.0
# 7  50   19870101  G       52.0
# 8  50   19820121  G        NaN

Output

This is the output of the resulting dataframe, before dropping columns BeginDate and EndDate.


Dummy Data

Load dataframe df1.

import pandas as pd
from io import StringIO

df1s = """
RB  BeginDate   EndDate    Valindex0
00  19000120    19801231    45
00  19820110    19841229    47
00  19850101    20010629    50
00  20010701    99991230    39
L3  19850101    20450630    51
L3  19850111    20010609    50
50  19850121    20010619    52
48  19810204    20010699    58
"""

df1 = pd.read_csv(StringIO(df1s.strip()), sep=r'\s+',
                  dtype={"RB": str, "BeginDate": int, "EndDate": int})

Load dataframe df2.

import pandas as pd
from io import StringIO

df2s = """
RB  IssueDate   gs
L3  19990201    8
00  19820101    G
48  19820101    G
50  19870101    G
50  19820121    G
"""

df2 = pd.read_csv(StringIO(df2s.strip()), sep=r'\s+',
                  dtype={"RB": str, "IssueDate": int})
CypherX
  • @William Here you go. – CypherX Jul 22 '21 at 02:24
  • Thank you very much for your reply, but the output should be the same length as df2; now it is the same length as df1. – William Jul 22 '21 at 14:46
  • That's because of the logic you stated. If there are multiple lines in `df1` for a given `RB` value in `df2`, then how do you map them? Which `BeginDate` and `EndDate` values do you use? As I see it, your logic here does not cover the scenario. So, I would request you to take the demo data that I used (as it is more generic than the one you shared), and explain how you will get a single line for each row in `df2` and how the total number of rows in `result` will be same as that in `df2`. Once you have done this (use simple plain pen and paper), then you can explain your logic to us again. – CypherX Jul 22 '21 at 21:32
  • Hi friend, thank you so much for your help, and I have updated my question here: https://stackoverflow.com/questions/68806043/how-to-merge-or-combine-1-pandas-dataframe-to-another-one-based-on-multiple-cond – William Aug 16 '21 at 16:09