I have 2 dataframes:

df1:

    RB  BeginDate   EndDate    Valindex0
0   00  19000100    19811231    45
1   00  19820100    19841299    47
2   00  19850100    20010699    50
3   00  20010700    99999999    39

df2:

    RB  IssueDate   gs
0   L3  19990201    8
1   00  19820101    G
2   48  19820101    G
3   50  19820101    G
4   50  19820101    G

How can I merge these 2 dataframes under the condition:

if df1['BeginDate'] <= df2['IssueDate'] <= df1['EndDate'] and df1['RB'] == df2['RB']:
    merge the value of df1['Valindex0'] into df2

The output should be:

df2:

    RB  IssueDate   gs  Valindex0
0   L3  19990201    8   None
1   00  19820101    G   47    # df2['RB']==df1['RB'] and df2['IssueDate'] between df1['BeginDate'] and df1['EndDate'] of this row
2   48  19820101    G   None
3   50  19820101    G   None
4   50  19820101    G   None

I know one method to do this, but it is very slow:

import numpy as np

conditions = []

for index, row in df1.iterrows():
    conditions.append((df2['IssueDate'] >= row['BeginDate']) &
                      (df2['IssueDate'] <= row['EndDate']) &
                      (df2['RB'] == row['RB']))

df2['Valindex0'] = np.select(conditions, df1['Valindex0'], default=None)

Any faster solution?

William

3 Answers


Try this:

df2 = df2.merge(df1, left_on='RB', right_on='RB', how='inner')
df2 = df2[(df2['BeginDate'] <= df2['IssueDate']) & (df2['IssueDate'] <= df2['EndDate'])]
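
Note that the inner merge plus filter drops the df2 rows that have no matching interval, whereas the expected output keeps them with None. For reference, a minimal left-merge sketch that preserves every df2 row, starting from the original df1 and df2 (column names as in the question; non-matches come back as NaN rather than None):

import pandas as pd

# left merge keeps every row of df2 (RBs with no counterpart in df1 get NaN columns)
merged = df2.reset_index().merge(df1, on='RB', how='left')

# keep only the merge rows where IssueDate falls inside [BeginDate, EndDate];
# drop_duplicates keeps the first matching interval per original df2 row
in_range = merged['IssueDate'].between(merged['BeginDate'], merged['EndDate'])
matched = merged[in_range].drop_duplicates('index')

# map Valindex0 back onto df2; rows with no match end up as NaN
result = df2.join(matched.set_index('index')['Valindex0'])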
Da Song

You can try to use SQL, because in pandas it is more complicated:

import pandas as pd
import sqlite3

conn = sqlite3.connect(':memory:')

df1.to_sql('A', conn, index=False)
df2.to_sql('B', conn, index=False)

qry = '''
    select  
        B.RB, B.IssueDate, B.gs, A.Valindex0
    from
        B left join A on
        (B.IssueDate between A.BeginDate and A.EndDate and B.RB = A.RB)
    '''
df = pd.read_sql_query(qry, conn)

#    RB  IssueDate gs  Valindex0
# 0  L3   19990201  8        NaN
# 1  00   19820101  G       47.0
# 2  48   19820101  G        NaN
# 3  50   19820101  G        NaN
# 4  50   19820101  G        NaN
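
Because the unmatched rows come back as NaN, the Valindex0 column is returned as float (47.0 above). If that matters, a small optional follow-up, assuming the df produced by the query above, is to cast it to pandas' nullable integer dtype and close the in-memory connection when done:

# cast to pandas' nullable integer dtype so missing matches show as <NA>
# instead of forcing the whole column to float
df['Valindex0'] = df['Valindex0'].astype('Int64')

conn.close()  # release the in-memory SQLite database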
Andreas

Solution

Uses: comparison with pd.Series.between + method chaining with pd.DataFrame.pipe

You can try this.

Note: I have used a slightly more generic dataset (df1, df2) to see that it works for all RB values.

What does this solution offer you?

  • Merging (inner-join) dataframes df1 and df2
  • A convenience function update_column used with pandas.DataFrame.pipe:
    • This evaluates the condition BeginDate <= IssueDate <= EndDate
    • And assigns None to any row where the condition evaluates to False.
    • If you check the output dataframe at this point, you will be able to verify whether the logic was properly implemented, as the columns BeginDate and EndDate are still available.
  • Drop unnecessary columns (BeginDate and EndDate) to get your final result.

Code

import pandas as pd

def update_column(df: pd.DataFrame, target_column: str="Valindex0"):
    cond = df["IssueDate"].between(df["BeginDate"], df["EndDate"])
    df.loc[~cond, target_column] = None
    return df

# evaluate result
result = (df2
    .merge(df1, how='inner', on="RB")                ## merge dataframes on column "RB"
    .pipe(update_column, target_column="Valindex0")  ## using piping for custom logic
    .drop(columns=["BeginDate", "EndDate"])          ## drop unnecessary columns
)

## Output: result
#    RB  IssueDate gs  Valindex0
# 0  L3   19990201  8       51.0
# 1  L3   19990201  8       50.0
# 2  00   19820101  G        NaN
# 3  00   19820101  G        NaN
# 4  00   19820101  G        NaN
# 5  00   19820101  G        NaN
# 6  48   19820101  G       58.0
# 7  50   19870101  G       52.0
# 8  50   19820121  G        NaN

Output

This is the output of the resulting dataframe, before dropping columns BeginDate and EndDate.


Dummy Data

Load dataframe df1.

import pandas as pd
from io import StringIO

df1s = """
RB  BeginDate   EndDate    Valindex0
00  19000120    19801231    45
00  19820110    19841229    47
00  19850101    20010629    50
00  20010701    99991230    39
L3  19850101    20450630    51
L3  19850111    20010609    50
50  19850121    20010619    52
48  19810204    20010699    58
"""

df1 = pd.read_csv(StringIO(df1s.strip()), sep=r'\s+',
                  dtype={"RB": str, "BeginDate": int, "EndDate": int})

Load dataframe df2.

import pandas as pd
from io import StringIO

df2s = """
RB  IssueDate   gs
L3  19990201    8
00  19820101    G
48  19820101    G
50  19870101    G
50  19820121    G
"""

df2 = pd.read_csv(StringIO(df2s.strip()), sep=r'\s+',
                  dtype={"RB": str, "IssueDate": int})
CypherX
  • @William Here you go. – CypherX Jul 22 '21 at 02:24
  • Thank you very much for your reply, but the output should be the same length as df2; now it is the same length as df1. – William Jul 22 '21 at 14:46
  • That's because of the logic you stated. If there are multiple lines in `df1` for a given `RB` value in `df2`, then how do you map them? Which `BeginDate` and `EndDate` values do you use? As I see it, your logic here does not cover the scenario. So, I would request you to take the demo data that I used (as it is more generic than the one you shared), and explain how you will get a single line for each row in `df2` and how the total number of rows in `result` will be same as that in `df2`. Once you have done this (use simple plain pen and paper), then you can explain your logic to us again. – CypherX Jul 22 '21 at 21:32
  • Hi friend, thank you so much for your help, and I have updated my question here: https://stackoverflow.com/questions/68806043/how-to-merge-or-combine-1-pandas-dataframe-to-another-one-based-on-multiple-cond – William Aug 16 '21 at 16:09