-1

I would like to merge two data frames (how=left) but not only on an index but only on a condition.

E.g assume two data frame

      C1 C2 
  A = I  3  
      K  2  
      L  5

      C1 C2 C3
  B = I  5  T
      I  0  U
      K  1  X     
      L  7  Z

Now I would like to left outer join table A with B using index C1 under the condition that A.C2 > B.C2. That is, the final result should look like

       A.C1  A.C2 B.C2 B.C3
A<-B = I        1    0    U
       K        2    1    X
       L        5    Null Null

P.S.: If you want to test it your self:

import pandas as pd

df_A = pd.DataFrame([], columns={'C 1', 'C2'})
df_A['C 1'] = ['I', 'K', 'L']
df_A['C2'] = [3, 2, 5]

df_B = pd.DataFrame([], columns={'C1', 'C2', 'C3'})
df_B['C1'] = ['I', 'I', 'K', 'L']
df_B['C2'] = [5, 0, 2, 7]
df_B['C3'] = ['T', 'U', 'X', 'Z']


Lazloo Xp
  • 858
  • 1
  • 11
  • 36
  • Show us your code, please. – Alfe Sep 10 '19 at 14:44
  • the link should help you ! - https://stackoverflow.com/questions/53549492/joining-two-pandas-dataframes-based-on-multiple-conditions – Shiva Prakash Sep 10 '19 at 14:49
  • @ShivaPrakash: No there is no conditional merge mentioned in your suggested link – Lazloo Xp Sep 11 '19 at 06:30
  • Take a look at this [Pandas: Join dataframe with condition ](https://stackoverflow.com/questions/44080248/pandas-join-dataframe-with-condition). – CypherX Sep 11 '19 at 06:39
  • Some other relevant links: (1) https://stackoverflow.com/questions/23508351/how-to-do-a-conditional-join-in-python-pandas, (2) https://github.com/pandas-dev/pandas/issues/7480. – CypherX Sep 11 '19 at 06:46
  • @CypherX: In your first link, the rows are completely discarded after the merge. I want to merge only parts from table B. That makes a difference – Lazloo Xp Sep 11 '19 at 06:47
  • @LazlooXp These links are only there for you to get some guidance/ideas. In most cases they will not give you an exact solution. What you want is `conditional merge in pandas dataframes`. – CypherX Sep 11 '19 at 06:54
  • @LazlooXp Why does your expected output have `A.C2 = [1, 2]` instead of `[3, 2]`? – CypherX Sep 11 '19 at 08:08

3 Answers3

0

The quick and dirty solution would be to simply join on the C1 column and then put NULL or NaN into C3 for all the rows where C2_1 > C2_2.

weiqiman
  • 1
  • 2
  • I expand the description of the problem. As you can see now, just replacing values with does unfortunately not help as a further problem is the duplication of rows. – Lazloo Xp Sep 11 '19 at 06:54
0

I found one non-pandas-native solution:

import pandas as pd
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

df_A = pd.DataFrame([], columns={'C1', 'C2'})
df_A['C1'] = ['I', 'K', 'L']
df_A['C2'] = [3, 2, 5]
cols = df_A.columns
cols = cols.map(lambda x: x.replace(' ', '_'))
df_A.columns = cols

df_B = pd.DataFrame([], columns={'C1', 'C2', 'C3'})
df_B['C1'] = ['I', 'I', 'K', 'L']
df_B['C2'] = [5, 0, 2, 7]
df_B['C3'] = ['T', 'U', 'X', 'Z']


# df_merge = pd.merge(left=df_A, right=df_B, how='left', on='C1')

df_sql = pysqldf("""
select *
from df_A t_1
left join df_B t_2 on t_1.C1 = t_2.C1 and t_1.C2 >= t_2.C2
;
""")

However, for big tables, pandasql turns out to be less performant.

Output:

   C2 C1    C3   C2    C1
0   3  I     U  0.0     I
1   2  K     X  2.0     K
2   5  L  None  NaN  None
Lazloo Xp
  • 858
  • 1
  • 11
  • 36
0

Method: direct SQL query into pandas using pandasql library. reference

import pandas as pd

df_A = pd.DataFrame([], columns={'C1', 'C2'})
df_A['C1'] = ['I', 'K']
df_A['C2'] = [3, 2]

df_B = pd.DataFrame([], columns={'C1', 'C2', 'C3'})
df_B['C1'] = ['I', 'I', 'K']
df_B['C2'] = [5, 0, 2]
df_B['C3'] = ['T', 'U', 'X']

It appears to me that the conditions that you specified for doing the outer join on (A.C1 = B.C1) does not produce the expected result. I needed to do GROUP BY A.C1 in order to drop duplicate rows having same values in A.C1 after the join.

import pandasql as ps

q = """
SELECT A.C1 as 'A.C1', 
       A.C2 as 'A.C2', 
       B.C2 as 'B.C2', 
       B.C3 as 'B.C3' 
    FROM df_A AS A    
    LEFT OUTER JOIN df_B AS B 
    --ON A.C1 = B.C1 AND A.C2 = B.C2
    WHERE A.C2 > B.C2
    GROUP BY A.C1
"""
print(ps.sqldf(q, locals()))

Output

  A.C1  A.C2  B.C2 B.C3
0    I     3     2    X
1    K     2     0    U

Other References

  1. https://www.zentut.com/sql-tutorial/sql-outer-join/
  2. Executing an SQL query over a pandas dataset
  3. How to do a conditional join in python Pandas?
  4. https://github.com/pandas-dev/pandas/issues/7480
  5. https://medium.com/jbennetcodes/how-to-rewrite-your-sql-queries-in-pandas-and-more-149d341fc53e
CypherX
  • 7,019
  • 3
  • 25
  • 37
  • What is still missing is the behavior when there is no match in table B (either because C1 is not matching or the condition on C2 is not fulfilled). In these cases, the should not be deleted but the right side should be filled with nulls. I will adapt the description accordingly later. THX for now – Lazloo Xp Sep 11 '19 at 08:42