0

I am searching something the equivalent of a SQL merge using where "t1.A = t2.A OR t1.B = t2.A" OR t1.C = t2.A. I have two data frames say D1 with A, B, C, D, E as columns and D2, where few records of D2 can be pulled by A column of D1, few are from its alias B, C, D and E columns.

I tried as below but it was giving me the wrong output.

sample = D1.merge(D2,left_on=[ 'A' or'B' or'C'or 'D' or E], 
right_on=['A'], how='left')

Then I tried

sample = pd.concat([D1.merge(D2,left_on='A', right_on= 'A', how='left'), 
D1.merge(D2,left_on='B', right_on='A', how='left'), D1.merge(D2, 
left_on='C',right_on='A', how='left'),D1.merge(D2,left_on='D', 
right_on='A', how='left'),D1.merge(D2,left_on='E', right_on='A', 
how='left')])

This is giving me a lot of duplicates I tried to remove duplicate but unfortunately, it didn't work out.

dupes = (sample['A'] == sample['B']) == (sample['C'] == sample['D']) == 
sample['E']   
sample=sample.loc[~dupes]



ValueError: The truth value of a Series is ambiguous. Use a.empty, 
 a.bool(), a.item(), a.any() or a.all().

I need the output or 'sample' records to be same as records of data frame D1.

Erfan
  • 40,971
  • 8
  • 66
  • 78
PCH
  • 43
  • 5
  • In this case its crucial to see what your data looks like. Could you add example data so we can visually see what you are trying to do. For more information, look [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Erfan Jun 17 '19 at 11:52
  • you could set those columns as index and do a usual merge? – Quang Hoang Jun 17 '19 at 12:42

1 Answers1

0

Let's start from import itertools (we will use it).

I created the test DataFrames as follows:

D1 = pd.DataFrame(data=[
    [ 1, 0, 0, 0, 0, 91 ],
    [ 0, 2, 0, 0, 0, 92 ],
    [ 0, 0, 3, 0, 0, 93 ],
    [ 0, 0, 0, 4, 0, 94 ],
    [ 0, 0, 0, 0, 5, 95 ],
    [ 0, 6, 0, 0, 0, 96 ],
    [ 0, 0, 7, 0, 0, 97 ]], columns=list('ABCDEF'))

D2 = pd.DataFrame(data=[
    [ 1, 71, 89 ],
    [ 2, 72, 88 ],
    [ 3, 73, 87 ],
    [ 4, 74, 86 ],
    [ 5, 75, 85 ],
    [ 8, 76, 84 ]], columns=list('AXY'))

As you can see:

  • D1 contains "join candidate" columns A thru E and one additional column (F),
  • D2 contains one join column A and two additional columns (X and Y).

Then let's define the join function:

def myJoin(df1, df2):
    rows = itertools.product(df1.iterrows(), df2.iterrows())
    df = pd.DataFrame(left.append(right.iloc[1:])
        for (_, left), (_, right) in rows
            if right.A in left.loc['A':'E'].tolist())
    return df.reset_index(drop=True)

And the only thing to do is to call it:

myJoin(D1, D2)

The result is:

   A  B  C  D  E   F   X   Y
0  1  0  0  0  0  91  71  89
1  0  2  0  0  0  92  72  88
2  0  0  3  0  0  93  73  87
3  0  0  0  4  0  94  74  86
4  0  0  0  0  5  95  75  85

Note that column names taken from both DataFrames should be unique, so I eliminated A column from D2 (right.iloc[1:]).

Edit

The function presented above does actually inner join. If you want left join, then define another join function as:

def myJoin2(df1, df2):
    res = []
    for (_, left) in df1.iterrows():
        found = False
        for (_, right) in df2.iterrows():
            if right.A in left.loc['A':'E'].tolist():
                res.append(left.append(right.iloc[1:]))
                found = True
        if not found:
            res.append(left)
    df = pd.DataFrame(res)
    return df.reset_index(drop=True)

and call it:

myJoin2(D1, D2)

getting the result:

     A    B    C    D    E     F     X     Y
0  1.0  0.0  0.0  0.0  0.0  91.0  71.0  89.0
1  0.0  2.0  0.0  0.0  0.0  92.0  72.0  88.0
2  0.0  0.0  3.0  0.0  0.0  93.0  73.0  87.0
3  0.0  0.0  0.0  4.0  0.0  94.0  74.0  86.0
4  0.0  0.0  0.0  0.0  5.0  95.0  75.0  85.0
5  0.0  0.0  0.0  0.0  5.0  95.0  76.0  84.0
6  0.0  6.0  0.0  0.0  0.0  96.0   NaN   NaN
7  0.0  0.0  7.0  0.0  0.0  97.0   NaN   NaN

The downside is that int values are converted to float, but as NaN is also a special case of float, it can't be avoided.

Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41