2

I am attempting to merge 2 dataframes in pandas based on a common column. Reading the documentation on merging, everything I am doing implicates that the dataframes should merge on the columns specified, while only keeping common values. It certainly keeps the common values, but produces a massive dataframe that is essentially an entire "right" dataframe for every index in the "left" dataframe.

My left dataframe is

left
Out[325]: 
     A   B   C   D MergeThis
0   38  43  20  34       -N2
1   14  44  23  91       -N2
2   55  50  30  92       -N2
3   29  20  56   0       -N2
4   61   6  38  44       -N2
5   18  61  30   7       -N2
6   67  45  60  81       -N2
7    7  53  60  72       +N2
8   15  46   0  82       +N2
9    6  90  83  36       +N2
10   0  87  64  82       +N2
11  13  22  13  39       +N2
12  65  88  50  34       +N2
13  35  31  72  34       +N2

and my right data frame is

right
Out[326]: 
    E   F   G   H MergeThis
0  22  81  97  28       +N2
1  20   4  44  48       +N2
2  19  41  52  25       +N2
3  30  37  66  24       +N2
4  32  40  56  92       +N2
5  95  76  91  31       +N2
6  36  66  17  85       +N2

Obviously, I am attempting to merge on the "MergeThis" column, such that I end up with the +N2 values combined, with the columns from both "left" and "right". However, what I end up with is something strange:

left.merge(right, on='MergeThis')
Out[327]: 
     A   B   C   D MergeThis   E   F   G   H
0    7  53  60  72       +N2  22  81  97  28
1    7  53  60  72       +N2  20   4  44  48
2    7  53  60  72       +N2  19  41  52  25
3    7  53  60  72       +N2  30  37  66  24
4    7  53  60  72       +N2  32  40  56  92
5    7  53  60  72       +N2  95  76  91  31
6    7  53  60  72       +N2  36  66  17  85
7   15  46   0  82       +N2  22  81  97  28
8   15  46   0  82       +N2  20   4  44  48
9   15  46   0  82       +N2  19  41  52  25
10  15  46   0  82       +N2  30  37  66  24
11  15  46   0  82       +N2  32  40  56  92
12  15  46   0  82       +N2  95  76  91  31
13  15  46   0  82       +N2  36  66  17  85
14   6  90  83  36       +N2  22  81  97  28
15   6  90  83  36       +N2  20   4  44  48
16   6  90  83  36       +N2  19  41  52  25
17   6  90  83  36       +N2  30  37  66  24
18   6  90  83  36       +N2  32  40  56  92
19   6  90  83  36       +N2  95  76  91  31
20   6  90  83  36       +N2  36  66  17  85
21   0  87  64  82       +N2  22  81  97  28
22   0  87  64  82       +N2  20   4  44  48
23   0  87  64  82       +N2  19  41  52  25
24   0  87  64  82       +N2  30  37  66  24
25   0  87  64  82       +N2  32  40  56  92
26   0  87  64  82       +N2  95  76  91  31
27   0  87  64  82       +N2  36  66  17  85
28  13  22  13  39       +N2  22  81  97  28
29  13  22  13  39       +N2  20   4  44  48
30  13  22  13  39       +N2  19  41  52  25
31  13  22  13  39       +N2  30  37  66  24
32  13  22  13  39       +N2  32  40  56  92
33  13  22  13  39       +N2  95  76  91  31
34  13  22  13  39       +N2  36  66  17  85
35  65  88  50  34       +N2  22  81  97  28
36  65  88  50  34       +N2  20   4  44  48
37  65  88  50  34       +N2  19  41  52  25
38  65  88  50  34       +N2  30  37  66  24
39  65  88  50  34       +N2  32  40  56  92
40  65  88  50  34       +N2  95  76  91  31
41  65  88  50  34       +N2  36  66  17  85
42  35  31  72  34       +N2  22  81  97  28
43  35  31  72  34       +N2  20   4  44  48
44  35  31  72  34       +N2  19  41  52  25
45  35  31  72  34       +N2  30  37  66  24
46  35  31  72  34       +N2  32  40  56  92
47  35  31  72  34       +N2  95  76  91  31
48  35  31  72  34       +N2  36  66  17  85

Which has essentially created a duplicate dataframe for every index of my "right" dataframe.

On the one hand, I know how to do another workaround to get the dataframe I want with more code and breaking up one of the dataframes a bit more so that the merge is "prettier". However, I really want to know WHY this is happening? I want to be able to use the merge functionality in the future, but I don't fully understand what pandas is trying to do to produce this.

Any help is greatly appreciated.

Matt P.
  • 153
  • 1
  • 10

1 Answers1

1

The result looks correct. The join criteria is for every row of left with MergeThis equal to +N2 to be joined to each row of right that has that matching condition. There are 7 such rows in each of the data frames, resulting in 49 rows for the cartesian product of row sets in the output.

Note that the merge type for DataFrame.merge are left, right, inner and outer, which determine the set of keys to use for the join, not a restriction on which rows will match to which other rows. The default of inner just means to use the intersection of values contained in the join column(s) between left and right, which will just be +N2 anyway. If you used right you'd get the same result since the right DataFrame contains only +N2 for this column. Likewise, if you used left or outer they would happen to be the same as each other in this case, because either would cause the possible match set to be based on either -N2 or +N2 and you'd get the 49 rows from all pairwise matches of +N2 plus the other rows for -N2 (but with not additional matches for them, and NaN values for the columns that would come from right in that case).

From the sound of it, you might have expected the 7 rows of +N2 in right to just be attached to left with NaN filled in for the other rows. To achieve this, you should use concatenation not merge-based joining.

Here's an example with similar toy data using pandas.concat (you can just remove the extra MergeThis column when you're done),

In [42]: right
Out[42]:
   C  D MergeThis
0  1  2       +N2
1  3  4       +N2
2  5  6       +N2
3  7  8       +N2

In [43]: left
Out[43]:
    A   B MergeThis
0   1   2       -N2
1   3   4       -N2
2   5   6       +N2
3   7   8       +N2
4   9  10       +N2
5  11  12       +N2

In [44]: right.index = left[left.MergeThis == '+N2'].index.values

In [45]: right
Out[45]:
   C  D MergeThis
2  1  2       +N2
3  3  4       +N2
4  5  6       +N2
5  7  8       +N2

In [46]: pandas.concat([left, right], axis=1)
Out[46]:
    A   B MergeThis    C    D MergeThis
0   1   2       -N2  NaN  NaN       NaN
1   3   4       -N2  NaN  NaN       NaN
2   5   6       +N2  1.0  2.0       +N2
3   7   8       +N2  3.0  4.0       +N2
4   9  10       +N2  5.0  6.0       +N2
5  11  12       +N2  7.0  8.0       +N2
ely
  • 74,674
  • 34
  • 147
  • 228
  • Yes, I think this is the solution. My plan was to us concat if I could not figure it out, and it seems I was right. A followup question to this is that the "merge" command in the documentation appears to create the desired output (or something similar). Is this because the "key" column contains a unique value for each index? [K0, K1, K2]. – Matt P. Dec 06 '18 at 20:12