Pandas - Why there are duplicates after join?

Question

I have train with 3756 rows and test with 500 rows, after join I had 798974 rows.

code for join:

test.join(train.set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')

Use of drop duplicates is works, but required a lot of time and memory.

hmmm, `test.join(train.drop_duplicates('link_1').set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')` required a lot of time and memory ? — jezrael, Apr 19 '22 at 10:02
I mean drop_duplicates after join operation, because join create a df with 798974 rows — Dmitry Sokolov, Apr 19 '22 at 10:03
Because that's how it's defined. [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) PS [mre] PS Please clarify via edits, not comments. — philipxy, Apr 19 '22 at 10:04
When you get a result you don't expect, pause your overall goal, chop to the 1st subexpression with unexpected result & say what you expected & why, justified by documentation. (Debugging fundamental.) — philipxy, Apr 19 '22 at 10:13

score 1 · Accepted Answer · answered Apr 19 '22 at 10:06

Reason is duplicated values of column link_1 in test and train, so for each duplicated values get all combinations between:

train = pd.DataFrame({"link_1": [0, 0, 0, 0, 1, 1, 1, 1],
                      'claps_link_1_mean': range(8)})
test = pd.DataFrame({"link_1": [0, 1, 1, 1]})

df = test.join(train.set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')
print (df)
   link_1  claps_link_1_mean
0       0                  0
0       0                  1
0       0                  2
0       0                  3
1       1                  4
1       1                  5
1       1                  6
1       1                  7
2       1                  4
2       1                  5
2       1                  6
2       1                  7
3       1                  4
3       1                  5
3       1                  6
3       1                  7

If remove duplicates in one of them before join all working well:

test.join(train.drop_duplicates('link_1').set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')

Pandas - Why there are duplicates after join?

1 Answers1