I have a train DataFrame with 3756 rows and a test DataFrame with 500 rows. After the join, the result had 798974 rows.

Code for the join:

test.join(train.set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')

Calling drop_duplicates on the joined result works, but it takes a lot of time and memory.

Dmitry Sokolov
  • hmmm, `test.join(train.drop_duplicates('link_1').set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')` required a lot of time and memory ? – jezrael Apr 19 '22 at 10:02
  • I mean drop_duplicates after the join operation, because the join creates a df with 798974 rows – Dmitry Sokolov Apr 19 '22 at 10:03
  • I'm interested in why the join creates a df with duplicates – Dmitry Sokolov Apr 19 '22 at 10:04
  • Because that's how it's defined. [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) PS [mre] PS Please clarify via edits, not comments. – philipxy Apr 19 '22 at 10:04
  • When you get a result you don't expect, pause your overall goal, chop to the 1st subexpression with unexpected result & say what you expected & why, justified by documentation. (Debugging fundamental.) – philipxy Apr 19 '22 at 10:13

1 Answer

The reason is that column link_1 contains duplicated values in both test and train. For each duplicated key, the join returns every combination of the matching test rows and train rows:

import pandas as pd

# Each key appears 4 times in train; key 1 appears 3 times in test.
train = pd.DataFrame({"link_1": [0, 0, 0, 0, 1, 1, 1, 1],
                      'claps_link_1_mean': range(8)})
test = pd.DataFrame({"link_1": [0, 1, 1, 1]})

df = test.join(train.set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')
print(df)
   link_1  claps_link_1_mean
0       0                  0
0       0                  1
0       0                  2
0       0                  3
1       1                  4
1       1                  5
1       1                  6
1       1                  7
2       1                  4
2       1                  5
2       1                  6
2       1                  7
3       1                  4
3       1                  5
3       1                  6
3       1                  7
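
To check in advance how many rows a join like this will produce, you can count the matches per key. This is only a rough sketch on the sample data above; the names `train_counts`, `per_row_matches` and `expected_rows` are made up for illustration, and it assumes there are no missing values in link_1:

```python
import pandas as pd

train = pd.DataFrame({"link_1": [0, 0, 0, 0, 1, 1, 1, 1],
                      "claps_link_1_mean": range(8)})
test = pd.DataFrame({"link_1": [0, 1, 1, 1]})

# How many train rows share each link_1 value.
train_counts = train["link_1"].value_counts()

# For every test row, the number of train rows it will be paired with.
# A key that is absent from train still yields one row in a left join.
per_row_matches = test["link_1"].map(train_counts).fillna(1)
expected_rows = int(per_row_matches.sum())

joined = test.join(train.set_index("link_1")["claps_link_1_mean"],
                   on="link_1", how="left")

print(expected_rows, len(joined))
```

Running the same count on the real train and test should show which link_1 values are responsible for most of the 798974 rows.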

If you remove the duplicates from one of them before the join, it works as expected and the result has the same number of rows as test:

test.join(train.drop_duplicates('link_1').set_index('link_1')['claps_link_1_mean'], on='link_1', how='left')
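
If you want pandas to fail early instead of silently multiplying rows, one possible option is `merge` with `validate="many_to_one"`, which raises `MergeError` when the right side has duplicated keys. A minimal sketch on the sample data, assuming claps_link_1_mean is the same for every row of a given link_1 (`drop_duplicates` keeps the first row per key, so otherwise you would have to decide which value to keep); `lookup`, `raised` and `safe` are illustrative names:

```python
import pandas as pd
from pandas.errors import MergeError

train = pd.DataFrame({"link_1": [0, 0, 0, 0, 1, 1, 1, 1],
                      "claps_link_1_mean": range(8)})
test = pd.DataFrame({"link_1": [0, 1, 1, 1]})

# With duplicated keys on the right side, validation raises
# instead of returning an exploded result.
try:
    test.merge(train, on="link_1", how="left", validate="many_to_one")
    raised = False
except MergeError:
    raised = True

# Deduplicate the lookup side first (keeps the first row per key),
# then the same merge passes validation and keeps test's row count.
lookup = train.drop_duplicates("link_1")
safe = test.merge(lookup, on="link_1", how="left", validate="many_to_one")

print(raised, len(safe) == len(test))
```

Deduplicating train before the join is also much cheaper than calling drop_duplicates on the joined result, because the large intermediate frame is never built.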
jezrael