0

I'm working with python pandas now. Here is a problem I'm experiencing. There's a dataset called master, and its length comes with like this:

print(len(master))
120000

And then I try to left-outer-join this with another dataset called click:

master_active=pd.merge(master, click, how='left', on='user_id')
print(len(master_active))
120799

I don't know why the number changes from 120000 to 120799 because the merge must happen based on the dataset master.

Appreciate any single idea to solve this problem, Thanks!

user9191983
  • 505
  • 1
  • 4
  • 20

1 Answers1

1

Your merge only guarantees the result will have len(master.index) as a minimum number of rows. As @Wen mentioned, you will have more rows if click has more than one match on joining columns.

This example should clarify the behaviour:

df1 = pd.DataFrame([['a', 1, 2], ['b', 2, 3], ['c', 4, 5]], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([['a', 6, 7], ['a', 8, 9]], columns=['A', 'D', 'E'])

pd.merge(df1, df2, how='left')

#    A  B  C    D    E
# 0  a  1  2  6.0  7.0
# 1  a  1  2  8.0  9.0
# 2  b  2  3  NaN  NaN
# 3  c  4  5  NaN  NaN
jpp
  • 159,742
  • 34
  • 281
  • 339
  • Thanks much for your comment. Then wondering what should i do if i want to merge them based on their common column - "user_id" here? what i wanna figure out is how many people click the ad among the users listed on the master dataset. Thanks – user9191983 Feb 01 '18 at 03:04
  • No problem. It's always useful to visualise what's happening, even 3 line examples can help. Feel free to accept the answer (tick on left) if it answers your question. – jpp Feb 01 '18 at 03:07
  • well thing doesnt seem changed even after following your advice...thanks anyways – user9191983 Feb 01 '18 at 03:10
  • @user9191983, what are you trying to achieve? it sounded like an explanation for why your rows increase, but is there some other behaviour you want to force? feel free to update your question with specifics. – jpp Feb 01 '18 at 03:14
  • re: "what i wanna figure out is how many people click the ad among the users listed on the master dataset." can you give some sample input & output data. it doesn't need to be more than a few lines, but will help us understand what you want. – jpp Feb 01 '18 at 03:21
  • sorry for my poor explanation...just like what i told you, i only need click information of user_id on master, so if some user_id on click are not listed on master, they should be skipped out. meaning, even after merging happens, its length has to be 120000 because i'm trying to pick up information only for 120000 user_ids on master data – user9191983 Feb 01 '18 at 03:21
  • how about this: `click2 = click[click['user_id'].isin(set(master['user_id']))]`. `click2` should now be filtered to contain only users from `master`. – jpp Feb 01 '18 at 03:24
  • alternatively, do an 'inner' merge for user_ids common to both dataframes. – jpp Feb 01 '18 at 03:26
  • Thanks for the alternatives! but if you do that you lose all other user_id who are not in common with click...wanna keep all those 120000 user_id on master. so it doesn work...thats why i was trying outer join – user9191983 Feb 01 '18 at 03:27
  • ofcourse inner join doesn't work either because you cannot keep all the user_ids on master... – user9191983 Feb 01 '18 at 03:29