0

I have observations disappear after I merge two dataframes.

I have two dataframes that look like this:

df_1 
       text      user
bla bla bla     user1
ga ga ga ga     user1
bur bur bur     user2
     .            .
df_2
       user       url
      user1   asd.com
      user2   dsa.com
     .            .  

I use the unique list of users from the first one to web scrape data on them and construct the second one. I would like to merge them so they look like this:

df_merged
       text      user       url
bla bla bla     user1   asd.com
ga ga ga ga     user1   asd.com
bur bur bur     user2   dsa.com
     .            .        .

I merge them by using:

df_merged = df_1.merge(df_2, on = 'user', validate = "m:m")

The problem is that after the merge observations disappear randomly, for example:

len(df_1['user'].drop_duplicates())

returns 11115

len(df_2['user'])

returns 11115

len(df_merged['user'].drop_duplicates())

returns 7076

df_1 contains about 70k observations, while df_merged contains about 30k

Does anyone know what's going on?

1 Answers1

1

Consider amending your join to retain all of those within both data frames by using an "outer", "left" or "right" argument. This is similar to the SQL join commands if that helps.

For example, if you amend your code to the following:

df_merged = df_1.merge(df_2, how='outer',on = 'user', validate = "m:m")

When you return your answer, you will then likely see that for some of the merged operations, there will be corresponding blank values across the row.

These should be interrogated to determine how best to proceed with your data. To locate these rows with blank values, you can look at the following post in stackoverflow: How to select rows with nan in particular column

For further information on what other arguments you can pass to merge, please look here Merge documentation.

HadynB
  • 81
  • 7