Merge function generates duplicates

Question

This question is just to understand why this would happen.

I'm merging two databases:

bot.rep.geo <- merge(x = bot.rep, y = geo.2016, by = "cod.geo", all.x = TRUE)

The original databases have the following dimensions: bot.rep has 1634451 observations, geo.2016 has 1393.

After merging using all.x = TRUE, the new database emerges with 1727681, instead of the same size as bot.rep.

Why does this happen?

After a quick review, I realised it was creating some duplicates, but I don't understand the reason and if I'm doing something wrong while using the merge function.

By choosing `all.x = TRUE` you are doing the equivalent of a `LEFT JOIN` in SQL. This does not mean that a given row from `bot.rep` will appear only once. Instead, it will appear once for each match in `cod.geo`, and each row in `bot.rep` is guaranteed to appear even if does not match to anything. — Tim Biegeleisen, Jun 17 '16 at 06:27
Thank you @TimBiegeleisen for the clarification. Is there any way to prevent this or just need to remove duplicates afterwards? — Rafael Sierra, Jun 17 '16 at 06:29
@RafaelSierra do a `unique(merge(x = bot.rep, y = geo.2016, by = "cod.geo", all.x = TRUE))`... it's the same as doing it afterwards... merge function doesn't have a unique parameter — Gaurav, Jun 17 '16 at 06:33
You guys are right. I hadn't realised I had and issue with a particular variable that I was creating as a concatenate from 3 others. The variable was 1 1 11, but there is a case were a 1 11 1 can happen and I hadn't noticed it before. Thank you all for your help! — Rafael Sierra, Jun 17 '16 at 06:37

score 1 · Accepted Answer · answered Jun 17 '16 at 06:29

1

There may be lines in the geo.2016 table where the cod.geo value appears twice or more.

if you have a bot.rep value of "X" in your bot.rep data, then 2 lines which contain "X" in the geo.2016 data, the merge will duplicate the line in bot.rep and join the 2 lines from geo.2016.

answered Jun 17 '16 at 06:29

user1923975

1,389
4
13
29

Thank you! I see now why this is happening. – Rafael Sierra Jun 17 '16 at 06:33

score 1 · Answer 2 · edited May 23 '17 at 12:08

1

This happens because of one-to-many relationship, x has multiple rows matching in y.

See example below, where bot.rep cod.geo value of 1 has 2 matches in geo.2016 dataset. Hence, we have 2 rows for 1 id. Also, notice we are creating NA rows for non matched ids because of all.x = TRUE argument.

Now, you need to decide which row is a duplicate for cod.geo value 1.

#dummy data
bot.rep <- data.frame(cod.geo = 1:4)
geo.2016 <- data.frame(cod.geo = c(1,1,3,5,6), z = 1:5)

bot.rep.geo <- merge(x = bot.rep, y = geo.2016,
                     by = "cod.geo", all.x = TRUE)

#   cod.geo  z
# 1       1  1
# 2       1  2
# 3       2 NA
# 4       3  3
# 5       4 NA

You will find more info on different types of merge functions here.

edited May 23 '17 at 12:08

Community

1
1

answered Jun 17 '16 at 06:41

zx8754

52,746
12
114
209

1

@akrun I don't think so, as it is asking the reason why this happens, not how to merge. If you feel it is a dupe feel free to vote to close, downvote, flag, etc. – zx8754 Jun 17 '16 at 06:48
I was just curious, – akrun Jun 17 '16 at 06:52

Merge function generates duplicates

2 Answers2