2

This question is just to understand why this would happen.

I'm merging two databases:

bot.rep.geo <- merge(x = bot.rep, y = geo.2016, by = "cod.geo", all.x = TRUE)

The original databases have the following dimensions: bot.rep has 1634451 observations, geo.2016 has 1393.

After merging using all.x = TRUE, the new database emerges with 1727681, instead of the same size as bot.rep.

Why does this happen?

After a quick review, I realised it was creating some duplicates, but I don't understand the reason and if I'm doing something wrong while using the merge function.

zx8754
  • 52,746
  • 12
  • 114
  • 209
Rafael Sierra
  • 107
  • 1
  • 2
  • 11
  • 3
    By choosing `all.x = TRUE` you are doing the equivalent of a `LEFT JOIN` in SQL. This does not mean that a given row from `bot.rep` will appear only once. Instead, it will appear once for each match in `cod.geo`, and each row in `bot.rep` is guaranteed to appear even if does not match to anything. – Tim Biegeleisen Jun 17 '16 at 06:27
  • Show us some sample data and this will be clear. – Tim Biegeleisen Jun 17 '16 at 06:29
  • Thank you @TimBiegeleisen for the clarification. Is there any way to prevent this or just need to remove duplicates afterwards? – Rafael Sierra Jun 17 '16 at 06:29
  • 1
    @RafaelSierra do a `unique(merge(x = bot.rep, y = geo.2016, by = "cod.geo", all.x = TRUE))`... it's the same as doing it afterwards... merge function doesn't have a unique parameter – Gaurav Jun 17 '16 at 06:33
  • You guys are right. I hadn't realised I had and issue with a particular variable that I was creating as a concatenate from 3 others. The variable was 1 1 11, but there is a case were a 1 11 1 can happen and I hadn't noticed it before. Thank you all for your help! – Rafael Sierra Jun 17 '16 at 06:37

2 Answers2

1

There may be lines in the geo.2016 table where the cod.geo value appears twice or more.

if you have a bot.rep value of "X" in your bot.rep data, then 2 lines which contain "X" in the geo.2016 data, the merge will duplicate the line in bot.rep and join the 2 lines from geo.2016.

user1923975
  • 1,389
  • 4
  • 13
  • 29
1

This happens because of one-to-many relationship, x has multiple rows matching in y.

See example below, where bot.rep cod.geo value of 1 has 2 matches in geo.2016 dataset. Hence, we have 2 rows for 1 id. Also, notice we are creating NA rows for non matched ids because of all.x = TRUE argument.

Now, you need to decide which row is a duplicate for cod.geo value 1.

#dummy data
bot.rep <- data.frame(cod.geo = 1:4)
geo.2016 <- data.frame(cod.geo = c(1,1,3,5,6), z = 1:5)

bot.rep.geo <- merge(x = bot.rep, y = geo.2016,
                     by = "cod.geo", all.x = TRUE)

#   cod.geo  z
# 1       1  1
# 2       1  2
# 3       2 NA
# 4       3  3
# 5       4 NA

You will find more info on different types of merge functions here.

Community
  • 1
  • 1
zx8754
  • 52,746
  • 12
  • 114
  • 209
  • 1
    @akrun I don't think so, as it is asking the reason why this happens, not how to merge. If you feel it is a dupe feel free to vote to close, downvote, flag, etc. – zx8754 Jun 17 '16 at 06:48
  • I was just curious, – akrun Jun 17 '16 at 06:52