Why am I getting more rows after applying the merge(x,y,all.x=T) function in r?

Question

I do have two data sets: data2 and data 3.

The relevant information of the data3 should be added to the respective rows of data2 and the commmon columns in both set are "Inschrijfjaar" and "Leeftijd".

I am using the code:

data4=merge(x=data2,y=data3, by=c("Inschrijfjaar", "Leeftijd"),all.x=TRUE)

A check up gives me:

dim(data2)
525380    5

dim(data3)
1707      7

dim(data4)
5307668   10

So the merge is not done correctly, the dimension of data4 should also be 525380, because it is a left join. So I am getting ways more rows then the left data set. What could be the cause?

I also tried the code:

data4=merge(x=data2,y=data3,all.x=TRUE)

score 0 · Answer 1 · answered Apr 23 '21 at 09:33

Sorry, I cannot comment, this is not a full answer, but meant to be a comment: There are many differnt forms of joins.

I find them well explained here.

You do a left join, which returns all rows from the left table, and any rows with matching keys from the right table. So you would expect more values in your data4.

What you you actually want seems to be a left semi-join "A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y" (from this question which may help you answer your question).

I am still trying to figure out how to apply the left_join (), function. The error I got, is that the function could not be found, although I installed R.4.0.5 and installed the dplyr packages and used the dplyr library. I also used the links, but it is still difficult for me. If you have some more suggestions they are welcome. — Clarisha, Apr 24 '21 at 13:55

score 0 · Answer 2 · answered Apr 23 '21 at 09:42

0

This behaviour occurs when there are multiple rows of data3 that have the same values in your columns c("Inschrijfjaar", "Leeftijd") (and these values appear in data2). Where data in data3 could be merged with multiple records in data2, they will all be included, leading to more records in data4 than in data2

answered Apr 23 '21 at 09:42

Miff

7,486
20
20

both file do not have duplicates – Clarisha Apr 23 '21 at 21:21
This post brings me closer to the right way to do it. Everything that you want to be unique should be put into the the by="ID vars" part. So I put some more values and now the dimension of the merged set is 529399. The dimension of the left set now 525380, so I still have more, It could be that the original data is not clean. Now this is the merge code: data4=merge(x=data2,y=CBBdata, by=c("Inschrijfjaar", "Leeftijd","District","DistrictCode"),all.x=TRUE) – Clarisha Apr 24 '21 at 13:52

Why am I getting more rows after applying the merge(x,y,all.x=T) function in r?

2 Answers2