Increasing number of rows after merging 2 csv files with same number of rows

Question

I'm just starting in R, and this probably is really simple, but here we go: I'm merging 2 csv files with the same number of rows by a common column "IID". The number of rows for each of the csv files is 566. When I perform the merge, this amounts to 568. Why is that? Is there anything being double counted?

data1<-read.csv("lala1.csv", head=T,sep=",")
data2<-read.csv("lala2.csv", head=T,sep=",")
dataMerge<-merge(data1,data2,by="IID")
write.csv(dataMerge, "lala_All_.csv")

data1

. FID IID T1 1 1 26014 0.004755930 2 1 31935 0.006217750 3 2 27841 0.005631550 4 2 32400 0.004967280 5 3 28814 0.005395390 6 3 33276 0.005683260 7 4 29445 0.005079630 8 5 28593 0.004365470 9 5 29851 0.005743720 10 6 29852 0.004015790 11 6 31820 0.005455630 12 7 33386 0.005741080 13 7 37065 0.006015120 14 8 33475 0.004334330 15 8 38487 0.004427610 16 9 34157 0.004991320 17 9 39281 0.005380730 18 10 36963 0.005759670 19 10 39673 0.003827710 20 11 36964 0.006559190

data2

. FID IID T2 1 1 26014 0.009336160 2 1 31935 0.011135400 3 2 27841 0.010364200 4 2 32400 0.009077190 5 3 28814 0.009483640 6 3 33276 0.009537670 7 4 29445 0.009054180 8 5 28593 0.008029250 9 5 29851 0.010172700 10 6 29852 0.006687910 11 6 31820 0.009435050 12 7 33386 0.010146600 13 7 37065 0.010533700 14 8 33475 0.007430050 15 8 38487 0.008475400 16 9 34157 0.008564570 17 9 39281 0.009678990 18 10 36963 0.010390400 19 10 39673 0.006431420 20 11 36964 0.011702000

If there is an `"IID"` **value** in one but not the other, then a new row will be created. — juan, Jul 19 '17 at 17:18
You likely have duplicated or missing IID in one or both of the tables. Look at `anyDuplicated(data1$IID)` and `anyDuplicated(data2$IID)` But really this stuff is just guess work without a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — MrFlick, Jul 19 '17 at 17:22

score 1 · Answer 1 · answered Jul 19 '17 at 17:34

1

You probably have a problem with mismatched IIDD. Consider these examples.

df1 <- data.frame(IIDD = c("A", "B", "c", "D"), trait = c('x', 'x', 'x', 'x'))
df2 <- data.frame(IIDD = c("A", "B", "c", "E"), sales = c(400, 200, 700, 500))
df3 <- data.frame(IIDD = c("A", "B", "c", "C"), sales = c(400, 200, 700, 500))

Use the all.equal and unique functions to test.

all.equal(unique(df1$IIDD), unique(df2$IIDD))
all.equal(unique(df1$IIDD), unique(df3$IIDD))

answered Jul 19 '17 at 17:34

pyll

1,688
1
26
44

I tried it and tested TRUE. – Adri Jul 19 '17 at 17:39
any variables besides `IIDD` that the data frames have in common? Or is that the only one present in each? – pyll Jul 19 '17 at 17:40
FID is another common variable – Adri Jul 19 '17 at 17:46
run the same test with `FID` instead of `IIDD` – pyll Jul 19 '17 at 17:47
tried it and also TRUE :/ – Adri Jul 19 '17 at 17:50
1

would really be helpful if you could post the data...ever use `dput()`? – pyll Jul 19 '17 at 17:50
If `FID` is in both data frames, then you are probably getting `FID.x` and `FID.y` also right? – pyll Jul 19 '17 at 18:09
Yes, I'm getting FID.x and FID.y . I have not used dput() before :( – Adri Jul 19 '17 at 18:12
I posted some of the data – Adri Jul 19 '17 at 18:24

score 1 · Answer 2 · answered Jul 19 '17 at 17:54

There seems to be some issue with merging and matching, why not use a join that will only include the exact values from whatever dataset you are looking to match to. Without a reproducible data set, it makes this difficult though. Using the same input as @pyll

library(dplyr)
df <- df1 %>% 
  left_join(df2, "IIDD")
df[is.na(df$sales),]

You can identify the exact row number of NA matches, and possibly identify your problem

Increasing number of rows after merging 2 csv files with same number of rows

data1

data2

2 Answers2