Why is the result of merge() larger than the original data?

Question

For example, x is a data frame with two variables (Time and X; its length is len1), and y is a data frame with two variables (Time and Y; its length is len2). I just want to merge x and y, using the following code:

> x
                 Time    Value
1 2013-11-03 00:00:11 535.7680
2 2013-11-03 00:00:26 548.6214
3 2013-11-03 00:00:41 543.6477
4 2013-11-03 00:00:56 554.0778
5 2013-11-03 00:01:11 566.5635
6 2013-11-03 00:01:26 555.7684
> y
                 Time    Value
1 2013-11-03 00:00:11 455.4087
2 2013-11-03 00:00:26 457.7967
3 2013-11-03 00:00:41 455.3263
4 2013-11-03 00:00:56 461.9727
5 2013-11-03 00:01:11 460.6974
6 2013-11-03 00:01:26 466.2654

> res <- merge(x, y, by="Time")
> res
                 Time  Value.x  Value.y
1 2013-11-03 00:00:11 535.7680 455.4087
2 2013-11-03 00:00:26 548.6214 457.7967
3 2013-11-03 00:00:41 543.6477 455.3263
4 2013-11-03 00:00:56 554.0778 461.9727
5 2013-11-03 00:01:11 566.5635 460.6974
6 2013-11-03 00:01:26 555.7684 466.2654

The output above shows only the head of x and y.

Why does res have more rows than both len1 and len2?

I just want to merge x and y on matching "Time" values, so that rows of x and y whose "Time" has no match in the other are dropped.

Provide [reproducible data](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Tried with dummy data, and `nrow(res)` is smaller than `nrow(x)` and `nrow(y)`. — zx8754, Apr 10 '15 at 07:44
Compare, for both x and y, `length(x$Time)` and `length(unique(x$Time))` - then you'll see that maybe some Times are duplicated, explaining the larger nrow of your resulting dataframe. — Jason V, Apr 10 '15 at 08:40
I don't think I can answer that for you... if the whole row is duplicated, then I don't see why not, but if you have different values on the other variable(s), then you have to figure out if you want to keep everything, and if not, which one to erase! — Jason V, Apr 10 '15 at 08:49
But if you choose to erase rows, just `x <- x[-row.indexes,]` — Jason V, Apr 10 '15 at 08:55
To compare other variables for those duplicate Times, you can use `ind.to.delete <- which(duplicated(x[,1]));comparisons <- sort(append(ind.to.delete, ind.to.delete-1));x[comparisons,]` — Jason V, Apr 10 '15 at 09:18
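The checks suggested in these comments can be sketched as follows. The data frame `x` below is made-up illustration data (the real data was never posted), with one timestamp deliberately repeated and a hypothetical second value for it. Unlike the `ind.to.delete-1` approach, this sketch does not assume that duplicates sit on adjacent rows. Keeping the first row per Time is only one possible policy, chosen here as an assumption.

```r
# Hypothetical data: the timestamp 00:00:26 appears twice
x <- data.frame(
  Time  = c("2013-11-03 00:00:11", "2013-11-03 00:00:26",
            "2013-11-03 00:00:26", "2013-11-03 00:00:41"),
  Value = c(535.7680, 548.6214, 548.9000, 543.6477),
  stringsAsFactors = FALSE
)

# Fewer unique values than rows means some Time values repeat
n_rows   <- length(x$Time)
n_unique <- length(unique(x$Time))

# Every row whose Time occurs more than once, so the other columns can be compared
dup_rows <- x[x$Time %in% x$Time[duplicated(x$Time)], ]
print(dup_rows)

# Assumed policy: keep only the first row for each Time before merging
x_dedup <- x[!duplicated(x$Time), ]
```

If the duplicated rows differ in their other columns, as they do here, deciding which one to keep is a data question rather than something `merge` can settle.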

score 0 · Answer 1 · answered Apr 10 '15 at 07:52

From the help page of merge:

The rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each.

Without a reproducible example I can't say for sure, but it is likely that your Time column contains duplicated values. See for instance the following example:

A <- data.frame(a=c(1,2,3,1),b=1:4)
B <- data.frame(a=c(1,2,3,1),c=1:4)
merge(A,B,by="a")
  a b c
1 1 1 1
2 1 1 4
3 1 4 1
4 1 4 4
5 2 2 2
6 3 3 3
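The six rows in this output can be accounted for by counting key occurrences: a key that appears m times in one data frame and n times in the other contributes m * n rows. A minimal sketch of that arithmetic, reusing `A` and `B` from the example (the names `nA`, `nB`, `common` and `expected` are just illustrative):

```r
# Same data frames as in the example above
A <- data.frame(a = c(1, 2, 3, 1), b = 1:4)
B <- data.frame(a = c(1, 2, 3, 1), c = 1:4)

# Occurrences of each key value on both sides
nA <- table(A$a)
nB <- table(B$a)

# Keys present in both; each contributes (count in A) * (count in B) rows
common   <- intersect(names(nA), names(nB))
expected <- sum(as.numeric(nA[common]) * as.numeric(nB[common]))

res <- merge(A, B, by = "a")

# Compare the predicted and the actual number of rows
print(c(expected = expected, actual = nrow(res)))
```

Here the key 1 occurs twice on each side and yields 2 * 2 = 4 rows, while keys 2 and 3 yield one row each. Running the same count on the real `Time` columns should show whether duplicated timestamps explain the extra rows.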

plannapus

I'm new to Stack Overflow. How can I upload the data? In the data, the Time values are all different, so maybe that is not the problem – Cheng Apr 10 '15 at 08:09
See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example . – plannapus Apr 10 '15 at 08:10
The simplest would be to paste the result of `dput(x)` and `dput(y)` into your question. – plannapus Apr 10 '15 at 08:11
The data is too big, so I can't paste it – Cheng Apr 10 '15 at 08:19
Maybe just `head(x)` and `head(y)` would be a start... – Jason V Apr 10 '15 at 08:23
I did it this way and the result is correct for this sample, but with the full data it is wrong – Cheng Apr 10 '15 at 08:25
Can you add to your question the outputs of `head(x)` and `head(y)`? – Jason V Apr 10 '15 at 08:32
