1

I have 3 .csv files that I need to analyse in R. File one contains columns with user id and signupdate. File two contains columns with user id, purchase date and amount of purchases. File three contains columns with user id, message date and number of messages.

Please note that the order of the user id is not the same in each of the three files, thus cop.

Would love some help merging these files so that the large dataset has order user id, signupdate, purchase date, amount of purchases, message date and number of messages. Can't seem to find code to do this in R Thanks in advance

MrFlick
  • 195,160
  • 17
  • 277
  • 295
REnthusiast
  • 1,591
  • 3
  • 16
  • 18
  • I'm assuming you know how to read them in separately. It would be a good idea to post some [sample data](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so it would be easier to offer specific suggestions. – MrFlick Aug 21 '14 at 02:30
  • @MatthewLundberg: Thanks for correcting. One might do: df=merge(df1,df2); df=merge(df,df3) – rnso Aug 21 '14 at 03:09
  • @rnso Of course. That's what `Reduce` will do. – Matthew Lundberg Aug 21 '14 at 03:29

1 Answers1

3

While merge doesn't take three arguments, Reduce is made for the task of iterating over a list and passing pairs to a function. Here's an example of a three-way merge:

d1 <- data.frame(id=letters[1:3], x=2:4)
d2 <- data.frame(id=letters[3:1], y=5:7)
d3 <- data.frame(id=c('b', 'c', 'a'), z=c(5,6,8))
Reduce(merge, list(d1, d2, d3))
##   id x y z
## 1  a 2 7 8
## 2  b 3 6 5
## 3  c 4 5 6

Note that the order of the column id is not the same, but the values are matched.

In the case where you have non-matching data and want to keep all possible rows, you need an outer join, by supplying all=TRUE to merge. As Reduce does not have a way to pass additional arguments to the function, a new function must be created to call merge:

d1 <- data.frame(id=letters[1:3], x=2:4)
d2 <- data.frame(id=letters[3:1], y=5:7)
d3 <- data.frame(id=c('b', 'c', 'd'), z=c(5,6,8))
Reduce(function(x,y) merge(x,y,all=TRUE), list(d1, d2, d3))
##   id  x  y  z
## 1  a  2  7 NA
## 2  b  3  6  5
## 3  c  4  5  6
## 4  d NA NA  8

NA is used to indicate data in non-matched rows.

Matthew Lundberg
  • 42,009
  • 6
  • 90
  • 112
  • Hi, I just played with your sample and have a question. I changed one of the ids in d3 from a to d and ran the script. I did not see rows for id a and d in the outcome. If you want to preserve all ids in the outcome, what could we do? – jazzurro Aug 21 '14 at 02:43