0

I'm trying to merge two data frames with different dimensions, but just can't seem to get the result I'm after. Here's an example of what I'm trying to achieve:

Data frame A looks like this:

id        value
A         X
A         Y
A         Z
B         Y
C         X
C         Z

Data frame B looks like this:

id        value
A         U
A         W
B         U
B         W
B         V
C         V

And the data frame I'm trying to create looks like:

id        value.A        value.B
A         X              U
A         Y              W
A         Z              NA
B         Y              U
B         NA             W
B         NA             V
C         X              V
C         Z              NA

I've been trying to merge the data frames with the base merge command, but keep getting repeated values instead of NA's. It also seems to expand the resulting data frame to allow for unique combinations of value.A and value.B, which I'm also not interested in. Finally, not all ID's in data frame B have a matching ID in data frame A, in which case I would like value.B to be NA for that ID.

Is there a way to achieve what I'm looking for with merge or am I looking for another (set of) command(s)?

Edit:

Just to clarify, I did try the various combinations of joins, but still haven't found a solution.

Example:

A <- data.frame(id = c("A", "A", "A", "B", "C", "C"),
                value = c("X", "Y", "Z", "Y", "X", "Z"))

B <- data.frame(id = c("A", "A", "B", "B", "B", "C"),
                value = c("U", "W", "U", "W", "V", "V"))

merge(A, B, by="id", all=TRUE)

Produces:

id    value.x    value.y
A     X          U
A     X          W
A     Y          U
A     Y          W
A     Z          U
A     Z          W
B     Y          U
B     Y          W
B     Y          V
C     X          V
C     Z          V

That is to say, a much larger data frame than what I'm looking for as the values are repeated for every unique pair of values in data frame A and B.

Edit 2: Looks like I simplified my initial example too much, so just to clarify why the order of my values matters, I'll have to expand the original example a bit:

Data frame A looks like this:

id        value        rank
A         X            1
A         Y            0.5
A         Z            0.2
B         Y            1
C         X            1
C         Z            0.8

Data frame B looks like this:

id        value        rank
A         U            1
A         W            0.6
B         U            1
B         W            0.2
B         V            0.1
C         V            1

And the intended data frame looks like this:

id        value.A        rank.A        value.B        rank.B
A         X              1             U              1
A         Y              0.5           W              0.6
A         Z              0.2           NA             NA
B         Y              1             U              1
B         NA             NA            W              0.2
B         NA             NA            V              0.1
C         X              1             V              1
C         Z              0.8           NA             NA
ageil
  • 171
  • 1
  • 3
  • 16
  • I think this is not a really data frame problem. Because the order of U and W corresponding to X, Y and Z is not defined, it can actually be random. Maybe using list is better. – Psidom Jun 25 '16 at 21:16
  • 2
    I am thinking of it as a fancy version of `cbind` where the order of the two data sets are correct, but one is shorter than the other. The goal is essentially to cbind by group where the number of observations by group does not necessarily line up. Is this true @ageil? – lmo Jun 25 '16 at 21:19
  • @Psidom Yes, I suppose you're right, the order of my values is actually random here. In my real dataset, however, the values are actually tied to a third numeric variable ranking X over Y for A in data frame A. I should probably revise the question again to reflect that, thanks. – ageil Jun 25 '16 at 21:22
  • And yes @lmo, I think you're right about it essentially being an advanced version of cbind. – ageil Jun 25 '16 at 21:25
  • 1
    I came up with a relatively compact code that does the job, but it seems answers are disabled. Has the question been closed on account of it being a duplicate? I do not really think this fits the merge category, therefore it is not a duplicate as pointed out above. Is it possible to reopen the question? – thepule Jun 25 '16 at 22:49
  • I'm still looking for a solution, so I'd love to see your idea @thepule! I didn't lock the thread, but I guess I could open another if the mods still insist on this being a duplicate question? – ageil Jun 25 '16 at 22:54
  • Maybe try to flag it for moderation first explaining why it is not a duplicate? I am new to StackOverflow so I am not sure what the correct procedure is. – thepule Jun 25 '16 at 23:02
  • I flagged it for further moderation...let's see what happens. – thepule Jun 25 '16 at 23:20
  • Yeah, I just did the same. It's getting late here though, so I'll check back for your solution tomorrow morning. – ageil Jun 25 '16 at 23:26
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/115646/discussion-between-thepule-and-ageil). – thepule Jun 26 '16 at 15:10

2 Answers2

2

Here is my fancy cbind. I create secondary IDs that assume the data is ordered in with corresponding rows and df2 may end before df for each id:

# add secondary IDs
df$idCnt <- rep_len(1L, length=nrow(df))
df$idCnt <- ave(df$idCnt, df$id, FUN=cumsum)

df2$idCnt <- rep_len(1L, length=nrow(df2))
df2$idCnt <- ave(df2$idCnt, df2$id, FUN=cumsum)

merge(df, df2, by=c("id", "idCnt"), all=TRUE)
id idCnt value.x value.y
1  A     1       X       U
2  A     2       Y       W
3  A     3       Z    <NA>
4  B     1       Y       U
5  B     2    <NA>       W
6  B     3    <NA>       V
7  C     1       X       V
8  C     2       Z    <NA>

Here is the data:

**data**
df <- read.table(header=T, text="id        value
A         X
A         Y
A         Z
B         Y
C         X
C         Z", stringsAsFactors=F)

df2 <- read.table(header=T, text="id        value
A         U
A         W
B         U
B         W
B         V
C         V", stringsAsFactors=F)
lmo
  • 37,904
  • 9
  • 56
  • 69
0

You want the 'all' argument in merge function.

merge(a, b, all = TRUE)
mkearney
  • 1,266
  • 10
  • 18
  • I edited the question to clarify the problem. Unfortunately, `all = TRUE` doesn't seem to produce quite what I'm looking for. – ageil Jun 25 '16 at 21:02