5

I've been searching for a solution and have been experimenting, but I can't seem to perform what I should be a simple task.

I have two data frames formatted similar to the below toy examples

DF1 = data.frame(A=c("cats","dogs",NA,"dogs"), B=c("kittens","puppies","kittens",NA), C=c(88,99,101,110))

    A       B           C
1   cats    kittens     88
2   dogs    puppies     99
3   NA      kittens     101
4   dogs    NA          110


DF2 = data.frame(D=c(1,2), A=c("cats","dogs"), B=c("kittens","puppies"))

    D   A       B
1   1   cats    kittens
2   2   dogs    puppies

I wish to merge the two data sets such that the output is:

      A     B         C     D
1   cats    kittens   88    1
2   dogs    puppies   99    2
3   dogs    NA        110   2
4     NA    kittens   101   1

In other words, any rows with labels A=="cats" or B=="kittens" will be mapped to 1 in the column D, any rows with A=="dogs" or B=="puppies" will be mapped to 2.

I have used the command

merge(DF1, DF2, by=c("A","B"), all.x=TRUE)

However this not match rows 3 and 4 correctly, only rows 1 and 2. I get the output

      A     B         C     D
1   cats    kittens   88    1
2   dogs    puppies   99    2
3   dogs    NA        110   NA
4     NA    kittens   101   NA

Please note the actual datasets I'm working with are very long. In reality DF1 is over 1,000,000 rows and DF2 is over 300,000 rows thousands of rows each, so a solution that could be scaled is what I really need.

Starcalibre
  • 105
  • 1
  • 6

3 Answers3

3

Perhaps you can try something along these lines:

temp <- merge(DF1, DF2, by=c("A","B"), all.x=TRUE)

within(temp, {
  M1 <- c("cats", "kittens")
  D <- ifelse(A %in% M1 | B %in% M1, 1, 2)
  rm(M1)
})
#      A       B   C D
# 1 cats kittens  88 1
# 2 dogs puppies  99 2
# 3 dogs    <NA> 110 2
# 4 <NA> kittens 101 1

You can nest ifelse statements if you need more than just these two options.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Thanks, this works fine for the toy example but my original datasets are very long (there are thousands of different pairs). Is there a solution that would work for really large dataframes ? – Starcalibre Apr 30 '13 at 07:07
2
DF1[which(DF1$A=="cats"|DF1$B=="kittens"), "D"] <- DF2[which(DF2$A=="cats"|DF2$B=="kittens"), "D"]
DF1[which(DF1$A=="dogs"|DF1$B=="puppies"), "D"] <- DF2[which(DF2$A=="dogs"|DF2$B=="puppies"), "D"]
DF1
#-------
     A       B   C D
1 cats kittens  88 1
2 dogs puppies  99 2
3 <NA> kittens 101 1
4 dogs    <NA> 110 2

Functionalized:

idxpick <- function(a,b) DF1[which(DF1$A==a|DF1$B==b), "D"] <<- # Yes, I feel guilty.
                                   DF2[which(DF2$A==a|DF2$B==b), "D"]
DF1 = data.frame(A=c("cats","dogs",NA,"dogs"), 
                 B=c("kittens","puppies","kittens",NA), 
                 C=c(88,99,101,110))
DF2 = data.frame(D=c(1,2), A=c("cats","dogs"), B=c("kittens","puppies"))
apply(DF2, 1, function(rr) idxpick(rr["A"], rr["B"]) )
#------------
[1] 1 2

DF1
     A       B   C D
1 cats kittens  88 1
2 dogs puppies  99 2
3 <NA> kittens 101 1
4 dogs    <NA> 110 2
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Hey, thanks for the solution. Sorry but I should've been more clear that the datasets I'm working with are very very large so writing a command for every possible pair isn't really feasible. There's about 300,000 different types of pairs. – Starcalibre Apr 30 '13 at 07:20
  • Might be possible to make this into a function that could be `apply`-ed to DF2. It will take a long time. Might need to leave running overnight. – IRTFM Apr 30 '13 at 07:30
  • A rare `<<-` spotted in the wild, neat. – Ben Apr 30 '13 at 07:42
  • I can probably get rid of it with an eval-substitute backflip. I know where to look for some examples but I'm feeling lazy. – IRTFM Apr 30 '13 at 07:52
  • Thanks, that code works great for the example I posted. I'm having a little trouble modifying it to work on another example where there's more than two different pairs. See the example I posted in the edit above. Once I've worked that out, I can mark this as correct. Cheers :) – Starcalibre Apr 30 '13 at 08:22
  • 1
    Well thanks a lot. I solve teh problem that was posed and it works great, but you won't mark it correct until I solve another problem that wasn't posed? – IRTFM Apr 30 '13 at 08:28
  • Sorry, maybe my original question wasn't clear enough. I'll mark your answer as correct as leave it for now. – Starcalibre Apr 30 '13 at 08:31
2

Here's a different approach:

library(functional)

partial.merge <- function(DF1, DF2) {
  common.cols <- intersect(names(DF1), names(DF2))
  result.col <- names(DF2)[!(names(DF2) %in% common.cols)]

  # This can only handle one result column:
  stopifnot(length(result.col) == 1)

  # Merge in each common column, one at a time.
  # The identical operation is done for each common column, so Reduce is useful:
  r <- Reduce(function(D, C) merge(D, DF2[c(C, result.col)], by=c(C), all.x=TRUE), x=common.cols, init=DF1)

  # The merge created cols like c('D.x', 'D.y').  These are the columns:
  merge.cols <- paste(result.col, c('x', 'y'), sep='.')

  # The .x and .y columns are partial, put them together:
  r[[result.col]] <- rowMeans(r[merge.cols], na.rm=TRUE)

  # Remove the temporaries:
  for (i in merge.cols) {
    r[[i]] <- NULL
  }
  return(r)
}

partial.merge(DF1, DF2)
##         B    A   C D
## 1 kittens cats  88 1
## 2 kittens <NA> 101 1
## 3 puppies dogs  99 2
## 4    <NA> dogs 110 2
Matthew Lundberg
  • 42,009
  • 6
  • 90
  • 112
  • While I don't mind if you select my answer as correct, I suggest that you actually try these on your real data and check the run time with `system.time()`. This is almost certainly the slowest of the three! – Matthew Lundberg Apr 30 '13 at 13:50