2

I want to do something VERY similar to this question: how to use merge() to update a table in R

but instead of just one column being the index, I want to match the new values on an arbitrary number of columns >=1.

foo <- data.frame(index1=c('a', 'b', 'b', 'd','e'),index2=c(1, 1, 2, 3, 2), value=c(100,NA, 101, NA, NA))

Which has the following values

foo
  index1 index2 value
1      a      1   100
2      b      1    NA
3      b      2   101
4      d      3    NA
5      e      2    NA

And the data frame bar

bar <- data.frame(index1=c('b', 'd'),index2=c(1,3), value=c(200, 201))

Which has the following values:

 bar
  index1 index2 value
1      b      1   200
2      d      3   201

merge(foo, bar, by='index', all=T) It results in this output:

Desired output:

foo
  index1 index2 value
1      a      1   100
2      b      1   200
3      b      2   101
4      d      3   201
5      e      2    NA
Community
  • 1
  • 1
wolfsatthedoor
  • 7,163
  • 18
  • 46
  • 90
  • Is there a reason `index2` is numeric for one data frame and a character for the other ? Also, do you want to merge based on both `index1` and `index2` ? – steveb Aug 18 '16 at 22:24
  • No sorry, Arun edited it , thanks. yes, merge based on both indices – wolfsatthedoor Aug 18 '16 at 22:59

4 Answers4

4

I think you don't need a merge but more to rbind and filter them later. Here I am using data.table for sugar syntax.

dx <- rbind(bar,foo)
library(data.table)
setDT(dx)
## note this can be applied to any number of index
setkeyv(dx,grep("index",names(dx),v=T))
## using unqiue to remove all duplicated 
## here it will remove the duplicated with missing values which is the 
## expected behavior
unique(dx)

#    index1 index2 value
# 1:      b      1   200
# 2:      b      2   101
# 3:      d      3   201
# 4:      a      1   100
# 5:      e      2    NA

you can be more explicit and filter your rows by group of indexs:

 dx[,ifelse(length(value)>1,value[!is.na(value)],value),key(dx)]
agstudy
  • 119,832
  • 17
  • 199
  • 261
  • 1
    I reckon this is a good approach, it's pretty straight-forward in base R too: `foo2 <- rbind(bar, foo); foo2[!duplicated(foo2[c("index1","index2")]),]` – thelatemail Aug 18 '16 at 22:41
  • 5
    Doesn't `foo[bar, on=c("index1","index2"), value := i.value]` do it? – Frank Aug 18 '16 at 23:17
  • 1
    @Frank - that's neat - so that's analogous to like an `update` statement in `sql`? – thelatemail Aug 18 '16 at 23:41
  • @thela I think so, yeah. I never thought of the analogy before now. – Frank Aug 18 '16 at 23:47
  • @robertevansanders you can add your own answer using this code. It is even recommendable to answer its own question in SO. – agstudy Aug 21 '16 at 21:03
1

Here's an R base approach

> temp <- merge(foo, bar, by=c("index1","index2"), all=TRUE)
> temp$value <- with(temp, ifelse(is.na(value.x) & is.na(value.y), NA, rowSums(temp[,3:4], na.rm=TRUE)))
> temp <- temp[, -c(3,4)]
> temp
  index1 index2 value
1      a      1   100
2      b      1   200
3      b      2   101
4      d      3   201
5      e      2    NA
Jilber Urbina
  • 58,147
  • 10
  • 114
  • 138
0

You can use some dplyr voodoo to produce what you want. The following subsets the data by unique combinations of "index1" and "index2", and checks the contents of "value" for each subset. If "value" has any non-NA values, those are returned. If only an NA value is found, that is returned.

Seems a little specific, but it seems to do what you want!

library(dplyr)

df.merged <- merge(foo, bar, all = T) %>% 
  group_by(index1, index2) %>% 
  do(
    if (any(!is.na(.$value))) {
      return(subset(., !is.na(value)))
    } else {
      return(.)
    }
  )

Output:

  index1 index2 value
  <fctr> <fctr> <dbl>
1      a      1   100
2      b      1   200
3      b      2   101
4      d      3   201
5      e      2    NA
jdobres
  • 11,339
  • 1
  • 17
  • 37
0

You can specify as many columns as you want with merge:

out <- merge(foo, bar, by=c("index1", "index2"), all.x=TRUE)
new <- apply(out[,3:4], 1, function(x) sum(x, na.rm=TRUE))
new <- ifelse(is.na(out[,3]) & is.na(out[,4]), NA, new)
out <- cbind(out[,1:2], new)
AidanGawronski
  • 2,055
  • 1
  • 14
  • 24
  • This doesn't give the correct output, it gives columns .x and columns .y – wolfsatthedoor Aug 18 '16 at 22:56
  • You don't have to specify `by.x` and `by.y` if they are identical, just `by=` will do. – thelatemail Aug 18 '16 at 23:31
  • cool, I didn't know that! In this case foo and bar both have the same column **value** which causes my solution to fail if excluded. – AidanGawronski Aug 18 '16 at 23:34
  • @AidanGawronski - it shouldn't `identical(merge(foo, bar, by=c("index1", "index2"), all.x=TRUE), merge(foo, bar, by.x=c("index1", "index2"), by.y=c("index1", "index2"), all.x=TRUE))` returns `TRUE` – thelatemail Aug 18 '16 at 23:38
  • oh! I misunderstood what you meant. Thanks! – AidanGawronski Aug 19 '16 at 00:54
  • A coherent simple merge solution would have been nice though. Anyone familiar with database management know what this type of merge I'm trying to do is called? (Merging on an index to complete columns) – wolfsatthedoor Aug 19 '16 at 14:46
  • basically you are doing a join and then summing two columns.. The solution would be much more simple if you allowed NA's to be replaced by zeros. – AidanGawronski Aug 19 '16 at 15:16
  • `out <- merge(foo, bar, by=c("index1", "index2"), all.x=TRUE)` `out[is.na(out)] <- 0` `out <- cbind(out[,1:2], out[,3] + out[,4])` `names(out) <- c("index1", "index2", "value")` – AidanGawronski Aug 19 '16 at 15:22