0

A lot comes together in this question. First off all I would like to segment the data by column c. The subsets are given by the factor c: the levels are 1 to 4. So 4 distinct segments. Next I have two columns. Column a and b. I would like to replace the NA's with the maximum value of each segment specific column. So for example, NA at row 3 and column 'a', this would be 30. (b,3) would be 80, (b,8) would be 50 and (a, 5) would be 80.

I have created the code below that does the job, but now I need to make it automatic (like a for loop) for all segments and columns. How could I do this?

a <- c(10,NA,30,40,NA,60,70,80,90,90,80,90,10,40)
b <- c(80,70,NA,50,40,30,20,NA,0,0,10,69, 40, 90)
c <- c(1,1,1,2,2,2,2,2,3,3,3,4,4,4)


      a  b c
 1:  10 80 1
 2:  NA 70 1
 3:  30 NA 1
 4:  40 50 2
 5:  NA 40 2
 6:  60 30 2
 7:  70 20 2
 8:  80 NA 2
 9:  90  0 3
 10: 90  0 3
 11: 80 10 3
 12: 90 69 4
 13: 10 40 4
 14: 40 90 4


mytable <- data.table(a,b,c)

mytable[which(is.na(mytable[c == 1][,1, with = FALSE]) == TRUE),1] <- max(mytable[c==1,1], na.rm = TRUE)

Unfortunately, this try results in an error:

for(i in unique(mytable$c)){
  for(j in unique(c(1:2))){
    mytable[which(is.na(mytable[c == i][,j, with = FALSE]) == TRUE),j, with = FALSE] <- max(mytable[c==i][,j, with = FALSE], na.rm = TRUE)
  }
}

Error in [<-.data.table(*tmp*, which(is.na(mytable[c == i][, j, with = FALSE]) == : unused argument (with = FALSE)

Surprisingly, this results in an error as well:

for(i in unique(mytable$c)){
  for(j in unique(c(1:2))){
    mytable[which(is.na(mytable[c == i][,j]) == TRUE),j] <- max(mytable[c==i,j], na.rm = TRUE)
  }
}

Error in [.data.table(mytable, c == i, j) : j (the 2nd argument inside [...]) is a single symbol but column name 'j' is not found. Perhaps you intended DT[,..j] or DT[,j,with=FALSE]. This difference to data.frame is deliberate and explained in FAQ 1.1.

cappuccino
  • 325
  • 3
  • 13
  • Using `dplyr` : `mytable %>% group_by(c) %>% mutate(a = ifelse(is.na(a),max(a,na.rm=TRUE),a), b = ifelse(is.na(b),max(b,na.rm=TRUE),b))` – count Mar 20 '17 at 13:48

3 Answers3

4
library("data.table")

mytable <- data.table(
a=c(10,NA,30,40,NA,60,70,80,90,90,80,90,10,40),
b=c(80,70,NA,50,40,30,20,NA,0,0,10,69, 40, 90),
c=c(1,1,1,2,2,2,2,2,3,3,3,4,4,4))

foo <- function(x) { x[is.na(x)] <- max(x, na.rm=TRUE); x }

mytable[, .(A=foo(a), B=foo(b)), by=c]

result:

> mytable[, .(A=foo(a), B=foo(b)), by=c]
#    c  A  B
# 1: 1 10 80
# 2: 1 30 70
# 3: 1 30 80
# 4: 2 40 50
# 5: 2 80 40
# 6: 2 60 30
# 7: 2 70 20
# 8: 2 80 50
# 9: 3 90  0
#10: 3 90  0
#11: 3 80 10
#12: 4 90 69
#13: 4 10 40
#14: 4 40 90

or for direct substitution of a and b:

mytable[, `:=`(a=foo(a), b=foo(b)), by=c] # or
mytable[, c("a", "b") := (lapply(.SD, foo)), by = c]  # from @Sotos

or the safer variant (tnx to @Frank for the remark):

cols <- c("a", "b")
mytable[, (cols) := lapply(.SD, foo), by=c, .SDcols=cols]
jogo
  • 12,469
  • 11
  • 37
  • 42
2

Using data.table

library(data.table)
mytable[, a := ifelse(is.na(a), max(a, na.rm = TRUE), a), by = c]
mytable[, b := ifelse(is.na(b), max(b, na.rm = TRUE), b), by = c]

Or in a single command

mytable[, c("a", "b") := lapply(.SD, function(x) ifelse(is.na(x), max(x, na.rm = TRUE), x)), .SDcols = c("a", "b"), by = c]
manotheshark
  • 4,297
  • 17
  • 30
0

Use ddply() from package plyr:

df<-data.frame(a,b,c=as.factor(c))
library(plyr)
df2<-ddply(df, .(c), transform, a=ifelse(is.na(a), max(a, na.rm=T),a), 
           b=ifelse(is.na(b), max(b, na.rm=T),b))
Axeman
  • 32,068
  • 8
  • 81
  • 94
user3640617
  • 1,546
  • 13
  • 21