I created a column of means for each group based on a criterion C. Now I want those means filled in over the entire column, even where criterion C does not hold. In other words, I want to replace the NAs with the mean value calculated for that group. The grp, val and C columns are shown in the following data.table:

    grp val C
 1:   1  NA 0
 2:   1  NA 0
 3:   1  42 1
 4:   1  42 1
 5:   2  16 1
 6:   2  16 1
 7:   2  NA 0
 8:   2  NA 0
 9:   3  32 1
10:   3  32 1
11:   3  32 1
12:   3  32 1

So I want to replace the NAs in val with the mean value of the same group. Here is sample code showing my attempt. Basically, I extract another data.table, remove the NAs and duplicates, and then try to merge it with the original table.

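library(data.table)

x <- data.table(grp = c(1,1,1,1,2,2,2,2,3,3,3,3),
                val = c(NA,NA,42,42,16,16,NA,NA,32,32,32,32),
                C   = c(0,0,1,1,1,1,0,0,1,1,1,1))
y <- x[!is.na(val), ]       # keep only rows with a value
y <- y[!duplicated(y), ]    # one row per group
setkey(x, grp)
setkey(y, grp)
x[y, val := val, by = grp]  # attempted update join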

While this does not give any errors, it leaves the original column val untouched. What am I doing wrong? What would be a better approach?

Gullydwarf
  • Do you just want `x[, val := mean(val, na.rm = TRUE), grp]`? – David Arenburg Jan 20 '15 at 14:01
  • Seems so... :o I've spent almost a day's work on this... thanks! – Gullydwarf Jan 20 '15 at 14:08
  • I'd suggest you spend some time reading about how to use `data.table` appropriately. Start [**here**](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro-vignette.html) – David Arenburg Jan 20 '15 at 14:11
  • If only you knew how much time I spent reading about data.table :( I guess I was suffering from some tunnel vision... another mean calculation... :P Perhaps, as you are so experienced with it, you could write some more documentation on different merges and other operations, as I feel there is a significant lack of it. That link seems awesome at first sight though! – Gullydwarf Jan 20 '15 at 14:14
  • And don't forget to answer your question to remove it from the unanswered list :) You might expect better documentation in 1.9.6. – jangorecki Jan 20 '15 at 15:59
  • Dplyr or plyr will help too. – Alex Brown Jan 20 '15 at 16:23
  • @AlexBrown I don't see how `dplyr` or especially `plyr` can provide a better solution than a simple line with `data.table` syntax – David Arenburg Jan 20 '15 at 20:21
  • Data.table is no less core than dplyr, and significantly less so than plyr. There's nothing simple about data.table. – Alex Brown Jan 20 '15 at 20:22
  • @AlexBrown First, this is a data.table question. Second, what is complicated about `x[, val := mean(val, na.rm = TRUE), grp]`? Why would the OP need to load some other package that will never achieve this performance/simplicity (neither has an assignment-by-reference operator like `:=`)? Also, how is `plyr` more core than `data.table`? Not to mention it has worse performance than even base R and it is completely outdated. – David Arenburg Jan 20 '15 at 21:14
  • Fair enough, I missed that it was a data.table question – Alex Brown Jan 20 '15 at 21:15
  • Plyr is more core because it only uses standard R primitives. Although it does have that performance issue you mention. – Alex Brown Jan 20 '15 at 21:16
  • Although base has performance issues too and that never seemed to cause a problem. Sorry, hobby horse of mine. – Alex Brown Jan 20 '15 at 21:17
  • @DavidArenburg if you would post an answer I will accept it. Or should I post an answer myself? :P Of course it is questionable if this post should remain on StackOverflow because of the simplicity of the problem/solution.. – Gullydwarf Jan 21 '15 at 09:44
  • Added an answer, as it seems like it could be useful for future readers. – David Arenburg Jan 21 '15 at 11:40

2 Answers

This question seems to be generating a lot of "noise", so I'll post this as an answer.

data.table has an "assignment by reference" operator, `:=` (see here for more info and use cases/benchmarks).

This operator assigns values to all the members of a particular group (although you can also use it without grouping by anything), similar to `mutate` in dplyr or `ave` and `transform` in base R. The difference is that it does so by reference, i.e., it updates the data set itself instead of creating a copy, as happens when you reassign with the `<-` operator. That isn't too important for this question specifically, but it is probably its greatest advantage over the equivalents in other packages/base R.
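As a minimal sketch of what "by reference" means in practice (the names `dt`, `alias` and `snapshot` are made up for illustration): a plain `<-` only binds a second name to the same data.table, so a later `:=` shows up under both names, while `copy()` produces an independent object.

```r
library(data.table)

dt <- data.table(a = 1:3)

alias    <- dt        # no copy: both names refer to the same table
snapshot <- copy(dt)  # explicit, independent copy

dt[, b := a * 2L]     # adds column b in place

names(alias)          # the new column is visible through the alias
names(snapshot)       # the copy is unaffected
```

If the original table must stay untouched, calling `copy()` before the `:=` is the usual way to get that.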

To sum things up, if you want to calculate some metric per group and assign it to each value in that particular group, use `:=`.

On the other hand, if you just want the summary, use `=` instead (in combination with `list()` or just `.()`); if you don't need to name the result of the aggregation, you can omit them altogether. For example:

x[, .(val = mean(val, na.rm = TRUE)), grp] 

Or

x[, list(val = mean(val, na.rm = TRUE)), grp]

Or just

x[, mean(val, na.rm = TRUE), grp] # will call the aggregated variable `V1` by default

The equivalents in dplyr would be `summarise`, and in base R `aggregate` or sometimes `tapply`.
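To make the difference between the two forms concrete, here is a small sketch on the question's data (the names `grp_means` and `filled` are only illustrative). The summary returns a new table with one row per group and leaves `x` alone, whereas the grouped `:=` keeps every row; it is applied to a copy here so that both results can be compared.

```r
library(data.table)

# Same toy data as in the question (column C omitted)
x <- data.table(
  grp = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  val = c(NA, NA, 42, 42, 16, 16, NA, NA, 32, 32, 32, 32)
)

# Summary: one row per group, x itself is not modified
grp_means <- x[, .(val = mean(val, na.rm = TRUE)), by = grp]
nrow(grp_means)  # 3

# Grouped assignment on a copy: as many rows as x, no NAs left in val
filled <- copy(x)[, val := mean(val, na.rm = TRUE), by = grp]
nrow(filled)     # 12
```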


That being said, in your specific case you would use the `:=` operator to assign the per-group mean to each value in that group, as in:

x[, val := mean(val, na.rm = TRUE), grp]
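Two caveats, sketched under assumptions. First, the line above overwrites every value in a group with the group mean; that is harmless for the question's data, where the non-missing values are already constant within each group, but on data like the hypothetical `d` below you would probably want to touch only the NAs. Second, the join from the question most likely left `val` unchanged because `val` inside `j` resolves to the column of `x` itself; prefixing it with `i.` refers to the column of the table being joined in.

```r
library(data.table)

# Hypothetical data where values differ within a group
d <- data.table(grp = c(1, 1, 2, 2, 2), val = c(NA, 10, 20, 40, NA))

# Replace only the missing entries with the group mean
d[, val := replace(val, is.na(val), mean(val, na.rm = TRUE)), by = grp]
d$val  # 10 10 20 40 30

# The question's join approach, with i.val pointing at y's column
x <- data.table(
  grp = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  val = c(NA, NA, 42, 42, 16, 16, NA, NA, 32, 32, 32, 32)
)
y <- unique(x[!is.na(val)])      # one row per group with its value
x[y, on = "grp", val := i.val]   # update join by reference
```

The grouped `:=` remains the simpler route; the join variant is mainly useful when the per-group values already live in a separate lookup table.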
David Arenburg

For imputing NAs with the group mean, both data.table and dplyr work well (data.table vs dplyr is a separate discussion). See David Arenburg's comment for the data.table code that replaces NA with the mean.

Using dplyr:

library(dplyr)
# df holds the grp and val columns (x in the question); ifelse() would also work in place of replace()
df %>% group_by(grp) %>%
  mutate(val = replace(val, is.na(val), mean(val, na.rm = TRUE)))

A less elegant way is to combine a custom function with ddply:

library(plyr)
# function to replace NA with mean for that group
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

df <- ddply(df, ~ grp, transform, val = impute.mean(val))
Manohar Swamynathan
  • Why is using `ddply` *the best* way? What's wrong with just the `data.table` syntax he is already using? And if you are already going down the `plyr` path, why do you need to create a new function? What's wrong with just `ddply(x, ~ grp, transform, val = mean(val, na.rm = TRUE))`? – David Arenburg Jan 20 '15 at 20:45
  • I realized my mistake, thanks David. I have now edited the answer and upvoted your comments containing the code. – Manohar Swamynathan Jan 21 '15 at 03:35