
I have a data set with 30 variables. One of them is an indicator variable (0 or 1). For certain columns, I would like to subtract from every value the mean of the rows where the label is 1 (something like centering, but using the mean of certain rows instead of the entire column).

Col2 Col3 Col4 label
400  322  345  1    
131  345  809  1     
565  676  311  0    
121  645  777  0    
322  534  263  0    
545  222  111  0    

For the above dataset, I would like to perform the following operation for Col2:Col4:

x(i,j)-x'(,j)

where x(i,j) represents a cell, and x'(,j) represents the mean of column j over the rows for which label=1. For example, for [3,1] it should be

565 - mean(c(400, 131)) = 299.5

Expected output for Column 2:

Col2
134.5
-134.5
299.5
-144.5
56.5
279.5
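
As a check on the arithmetic, here is a minimal base R sketch that rebuilds the sample data and reproduces the Col2 values. The names `dat`, `ref_mean`, and `centered_col2` are assumptions chosen for illustration; only the numbers come from the question.

```r
# Rebuild the sample data from the question (the name `dat` is an assumption)
dat <- data.frame(
  Col2  = c(400, 131, 565, 121, 322, 545),
  Col3  = c(322, 345, 676, 645, 534, 222),
  Col4  = c(345, 809, 311, 777, 263, 111),
  label = c(1, 1, 0, 0, 0, 0)
)

# Mean of Col2 over the rows where label == 1: (400 + 131) / 2 = 265.5
ref_mean <- mean(dat$Col2[dat$label == 1])

# Subtract that reference mean from every row of Col2
centered_col2 <- dat$Col2 - ref_mean
print(centered_col2)
```

The printed vector should match the expected Col2 output listed above. Note that the values must be wrapped in a vector before taking the mean: `mean(400, 131)` would treat 131 as the `trim` argument and return 400.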

I have been trying to use summarise_each but have been unsuccessful so far. The command I'm using is:

try<- group_by(data,lbl) %>% select(c(4,13:26)) %>% summarise_each(funs((.)-(mean(data[data$lbl==1,])))

But this generates NA, and I'm not sure where I'm going wrong. (I suspect the problem is in the summarise_each call, where I can't figure out how to use funs() correctly.)

Any help is appreciated. Thanks!

Mridul Garg
  • Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap Jul 14 '16 at 16:35
  • @ProcrastinatusMaximus I have edited the question and I hope this makes it clearer. Thanks! – Mridul Garg Jul 14 '16 at 17:05
  • You want the mean of the columns but without values where `label == 1` ? – Steven Beaupré Jul 14 '16 at 17:31
  • @StevenBeaupré No, for each column I want to subtract the mean of rows for which label==1. – Mridul Garg Jul 14 '16 at 17:34
  • So the sum of the column (excluding value where `label == 1`) minus the mean of the values where `label == 1` ? Please provide your expected output. – Steven Beaupré Jul 14 '16 at 18:27
  • @StevenBeaupré I've made an edit showing the expected output for Col2. – Mridul Garg Jul 14 '16 at 18:44
  • I understand. I'll give it a shot when I get home. – Steven Beaupré Jul 14 '16 at 19:10

2 Answers

library(dplyr)
# For every column except label, subtract the mean of the rows where label == 1
dat %>% 
  mutate_each(funs(. - mean(.[label==1])), -label)
    Col2   Col3 Col4 label
1  134.5  -11.5 -232     1
2 -134.5   11.5  232     1
3  299.5  342.5 -266     0
4 -144.5  311.5  200     0
5   56.5  200.5 -314     0
6  279.5 -111.5 -466     0
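
A note for readers on newer dplyr releases: `mutate_each()` and `funs()` have since been deprecated. The following is a sketch of the same idea using `across()`, assuming dplyr 1.0.0 or later; it is not part of the original answer, and the data frame is rebuilt from the question so that the block runs on its own.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, where across() is available

# Sample data from the question
dat <- data.frame(
  Col2  = c(400, 131, 565, 121, 322, 545),
  Col3  = c(322, 345, 676, 645, 534, 222),
  Col4  = c(345, 809, 311, 777, 263, 111),
  label = c(1, 1, 0, 0, 0, 0)
)

# For every column except label, subtract the mean of the rows where label == 1
result <- dat %>%
  mutate(across(-label, ~ .x - mean(.x[label == 1])))

print(result)
```

Here `.x` stands for the column currently being transformed, and `label` is looked up in the data frame, so each column is centered on its own label == 1 mean.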
eipi10

Here's how I would do it:

sweep(df[1:3], 2, colMeans(df[df$label == 1,][1:3]))

Which gives:

#    Col2   Col3 Col4
#1  134.5  -11.5 -232
#2 -134.5   11.5  232
#3  299.5  342.5 -266
#4 -144.5  311.5  200
#5   56.5  200.5 -314
#6  279.5 -111.5 -466
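
To unpack what the `sweep()` call does, here is a sketch that spells out its arguments and repeats the computation column by column. `MARGIN = 2` lines the statistics up with the columns, and `FUN = "-"` is the default, which is why the one-liner above can omit it. The names `cols`, `ref_means`, `swept`, and `manual` are assumptions introduced for illustration.

```r
# Sample data from the question, named df as in the call above
df <- data.frame(
  Col2  = c(400, 131, 565, 121, 322, 545),
  Col3  = c(322, 345, 676, 645, 534, 222),
  Col4  = c(345, 809, 311, 777, 263, 111),
  label = c(1, 1, 0, 0, 0, 0)
)

cols <- c("Col2", "Col3", "Col4")

# One reference mean per column, computed over the rows where label == 1
ref_means <- colMeans(df[df$label == 1, cols])

# Subtract each column's reference mean from that column
swept <- sweep(df[cols], 2, ref_means, FUN = "-")

# The same computation written out one column at a time
manual <- as.data.frame(Map(function(col, m) col - m, df[cols], ref_means))

print(swept)
print(manual)
```

Both printed data frames should show the same values as the output above. Selecting columns by name rather than by position (`1:3`) is a design choice that keeps the code working if the column order changes.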

Another approach (admittedly more convoluted):

library(dplyr)
library(purrr)  # by_row() has since moved to the purrrlyr package

df %>%
  by_row(function(x) {
    x[1:3] - df %>%
      filter(label == 1) %>%
      summarise_each(funs(mean), -label) },
    .collate = "cols",
    .labels = FALSE
  )

And perhaps the most dplyr-esque method (inspired by this post):

cm <- df %>%
  filter(label == 1) %>%
  summarise_each(funs(mean), -label) 

df %>% 
  mutate_each(funs(. - cm$. ), -label)

Which gives:

#    Col2   Col3 Col4 label
#1  134.5  -11.5 -232     1
#2 -134.5   11.5  232     1
#3  299.5  342.5 -266     0
#4 -144.5  311.5  200     0
#5   56.5  200.5 -314     0
#6  279.5 -111.5 -466     0
Steven Beaupré