9

I have a data table with 10 columns.

town    
tc  
one  
two  
three   
four    
five    
six  
seven   
total

Need to generate mean for columns "one" to "total" for which I am using,

DTmean <- DT[,(lapply(.SD,mean)),by = .(town,tc),.SDcols=3:10]

This generates the mean, but then I want the column names to be suffixed with "_mean". How can we do this? Want the first two columns to remain the same as "town" and "tc". I tried the below but then it renames all "one" to "total" to just "_mean"

for (i in 3:10) {
  setnames(DTmean,i,paste0(names(i),"_mean"))
}
zyurnaidi
  • 2,143
  • 13
  • 14
user1412
  • 709
  • 1
  • 8
  • 25
  • 2
    Can you add your data to this question. You can use `dput` on your data and post the output. – steveb May 10 '16 at 15:04
  • 1
    @zyurnaidi using `names<-` on a `data.table` creates a copy. Check `address(DTmean)` before and after your proposed solution. `setnames` does not suffer from this. – MichaelChirico May 10 '16 at 15:48
  • Tip: When you ask a question, it is always good to include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it easier for others to help you. – Jaap May 10 '16 at 16:27

1 Answers1

19

If you want to do it the data.table way, you should use setnames as follows:

setnames(DTmean, 3:10, paste0(names(DT)[3:10], '_mean'))

or:

cols <- names(DT)[3:10]
setnames(DTmean, cols, paste0(cols, '_mean'))

Furthermore, you don't need the .SDcols statement as you are aggregating all the other columns. Using DT[, lapply(.SD,mean), by = .(town,tc)] should thus give you the same result as using DT[, (lapply(.SD,mean)), by = .(town,tc), .SDcols=3:10].


On the following example dataset:

set.seed(71)
DT <- data.table(town = rep(c('A','B'), each=10),
                 tc = rep(c('C','D'), 10),
                 one = rnorm(20,1,1),
                 two = rnorm(20,2,1),
                 three = rnorm(20,3,1),
                 four = rnorm(20,4,1),
                 five = rnorm(20,5,2),
                 six = rnorm(20,6,2),
                 seven = rnorm(20,7,2),
                 total = rnorm(20,28,3))

using:

DTmean <- DT[, lapply(.SD,mean), by = .(town,tc)]
setnames(DTmean, 3:10, paste0(names(DT)[3:10], '_mean'))

gives:

> DTmean
   town tc  one_mean two_mean three_mean four_mean five_mean six_mean seven_mean total_mean
1:    A  C 1.7368898 1.883586   3.358440  4.849896  4.742609 5.089877   6.792513   29.20286
2:    A  D 0.8906842 1.826135   3.267684  3.760931  6.210145 7.320693   5.571687   26.56142
3:    B  C 1.4037955 2.474836   2.587920  3.719658  3.446612 6.510183   8.309784   27.80012
4:    B  D 0.8103511 1.153000   3.360940  3.945082  5.555999 6.198380   8.652779   28.95180

In reply to your comment: If you want to calculate both the mean and the sd simultanuously, you could do (adapted from my answer here):

DT[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x), sd = sd(x))))), by = .(town,tc)]

which gives:

   town tc  one.mean    one.sd two.mean    two.sd three.mean  three.sd four.mean  four.sd five.mean   five.sd six.mean    six.sd seven.mean seven.sd total.mean total.sd
1:    A  C 0.2981842 0.3556520 1.578174 0.7788545   2.232366 0.9047046  4.896201 1.238877  4.625866 0.7436584 7.607439 1.7262628   7.949366 1.772771   28.94287 3.902602
2:    A  D 1.2099018 1.0205252 1.686068 1.5497989   2.671027 0.8323733  4.811279 1.404794  7.235969 0.7883873 6.765797 2.7719942   6.657298 1.107843   27.42563 3.380785
3:    B  C 0.9238309 0.6679821 2.525485 0.8054734   3.138298 1.0111270  3.876207 0.573342  3.843140 2.1991052 4.942155 0.7784024   6.783383 2.595116   28.95243 1.078307
4:    B  D 0.8843948 0.9384975 1.988908 1.0543981   3.673393 1.3505701  3.957534 1.097837  2.788119 1.9089660 6.463784 0.7642144   6.416487 2.041441   27.88205 3.807119

However, it is highly probable better to store this in long format. To get this you could use data.table's melt function as follows:

cols <- names(DT)[3:10]
DT2 <- melt(DT[, as.list(unlist(lapply(.SD, function(x) list(mn = mean(x), sdev = sd(x))))), by = .(town,tc)], 
            id.vars = c('town','tc'), 
            measure.vars = patterns('.mn','.sdev'),
            value.name = c('mn','sdev'))[, variable := cols[variable]]

or in a much simpler operation:

DT2 <- melt(DT, id.vars = c('town','tc'))[, .(mn = mean(value), sdev = sd(value)), by = .(town,tc,variable)]

which results in:

> DT2
    town tc variable         mn      sdev
 1:    A  C      one  0.2981842 0.3556520
 2:    A  D      one  1.2099018 1.0205252
 3:    B  C      one  0.9238309 0.6679821
 4:    B  D      one  0.8843948 0.9384975
 5:    A  C      two  1.5781743 0.7788545
 6:    A  D      two  1.6860675 1.5497989
 7:    B  C      two  2.5254855 0.8054734
 8:    B  D      two  1.9889082 1.0543981
 9:    A  C    three  2.2323655 0.9047046
10:    A  D    three  2.6710267 0.8323733
11:    B  C    three  3.1382982 1.0111270
12:    B  D    three  3.6733929 1.3505701
.....

In response to your last comments, you can detect outliers as follows:

DT3 <- melt(DT, id.vars = c('town','tc'))
DT3[, `:=` (mn = mean(value), sdev = sd(value)), by = .(town,tc,variable)
    ][, outlier := +(value < mn - sdev | value > mn + sdev)]

which gives:

     town tc variable      value         mn     sdev outlier
  1:    A  C      one  0.5681578  0.2981842 0.355652       0
  2:    A  D      one  0.5528128  1.2099018 1.020525       0
  3:    A  C      one  0.5214274  0.2981842 0.355652       0
  4:    A  D      one  1.4171454  1.2099018 1.020525       0
  5:    A  C      one  0.5820994  0.2981842 0.355652       0
 ---                                                        
156:    B  D    total 23.4462542 27.8820524 3.807119       1
157:    B  C    total 30.5934956 28.9524305 1.078307       1
158:    B  D    total 30.5618759 27.8820524 3.807119       0
159:    B  C    total 27.5940307 28.9524305 1.078307       1
160:    B  D    total 24.8378437 27.8820524 3.807119       0
Community
  • 1
  • 1
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • Thank you!! Your comment on SDcols also makes sense and I should avoid it. I was also trying to generate standard deviation(sd) of the columns 3:10.....but it seems not to be working as for the same set of columns I want to have mean and sd both keeping the original value. How can this be achieved? – user1412 May 10 '16 at 16:54
  • I probably can create a separate data table for sd. DTsd <- DT[, lapply(.SD,sd), by = .(town,tc)] setnames(DTmean, 3:10, paste0(names(DT)[3:10], '_sd')) Is there a way to combine both in one datatable? – user1412 May 10 '16 at 16:59
  • Thank you Maximus! This was quite helpful. I extended by code to generate few natural logs and then wanted to check if the values are in a upper and lower limit ( 2 sigma method). Now I have achieved it by writing multiple lines of if conditions which I am pasting below, – user1412 May 12 '16 at 14:04
  • Thank you Maximus! This was quite helpful. I extended by code to generate few natural logs and then wanted to check if the values are in a upper and lower limit ( 2 sigma method). Now I have achieved it by writing multiple lines of if conditions which for each column I am pasting below,Is there a away to write this in couple of lines? – user1412 May 12 '16 at 14:13
  • 'DTAoutlier <- DTmerge[ ,one.Aoutlier := ifelse (one.log >= one.log.lowlimit & one.log <= one.log.uplimit,0,1)] DTAoutlier <- DTmerge[ ,two.Aoutlier := ifelse (two.log >= two.log.lowlimit & two.log <= two.log.uplimit,0,1)] DTAoutlier <- DTmerge[ ,three.Aoutlier := ifelse (three.log >= three.log.lowlimit & three.log <= three.log.uplimit,0,1)]' So I have written like this for each column. The order of columns is one.mean, one.sd, one.log.uplimit, one.log.lowlimit and so on for other columns – user1412 May 12 '16 at 14:14
  • Also can you please suggest a good resource for understanding more on data.table. I have just started using it and find it quite useful and powerful – user1412 May 12 '16 at 14:15
  • @Prasad As I suggested in the last part of my answer, it is better to reshape the results into long format. This makes it a lot easier to do further calculations with it. For a good resource on `data.table`, see the [*"Getting Started"*](https://github.com/Rdatatable/data.table/wiki/Getting-started) pages on Github. – Jaap May 12 '16 at 14:23
  • Thank you Maximus for sharing the link. My manager wants me to covert shorten the code for outlier using loop. I have been trying but not successful. His argument is what is there are 100 variables, will we end by writing 100 lines of ifelse code. do you think you can suggest some way to reduce this code for the outlier? current code is...... – user1412 May 19 '16 at 11:29
  • `DTAoutlier <- DTmerge[ ,one.Aoutlier := ifelse (one.log >= one.log.lowlimit & one.log <= one.log.uplimit,0,1)] DTAoutlier <- DTmerge[ ,two.Aoutlier := ifelse (two.log >= two.log.lowlimit & two.log <= two.log.uplimit,0,1)] DTAoutlier <- DTmerge[ ,three.Aoutlier := ifelse (three.log >= three.log.lowlimit & three.log <= three.log.uplimit,0,1)]` – user1412 May 19 '16 at 11:29
  • @Prasad See the update. Note: I used another measure to detect the outlier, but I hope you will be able to apply the method to your specific situation. – Jaap May 22 '16 at 17:13
  • Thank you!! this was quite interesting and I managed to replicate it. One thing I wanted to check is why did we use `+` in the code `outlier := +`? Then by using melt, we end up creating a new data.table. now is there a way to merge the outlier back in original file with respective outlier columns and their values in them. Some thing like **case to vars in SPSS**. – user1412 May 23 '16 at 13:08
  • @Prasad Using `+` is actually a tric, it is shorthand for converting a `TRUE/FALSE` variable to logical numeric variable with `1` for `TRUE` and `0` for `FALSE`. The more idomatic way is to use `as.integer` or `as.numeric`. With regard to merging the outlier-indicator variable back into the original dataset, see [this Q&A about joining and merging](http://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) (for which you also might need to convert `DToutlier` to wide as well). – Jaap May 23 '16 at 13:24
  • Thank you. In order to merge the data I would first need to restructure the data. So I want to re-structure this data so that the restructured data has same records and these variable each for "mn", "sdev", "out". How can I achieve this? I was trying dcast but not very clear of `~` and `+` how to use in the formula.....not clearly mentioned in ?dcast and other notes. Do you have something to share which explains this with an example. can the requirement be achieved using dcast? – user1412 May 25 '16 at 11:59