5

I have a data.table with quite a few columns. I need to loop through them and create new columns using some condition. Currently I am writing separate line of condition for each column. Let me explain with an example. Let us consider a sample data as -

set.seed(71)

DT <- data.table(town = rep(c('A','B'), each=10),
                 tc = rep(c('C','D'), 10),
                 one = rnorm(20,1,1),
                 two = rnorm(20,2,1),
                 three = rnorm(20,3,1),
                 four = rnorm(20,4,1),
                 five = rnorm(20,5,2),
                 six = rnorm(20,6,2),
                 seven = rnorm(20,7,2),
                 total = rnorm(20,28,3))

For each of the columns from one to total, I need to create 4 new columns, i.e. mean, sd, uplimit, lowlimit for 2 sigma outlier calculation. I am doing this by -

DTnew <- DT[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x), sd = sd(x), uplimit = mean(x)+1.96*sd(x), lowlimit = mean(x)-1.96*sd(x))))), by = .(town,tc)]

This DTnew data.table I am then merging with my DT

DTmerge <- merge(DT, DTnew, by= c('town','tc'))

Now to come up with the outliers, I am writing separate set of codes for each variable -

DTAoutlier <- DTmerge[ ,one.Aoutlier := ifelse (one >= one.lowlimit & one <= one.uplimit,0,1)]
DTAoutlier <- DTmerge[ ,two.Aoutlier := ifelse (two >= two.lowlimit & two <= two.uplimit,0,1)]
DTAoutlier <- DTmerge[ ,three.Aoutlier := ifelse (three >= three.lowlimit & three <= three.uplimit,0,1)]

can some one help to simplify this code so that

  1. I don't have to write separate lines of code for outlier. In this example we have only 8 variables but what if we had 100 variables, would we end up writing 100 lines of code? Can this be done using a for loop? How?

  2. In general for data.table how can we add new columns retaining the original columns. So for example below I am taking log of columns 3 to 10. If I don't create a new DTlog it overwrites the original columns in DT. How can I retain the original columns in DT and have the new columns as well in DT.

    DTlog <- DT[,(lapply(.SD,log)),by = .(town,tc),.SDcols=3:10]

Look forward to some expert suggestions.

user1412
  • 709
  • 1
  • 8
  • 25
  • 1
    I know you're looking particularly for a `data.table` solution, but it might be helpful to note that what you're looking for is essentially a `data.table` equivalent to `dplyr`'s `mutate_each` – shrgm May 20 '16 at 06:14
  • @shreyasgm post it as an answer – eddi May 20 '16 at 15:41
  • @Prasad what's the point of doing this? why are you writing the same 4 numbers over and over again in a single group? – eddi May 20 '16 at 15:43

3 Answers3

5

We can do this using :=. We subset the column names that are not the grouping variables ('nm'). Create a vector of names to assign for the new columns using outer ('nm1'). Then, we use the OP's code, unlist the output and assign (:=) it to 'nm1' to create the new columns.

nm <- names(DT)[-(1:2)]

nm1 <- c(t(outer(c("Mean", "SD", "uplimit", "lowlimit"), nm, paste, sep="_")))

DT[, (nm1):= unlist(lapply(.SD, function(x) { Mean = mean(x)
                                  SD = sd(x)
                     uplimit = Mean + 1.96*SD
                     lowlimit = Mean - 1.96*SD
             list(Mean, SD, uplimit, lowlimit) }), recursive=FALSE) ,
                    .(town, tc)]

The second part of the question involves doing a logical comparison between columns. One option would be to subset the initial columns, the 'lowlimit' and 'uplimit' columns separately and do the comparison (as these have the same dimensions) to get a logical output which can be coerced to binary with +. Then assign it to the original dataset to create the outlier columns.

m1 <- +(DT[, nm, with = FALSE] >= DT[, paste("lowlimit", nm, sep="_"), 
          with = FALSE] & DT[, nm, with = FALSE] <= DT[, 
            paste("uplimit", nm, sep="_"), with = FALSE])
DT[,paste(nm, "Aoutlier", sep=".") := as.data.frame(m1)]

Or instead of comparing data.tables, we can also use a for loop with set (which would be more efficient)

nm2 <- paste(nm, "Aoutlier", sep=".")
DT[, (nm2) := NA_integer_]
for(j in nm){
 set(DT, i = NULL, j = paste(j, "Aoutlier", sep="."), 
   value = as.integer(DT[[j]] >= DT[[paste("lowlimit", j, sep="_")]] & 
           DT[[j]] <= DT[[paste("uplimit", j, sep="_")]]))
 }

The 'log' columns can also be created with :=

DT[,paste(nm, "log", sep=".") := lapply(.SD,log),by = .(town,tc),.SDcols=nm]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thank you Akrun!! Could you also please explain the + part of the code. What actually + is doing and why are you using `with = FALSE`. Then the for loop, please elaborate more on the set part. why are we saying `i=NULL`? Also why is it necessary to give two square brackets `DT[[j]]`? Sorry I may be asking some basic questions but it would great if you could clarify. Also can you point to some good source to understand more on data.tables and examples on such manipulations? – user1412 May 20 '16 at 17:21
  • 1
    @Prasad The `+` is a hacky way of converting the logical matrix to binary matrix. It does the same way as `as.integer` or `as.numeric` but it will also preserve the structure. If you read the `?data.table`, it is mentioned about the `with = FALSE` for subsetting the columns. We are not setting any condition for the row as all the rows in the column are changed. If there were a particular condition like only change the values in rows where the rows have some value, we could have specify it in `i`. You can read more about the data.table by looking at the vignettes. – akrun May 20 '16 at 17:31
3

Your data should probably be in long format:

m = melt(DT, id=c("town","tc"))

Then just write your test once

m[, 
  is_outlier := +(abs(value-mean(value)) > 1.96*sd(value))
, by=.(town, tc, variable)]

I see no outliers in this data (according to the given definition of outlier):

m[, .N, by=is_outlier] # this is a handy alternative to table()

#    is_outlier   N
# 1:          0 160

How it works

  • melt keeps the id columns and stacks all the rest into
    • variable (column names)
    • value (column contents)
  • +x does the same thing as as.integer(x), coercing TRUE/FALSE to 1/0

If you really like your data in wide format, though:

vjs = setdiff(names(DT), c("town","tc"))
DT[, 
  paste0(vjs,".out") := lapply(.SD, function(x) +(abs(x-mean(x)) > 1.96*sd(x)))
, by=.(town, tc), .SDcols=vjs]
Frank
  • 66,179
  • 8
  • 96
  • 180
  • Hi Frank, not very clear with the test code. why are we writing `+(abs(value-mean(value))`? and then in by part what is `variable`? – user1412 May 20 '16 at 17:25
  • @Prasad Sorry, forgot to explain those, but have edited in now. Let me know if anything else is unclear – Frank May 20 '16 at 17:38
0

For completeness, it should be noted that dplyr's mutate_each provides a handy way of tackling such problems:

library(dplyr)

result <- DT %>%
    group_by(town,tc) %>%
    mutate_each(funs(mean,sd,
                     uplimit = (mean(.) + 1.96*sd(.)),
                     lowlimit = (mean(.) - 1.96*sd(.)),
                     Aoutlier = as.integer(. >= mean(.) - 1.96*sd(.) &
                                               . <= mean(.) - 1.96*sd(.))),
                -town,-tc)
shrgm
  • 1,315
  • 1
  • 10
  • 20