I have a data.table with quite a few columns. I need to loop through them and create new columns using some condition. Currently I am writing separate line of condition for each column. Let me explain with an example. Let us consider a sample data as -
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each=10),
tc = rep(c('C','D'), 10),
one = rnorm(20,1,1),
two = rnorm(20,2,1),
three = rnorm(20,3,1),
four = rnorm(20,4,1),
five = rnorm(20,5,2),
six = rnorm(20,6,2),
seven = rnorm(20,7,2),
total = rnorm(20,28,3))
For each of the columns from one to total, I need to create 4 new columns, i.e. mean, sd, uplimit, lowlimit for 2 sigma outlier calculation. I am doing this by -
DTnew <- DT[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x), sd = sd(x), uplimit = mean(x)+1.96*sd(x), lowlimit = mean(x)-1.96*sd(x))))), by = .(town,tc)]
This DTnew data.table I am then merging with my DT
DTmerge <- merge(DT, DTnew, by= c('town','tc'))
Now to come up with the outliers, I am writing separate set of codes for each variable -
DTAoutlier <- DTmerge[ ,one.Aoutlier := ifelse (one >= one.lowlimit & one <= one.uplimit,0,1)]
DTAoutlier <- DTmerge[ ,two.Aoutlier := ifelse (two >= two.lowlimit & two <= two.uplimit,0,1)]
DTAoutlier <- DTmerge[ ,three.Aoutlier := ifelse (three >= three.lowlimit & three <= three.uplimit,0,1)]
can some one help to simplify this code so that
I don't have to write separate lines of code for outlier. In this example we have only 8 variables but what if we had 100 variables, would we end up writing 100 lines of code? Can this be done using a for loop? How?
In general for data.table how can we add new columns retaining the original columns. So for example below I am taking log of columns 3 to 10. If I don't create a new DTlog it overwrites the original columns in DT. How can I retain the original columns in DT and have the new columns as well in DT.
DTlog <- DT[,(lapply(.SD,log)),by = .(town,tc),.SDcols=3:10]
Look forward to some expert suggestions.