2

I am trying to figure out how to calculate the average,median and standard deviation for each value of each variable. Here is some of the data (thanks to @Barranka for providing the data in a easy-to-copy format):

df <- data.frame(
  gama=c(10, 1, 1, 1, 1, 1, 10, 0.1, 10), 
  theta=c(1, 1, 1, 1, 0.65, 1, 0.65, 1, 1), 
  detectl=c(3, 5, 1, 1, 5, 3, 5, 5, 1), 
  NSMOOTH=c(10, 5, 20, 20, 5, 20, 10, 10, 40), 
  NREF=c(50, 80, 80, 50, 80, 50, 10, 100, 30), 
  NOBS=c(10, 40, 40, 20, 20, 20, 10, 40, 10), 
  sma=c(15, 15, 15, 15, 15, 15, 15, 15, 15), 
  lma=c(33, 33, 33, 33, 33, 33, 33, 33, 33), 
  PosTrades=c(11, 7, 6, 3, 9, 3, 6, 6, 5), 
  NegTrades=c(2, 2, 1, 0, 1, 0, 1, 5, 1), 
  Acc=c(0.846154, 0.777778, 0.857143, 1, 0.9, 1, 0.857143, 0.545455, 0.833333), 
  AvgWin=c(0.0451529, 0.0676022, 0.0673241, 0.13204, 0.0412913, 0.126522, 0.0630061, 0.0689745, 0.0748437), 
  AvgLoss=c(-0.0194498, -0.0083954, -0.0174653, NaN, -0.00264179, NaN, -0.0161558, -0.013903, -0.0278908), Return=c(1.54942, 1.54916, 1.44823, 1.44716, 1.42789, 1.42581, 1.40993, 1.38605, 1.38401)
)

To save it into csv later, i have to make it into data frame that supposed to be like this

Table for gama
Value   Average        Median           Standard Deviation
10      (Avg of 10)    (median of 10)   (Stdev of 10)
1       (Avg of 1)     (median of 1)    (Stdev of 1)
0.1     (Avg of 0.1)   (median of 0.1)  (Stdev of 0.1)

Table for theta
Value   Average        Median           Standard Deviation
1      (Avg of 10)    (median of 10)    (Stdev of 10)
0.65   (Avg of 0.65)  (median of 0.65)  (Stdev of 0.65)

Table for detectionsLimit
Value   Average        Median           Standard Deviation
3      (Avg of 3)     (median of 3)     (Stdev of 3)
5      (Avg of 5)     (median of 5)     (Stdev of 5)
...

The columns to be used as ID's are:

ids <- c("gama", "theta","detectl", "NSMOOTH", "NREF", "NOBS", "sma", "lma")

Summary statistics should be computed over the following columns:

vals <- c("PosTrades", "NegTrades", "Acc", "AvgWin", "AvgLoss", "Return")

I have tried using data.table package/function, but I cannot figuring out how to develop an approach using data.table without renaming values one by one; also, when pursuing this approach, my code gets very complicated.

rbatt
  • 4,677
  • 4
  • 23
  • 41
user3292755
  • 383
  • 2
  • 9
  • 25
  • Please show us what you've tried and post a [MVCE](http://stackoverflow.com/help/mcve). Also, [this post](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) might help you s well. – Richard Erickson Aug 12 '15 at 17:00
  • 2
    average, median and standard deviation of which columns? – Barranka Aug 12 '15 at 17:31
  • @Barranka I made some assumption about which columns, and submitted them as an edit. – rbatt Aug 12 '15 at 19:34

2 Answers2

3

Clever use of melt() and tapply() can help you. I made the following assumptions:

  1. You have to get the mean, median and average of the last three columns
  2. You need to group the data for each of the first ten columns (gama, theta, ..., negTrades)

For reproducibility, here's the input:

# Your example data
df <- data.frame(
  gama=c(10, 1, 1, 1, 1, 1, 10, 0.1, 10), 
  theta=c(1, 1, 1, 1, 0.65, 1, 0.65, 1, 1), 
  detectl=c(3, 5, 1, 1, 5, 3, 5, 5, 1), 
  NSMOOTH=c(10, 5, 20, 20, 5, 20, 10, 10, 40), 
  NREF=c(50, 80, 80, 50, 80, 50, 10, 100, 30), 
  NOBS=c(10, 40, 40, 20, 20, 20, 10, 40, 10), 
  sma=c(15, 15, 15, 15, 15, 15, 15, 15, 15), 
  lma=c(33, 33, 33, 33, 33, 33, 33, 33, 33), 
  PosTrades=c(11, 7, 6, 3, 9, 3, 6, 6, 5), 
  NegTrades=c(2, 2, 1, 0, 1, 0, 1, 5, 1), 
  Acc=c(0.846154, 0.777778, 0.857143, 1, 0.9, 1, 0.857143, 0.545455, 0.833333), 
  AvgWin=c(0.0451529, 0.0676022, 0.0673241, 0.13204, 0.0412913, 0.126522, 0.0630061, 0.0689745, 0.0748437), 
  AvgLoss=c(-0.0194498, -0.0083954, -0.0174653, NaN, -0.00264179, NaN, -0.0161558, -0.013903, -0.0278908), Return=c(1.54942, 1.54916, 1.44823, 1.44716, 1.42789, 1.42581, 1.40993, 1.38605, 1.38401)
)

And here's my proposed solution:

library(reshape)
md <- melt(df, id=colnames(df)[1:10]) # This will create one row for each
                                      # 'id' combination, and will store
                                      # the rest of the column headers
                                      # in the `variable` column, and 
                                      # each value corresponding to the
                                      # variable. Like this:
head(md)
##   gama theta detectl NSMOOTH NREF NOBS sma lma PosTrades NegTrades variable    value
## 1   10  1.00       3      10   50   10  15  33        11         2      Acc 0.846154
## 2    1  1.00       5       5   80   40  15  33         7         2      ## Acc 0.777778
## 3    1  1.00       1      20   80   40  15  33         6         1      ## Acc 0.857143
## 4    1  1.00       1      20   50   20  15  33         3         0      ## Acc 1.000000
## 5    1  0.65       5       5   80   20  15  33         9         1      ## Acc 0.900000
## 6    1  1.00       3      20   50   20  15  33         3         0      ## Acc 1.000000
results <- list() # Prepare the results list
for(i in unique(md$variable)) { # For each variable you have...
  results[[i]] <- list()        # ... create a new list to hold the 'summary'
  tmp_data <- subset(md, variable==i) # Filter the data you'll use
  for(j in colnames(tmp_data)[1:10]) { # For each variable, use tapply()
                                       # to get what you need, and
                                       # store it into a data frame
                                       # inside the results
    results[[i]][[j]] <- as.data.frame(
                               t(
                                 rbind(
                                   tapply(tmp_data$value, tmp_data[,j], mean), 
                                   tapply(tmp_data$value, tmp_data[,j], median), 
                                   tapply(tmp_data$value, tmp_data[,j], sd))
                                 )
                              )
    colnames(results[[i]][[j]]) <- c('average', 'median', 'sd')
  }
  rm(tmp_data) # You'll no longer need this
}

Now what? Check out the summary for results:

summary(results)
##         Length Class  Mode
## Acc     10     -none- list
## AvgWin  10     -none- list
## AvgLoss 10     -none- list
## Return  10     -none- list

You have a list for each variable. Now, if you check out the summary for any results "sublist", you'll see this:

summary(results$Acc)
##           Length Class      Mode
## gama      3      data.frame list
## theta     3      data.frame list
## detectl   3      data.frame list
## NSMOOTH   3      data.frame list
## NREF      3      data.frame list
## NOBS      3      data.frame list
## sma       3      data.frame list
## lma       3      data.frame list
## PosTrades 3      data.frame list
## NegTrades 3      data.frame list

See what happens when you peek into the results$Acc$gama list:

results$Acc$gama
##      average   median         sd
## 0.1 0.5454550 0.545455         NA
## 1   0.9069842 0.900000 0.09556548
## 10  0.8455433 0.846154 0.01191674

So, for each variable and each "id" column, you have the data summary you want.

Hope this helps.

Barranka
  • 20,547
  • 13
  • 65
  • 83
1

I have an approach involving data.table.

EDIT: I tried to submit an edit to the question, but I took some liberties so it'll probably get rejected. I made assumptions about which columns were to be used as "id" columns (columns whose values subset data), and which should be "measure" columns (columns whose values are used to calculate the summary statistics). See here for these designations:

ids <- c("gama", "theta","detectl", "NSMOOTH", "NREF", "NOBS", "sma", "lma")
vals <- c("PosTrades", "NegTrades", "Acc", "AvgWin", "AvgLoss", "Return")

Setup

# Convert to data.table
df <- data.table(df)

# Helper function to convert a string to a call
# useful in a data.table j
s2c <- function (x, type = "list"){
    as.call(lapply(c(type, x), as.symbol))
}

# Function to computer the desired summary stats
smry <- function(x) list(Average=mean(x, na.rm=T), Median=median(x, na.rm=T), StandardDeviation=sd(x, na.rm=T))

# Define some names to use later
ids <- c("gama", "theta","detectl", "NSMOOTH", "NREF", "NOBS", "sma", "lma")
vals <- c("PosTrades", "NegTrades", "Acc", "AvgWin", "AvgLoss", "Return")
usenames <- paste(rep(c("Average","Median","StdDev"),each=length(vals)), vals,sep="_")

Calculations in data.table

# Compute the summary statistics
df2 <- df[,j={
    for(i in 1:length(ids)){ # loop through each id
        t.id <- ids[i]
        t.out <- .SD[,j={
            t.vals <- .SD[,eval(s2c(vals))] # this line returns a data.table with each vals as a column
            sapply(t.vals, smry) # apply summary statistics
        },by=t.id] # this by= loops through each value of the current id (t.id)
        setnames(t.out, c("id.val", usenames)) # fix the names of the data.table to be returned for this i
        t.out <- cbind(id=t.id, t.out) # add a column indicating the variable name (t.id)
        if(i==1){big.out <- t.out}else{big.out <- rbind(big.out, t.out)} # accumulate the output data.table
    }
    big.out
}]

Formatting

df2 <- data.table:::melt.data.table(df2, id.vars=c("id","id.val")) # melt into "long" format

df2[,c("val","metric"):=list(gsub(".*_","",variable),gsub("_.*","",variable))] # splice names to create id's

df2[,variable:=NULL] # delete old column that had the names we just split up
df2 <- data.table:::dcast.data.table(df2, id+id.val+val~metric) # go a bit wider, so stats in diff columns
# reshape2:::acast(df2, id+id.val~metric~val) # maybe replace the above line with this

Result

        id id.val       val  Average     Median       StdDev
  1:  NOBS     10       Acc 3.214550 0.01191674  0.006052701
  2:  NOBS     10   AvgLoss 1.000000 0.06300610  1.409930000
  3:  NOBS     10    AvgWin 1.333333 0.06100090  1.447786667
  4:  NOBS     10 NegTrades 6.000000 0.84615400 -0.019449800
  5:  NOBS     10 PosTrades 7.333333 0.84554333 -0.021165467
 ---                                                        
128: theta      1   AvgLoss 1.000000 0.06897450  1.447160000
129: theta      1    AvgWin 1.571429 0.08320849  1.455691429
130: theta      1 NegTrades 6.000000 0.84615400 -0.017465300
131: theta      1 PosTrades 5.857143 0.83712329 -0.017420860
132: theta      1    Return 1.718249 0.03285638  0.068957635
rbatt
  • 4,677
  • 4
  • 23
  • 41
  • Thanks for the help, i have done the code for now, but your code is helpful because the ID helps the identification of specific values. Thanks again – user3292755 Aug 13 '15 at 04:16
  • @user3292755 I had to do some guessing related to what you wanted – I'm curious, did I do what you wanted? – rbatt Aug 13 '15 at 12:14
  • You're guessing is correct and i used your code to customize it again because in the next step i have to save this file into excels. Thank you so much for your effort – user3292755 Aug 13 '15 at 15:16