0

I want to calculate the mean of the absolute value of all numerical columns for the example dataset DT:

library(data.table)
set.seed(1)
DT <- data.table(panelID = sample(50,50),                                                    # Creates a panel ID
                      Country = c(rep("Albania",30),rep("Belarus",50), rep("Chilipepper",20)),       
                      some_NA = sample(0:5, 6),                                             
                      some_NA_factor = sample(0:5, 6),         
                      Group = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
                      Time = rep(seq(as.Date("2010-01-03"), length=20, by="1 month") - 1,5),
                      norm = round(runif(100)/10,2),
                      Income = round(rnorm(10,-5,5),2),
                      Happiness = sample(10,10),
                      Sex = round(rnorm(10,0.75,0.3),2),
                      Age = sample(100,100),
                      Educ = round(rnorm(10,0.75,0.3),2))           
DT [, uniqueID := .I]                                                                        # Creates a unique ID     
DT[DT == 0] <- NA                                                                            # https://stackoverflow.com/questions/11036989/replace-all-0-values-to-na
DT$some_NA_factor <- factor(DT$some_NA_factor)

I tried to calculate the means and the absolute means as follows:

mean_of_differences <- DT[,lapply(Filter(is.numeric,.SD),mean, na.rm=TRUE)]
mean_of_differences <- as.data.frame(t(mean_of_differences))
mean_of_differences <- round(mean_of_differences, digits=2)
mean_of_absolute_diff <- DT[,lapply(Filter(is.numeric,.SD),function(x) mean(abs(x),na.rm=TRUE))]
mean_of_absolute_diff <- as.data.frame(t(mean_of_absolute_diff))
mean_of_absolute_diff <- round(mean_of_differences, digits=2)

The mean of Income for the absolute differences is however negative (as it is for the normal mean), which obviously is not possible. If I look at my code I don't understand what I am doing wrong. What am I overlooking?

Tom
  • 2,173
  • 1
  • 17
  • 44
  • The absolute means are for me. I however didn't run the parts with `as.data.frame` although I don't think that would make a difference. – NelsonGon Aug 29 '19 at 08:21
  • That is very weird. I have had this problem with multiple datasets. Is there an alternative way to achieve this that I could try? – Tom Aug 29 '19 at 08:23
  • 1
    Try this with `dplyr`(with DT as a data.table object): `DT %>% summarise_if(is.numeric, function(x) mean(abs(x), na.rm=TRUE))` – NelsonGon Aug 29 '19 at 08:26
  • 1
    The `dplyr` code works. Extremely weird (and annoying) though. – Tom Aug 29 '19 at 08:36
  • The [tag:dplyr]-code of @NelsonGon gives the same as the [tag:data.table]-code of Tom for me. – Jaap Aug 29 '19 at 08:46
  • I reinstalled R, Rstudio, and all of the packages. I am still getting this outcome. Also the `dplyr` solution does not work on my actual dataset. Is there anything else you guys can think of? – Tom Aug 29 '19 at 10:53
  • I notice that the positive sign switches to negative after doing `mean_of_absolute_diff <- round(mean_of_differences, digits=2)`. – Tom Aug 29 '19 at 10:59
  • @Tom,if possible you could share a dput of your actual dataset. – NelsonGon Aug 29 '19 at 11:44
  • Copy and paste error in the last line :) – s_baldur Sep 04 '19 at 13:01

1 Answers1

0

Here is a solution using data.table. It (i) identifies numeric columns and (ii) obtains the mean of the absolute value of each numeric column.

Data

dt = data.table(
num1 = rnorm(100),
num2 = rnorm(100),
strv = sample(LETTERS, 100, replace = T)
)

Code

numcols = colnames(dt)[unlist(lapply(dt, is.numeric))] # Which columns are numeric?

# > numcols
# [1] "num1" "num2"

meandt = dt[, lapply(.SD, function(x) mean(abs(x))), .SDcols = numcols]
newcols = paste('mean_abs_', numcols, sep = ''); colnames(meandt) = newcols

# > meandt
#        mean_abs_num1 mean_abs_num2
# 1:     0.8287523     0.8325123
JDG
  • 1,342
  • 8
  • 18