
I am applying the Tukey outlier detection algorithm to a data set of prices.

I need it to be calculated by group (another variable in the same data set). This works perfectly well with the aggregate command until I need to calculate two means: one using only the data between the 5th percentile and the median, and one using only the data between the median and the 95th percentile.

As far as I know, the command is aggregate(doc$x, by=list(doc$group), FUN=mean, trim = 0.05) when the mean is trimmed symmetrically, dropping the upper and lower 5% (10% in total) of the data before computing the result. I don't know how to do the next steps, where I need to calculate the upper and lower means using the median as the dividing point while still excluding the upper and lower 5%. This is what I tried:

medlow <- aggregate(doc1$`rp`, by=list(doc1$`Código Artículo`), FUN=mean,trim =c(0.05,0.5))
medup <- aggregate(doc1$`rp`, by=list(doc1$`Código Artículo`), FUN=mean,trim =c(0.5,0.95))

medtrunc <- aggregate(doc1$`rp`, by=list(doc1$`Código Artículo`), FUN=mean,trim = 0.05)

I expect the output to be the number I need for each group, but instead I get:

Error in mean.default(X[[i]], ...) : 'trim' must be numeric of length one.

Oliver
MelaniaCB

1 Answer


First, I think you are using aggregate and trim the wrong way. The message 'trim' must be numeric of length one means that trim takes a single fraction, and that fraction of the observations is removed from each of the lower and upper tails of the distribution:

df = data.frame(
  gender = c(
    "male","male","male","male","female","female","female", "female"
    ),
  score = rnorm(8, 10, 2)
  )
aggregate(score ~ gender, data = df, mean, trim = 0.1)

  gender     score
1 female 11.385263
2   male  9.954465
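
To see what trim does on its own, here is a small sketch on a made-up vector (the values are illustrative and not taken from the question's data). mean() drops floor(n * trim) observations from each end of the sorted data before averaging:

```r
# Illustrative vector: nine ordinary values and one extreme value.
x <- c(1:9, 100)

plain   <- mean(x)              # uses all 10 values
trimmed <- mean(x, trim = 0.1)  # drops floor(10 * 0.1) = 1 value from each end

# The trimmed mean matches the mean of the sorted values without the
# smallest and the largest observation.
manual <- mean(sort(x)[2:9])

print(c(plain = plain, trimmed = trimmed, manual = manual))
```

With only 4 observations per group and trim = 0.1, as in the example above, floor(4 * 0.1) is 0, so nothing is actually removed.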

To split at the median and compute a trimmed mean for each half, you can add a new variable, MedianSplit, to your data frame with a simple for loop:

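med <- median(df$score)          # overall median, computed once
df$MedianSplit <- NA_character_  # character column to hold the labels
for (i in seq_len(nrow(df))) {
  if (df$score[i] <= med) {
    df$MedianSplit[i] <- "lower"
  } else {
    df$MedianSplit[i] <- "upper"
  }
}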

df



  gender     score MedianSplit
1   male  7.062605       lower
2   male  9.373052       upper
3   male  6.592681       lower
4   male  7.298971       lower
5 female  7.795813       lower
6 female  7.800914       upper
7 female 12.431028       upper
8 female 10.661753       upper
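
The loop above splits on the overall median. If the split should instead use each group's own median, as the question describes, one possible vectorized sketch uses ave() to compute a per-group median for every row. The data frame below is made up for illustration:

```r
# Made-up scores for two groups (illustrative only).
df <- data.frame(
  gender = rep(c("male", "female"), each = 4),
  score  = c(7, 9, 6, 8, 10, 12, 11, 13)
)

# ave() returns, for each row, the median of that row's group.
group_median <- ave(df$score, df$gender, FUN = median)

# Label each row relative to its own group's median.
df$MedianSplit <- ifelse(df$score <= group_median, "lower", "upper")

print(df)
```

This avoids the explicit loop and scales to many groups, since ave() handles the grouping.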

Then, use aggregate to compute the trimmed means:

For the data at or below the median (i.e., the quantile range [0, 0.5]):

aggregate(
  score ~ gender, 
  data = df[ which(df$MedianSplit == "lower"), ], 
  mean, trim = 0.05
)

  gender    score
1 female 7.795813
2   male 6.984752

and for the data above the median (i.e., the quantile range [0.5, 1]):

aggregate(
  score ~ gender, 
  data = df[ which(df$MedianSplit == "upper"), ], 
  mean, trim = 0.05
)

  gender     score
1 female 10.297898
2   male  9.373052
maaniB
  • To make it fit, imagine that you have males and females from different places, 270 in total. So you want to get all those numbers for each place and that's why I was trying to use 'aggregate' to help me simplify that coding. Plus at the end, the trim wouldn't match because I need the mean from 0.05 to the median (0.5), not 0.45 and likewise for the upper side. – MelaniaCB Aug 25 '19 at 14:24
  • @MelaniaCB please edit your question and provide a minimal example using `dput`. If you have two categorical variables or more, using `dplyr` and `group_by` followed by `mutate_at` will help you more than you can imagine. I think a `tidyverse` approach suits your condition – maaniB Aug 25 '19 at 14:36
  • I managed it by creating the function infmean <- function( x ){ sort(x) inf <- mean(quantile(x, 0.05):median(x)) return(inf) } and using it with aggregate: medinf <- aggregate(doc$`x`, by=list(doc$`Group`), FUN=infmean) – MelaniaCB Aug 28 '19 at 04:44
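
A note on the infmean function in the last comment: quantile(x, 0.05):median(x) builds a sequence in steps of 1 starting at the 5th percentile, rather than selecting the observed values in that range, and the result of sort(x) is discarded, so it may not return the intended mean. The following is a minimal sketch of one alternative, assuming the goal is the per-group mean of the observations between two quantiles with both boundaries included. The helper mean_between, its arguments p_low and p_high, and the data frame are hypothetical and only for illustration:

```r
# Hypothetical helper: mean of the values of x lying between two quantiles.
# Treating both boundaries as inclusive is an assumption.
mean_between <- function(x, p_low, p_high) {
  q <- quantile(x, probs = c(p_low, p_high), na.rm = TRUE, names = FALSE)
  mean(x[!is.na(x) & x >= q[1] & x <= q[2]])
}

# Made-up prices for two groups (illustrative only).
doc <- data.frame(
  group = rep(c("A", "B"), each = 21),
  x     = c(0:20, 100:120)
)

# Per-group mean from the 5th percentile to the median.
med_low <- aggregate(x ~ group, data = doc, FUN = mean_between,
                     p_low = 0.05, p_high = 0.50)

# Per-group mean from the median to the 95th percentile.
med_up <- aggregate(x ~ group, data = doc, FUN = mean_between,
                    p_low = 0.50, p_high = 0.95)

print(med_low)
print(med_up)
```

Because aggregate passes extra arguments on to FUN, the same helper covers both halves. Whether the median itself belongs to both halves or only one is a design choice that depends on the variant of the method being applied.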