
I tried to find a built-in for geometric mean but couldn't.

(Obviously a built-in isn't going to save me any time while working in the shell, nor do I suspect there's any difference in accuracy; for scripts I try to use built-ins as often as possible, where the (cumulative) performance gain is often noticeable.)

In case there isn't one (which I doubt is the case), here's mine:

gm_mean = function(a){prod(a)^(1/length(a))}
smci
doug
  • Careful about negative numbers and overflows. prod(a) will under- or overflow very quickly. I tried to time this using a big list and quickly got Inf using your method vs 1.4 with exp(mean(log(x))); the rounding problem can be quite severe. – Tristan Apr 08 '10 at 22:12
  • I just wrote the function above quickly because I was sure that 5 min after posting this Q, someone would tell me R's built-in for gm. So no built-in; it's certainly worth taking the time to re-code in light of your remarks. +1 from me. – doug Apr 08 '10 at 23:12
  • I just tagged this [tag:geometric-mean] and [tag:built-in], 9 years later. – smci Jan 12 '19 at 15:55
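Tristan's overflow point is easy to reproduce; here is a small sketch (the values are arbitrary, chosen so the running product exceeds the ~1.8e308 limit of double precision):

```r
x <- rep(1e154, 3)       # product is 1e462, past the double-precision limit

prod(x)^(1 / length(x))  # overflows: prod(x) is Inf, so the naive formula returns Inf
exp(mean(log(x)))        # log-space arithmetic stays in range: 1e154
```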

9 Answers


No, but there are a few people who have written one, such as here.

Another possibility is to use this:

exp(mean(log(x)))
PatrickT
Mark Byers
  • Another advantage of using exp(mean(log(x))) is that you can work with long lists of large numbers, which is problematic when using the more obvious formula using prod(). Note that prod(a)^(1/length(a)) and exp(mean(log(a))) give the same answer. – lukeholman Feb 23 '15 at 04:45
  • the link has been fixed – PatrickT Sep 26 '18 at 18:11

Here is a vectorized, zero- and NA-tolerant function for calculating geometric mean in R. The verbose mean calculation involving length(x) is necessary for the cases where x contains non-positive values.

gm_mean = function(x, na.rm=TRUE){
  exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}

Thanks to @ben-bolker for noting the na.rm pass-through and @Gregor for making sure it works correctly.

I think some of the comments are related to a false-equivalency of NA values in the data and zeros. In the application I had in mind they are the same, but of course this is not generally true. Thus, if you want to include optional propagation of zeros, and treat the length(x) differently in the case of NA removal, the following is a slightly longer alternative to the function above.

gm_mean = function(x, na.rm=TRUE, zero.propagate = FALSE){
  if(any(x < 0, na.rm = TRUE)){
    return(NaN)
  }
  if(zero.propagate){
    if(any(x == 0, na.rm = TRUE)){
      return(0)
    }
    exp(mean(log(x), na.rm = na.rm))
  } else {
    exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
  }
}

Note that it also checks for any negative values and returns a more informative and appropriate NaN, reflecting that the geometric mean is not defined for negative values (but is for zeros). Thanks to commenters who stayed on my case about this.

Paul 'Joey' McMurdie
  • wouldn't it be better to pass `na.rm` through as an argument (i.e. let the user decide whether they want to be NA-tolerant or not, for consistency with other R summary functions)? I'm nervous about automatically excluding zeroes -- I would make that an option as well. – Ben Bolker Aug 28 '14 at 19:21
  • Perhaps you're right about passing `na.rm` as an option. I'll update my answer. As for excluding zeroes, the geometric mean is undefined for non-positive values, including zeroes. The above is a common fix for geometric mean, in which zeroes (or in this case all non-zeroes) are given a dummy value of 1, which has no effect on the product (or equivalently, zero in the logarithmic sum). – Paul 'Joey' McMurdie Aug 28 '14 at 20:01
  • *I meant a common fix for non-positive values, zero being the most common when geometric mean is being used. – Paul 'Joey' McMurdie Aug 28 '14 at 20:09
  • I thought the effect of a zero would be (as pointed out by @Alan-James-Salmoni below) to force the GM to zero, i.e. `result <- if(any(x==0)) 0 else exp(sum(...))` – Ben Bolker Aug 28 '14 at 20:16
  • Your `na.rm` pass-through doesn't work as coded... see `gm_mean(c(1:3, NA), na.rm = T)`. You need to remove the `& !is.na(x)` from the vector subset, and since the first arg of `sum` is `...`, you need to pass `na.rm = na.rm` by name, and you also need to exclude `0`'s and `NA`'s from the vector in the `length` call. – Gregor Thomas Aug 28 '14 at 20:53
  • Gregor, good catch on the `...` As for excluding `0`s and `NA`'s -- nope that is as I meant it. I explained above about non-positive values. There's a reason it is `sum(x_trim)/length(x)` and not simply `mean(x_trim)`; where `x_trim` is the non-positive-removed vector – Paul 'Joey' McMurdie Aug 28 '14 at 22:01
  • @PaulMcMurdie As is, `gm_mean(1:3)` gives 1.8, but `gm_mean(c(1:3, NA))` gives 1.6. The presence of `NA` has no effect on the sum, but it does have effect on the length. – Gregor Thomas Aug 29 '14 at 19:40
  • But even if that behavior is desired, because you always exclude `NA` from `x` *before* the sum, your `na.rm` does nothing. `gm_mean(c(1:3, NA))` gives the same result (1.565) for both `na.rm = T` and `na.rm = F`. – Gregor Thomas Aug 29 '14 at 19:43
  • @Gregor I see your point, just removed the `!is.na(x)` from the extractor expression. – Paul 'Joey' McMurdie Aug 29 '14 at 22:13
  • why are you excluding the negative numbers? I think you should return `NaN` if the number of non-NA elements is even and the product is negative, and return a number in all other cases (including all of `x`); also I think you should normalize by the number of non-NA `x`'s, just like regular `mean` does - as-is you're not *removing* NA, but instead substituting it with 1 – eddi Feb 23 '15 at 00:16
  • @eddi Per wikipedia: "The [geometric mean](http://en.wikipedia.org/wiki/Geometric_mean) only applies to positive numbers in order to avoid taking the root of a negative product, which would result in imaginary numbers, and also to satisfy certain properties about means [...] Note that the definition is unambiguous if one allows 0 (which yields a geometric mean of 0), but may be excluded, as one frequently wishes to take the logarithm of geometric means (to convert between multiplication and addition), and one cannot take the logarithm of 0." – Paul 'Joey' McMurdie Feb 26 '15 at 20:58
  • Won't this code always fail at `if(any(x < 0))` if the input `x` contains NAs ? – michael Mar 19 '15 at 06:55
  • Yep. I missed the edge case when there are no values less than zero AND also NAs. If there are values less than zero then the test works fine. E.g. `if( any(c(NA, 0, 1, 1) < 0) ){message("yay")}` returns instead: `Error in if (any(c(NA, 0, 1, 1) < 0)) { : missing value where TRUE/FALSE needed`. I've updated the example so that `any` as `na.rm = TRUE`. This should handle all but the corner case where you're trying to take the geometric mean of all `NA`s... I'm comfortable with that O:-) – Paul 'Joey' McMurdie Mar 23 '15 at 17:55
  • @Gregor it is incorrect to use NA in the length when na.rm=T, since by definition you are removing the missing value. If missing values are kept, then the gm_mean value shall be NA since you have no clue what the geomean is due to the missing value. – Chris Aug 26 '16 at 04:03
  • Beware: for `x` containing only zero(s), like `x <- 0`, `exp(sum(log(x[x>0]), na.rm = TRUE)/length(x))` gives `1` for the geometric mean, which doesn't make sense. – adatum Jan 21 '17 at 22:40
  • Assuming `na.rm = TRUE`, wouldn't it have to be something like `length(x[!is.na(x) & x > 0])`? – Nikos Bosse Apr 23 '21 at 07:44
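For illustration, here are a few hand-checkable calls showing how the two zero-handling branches of the answer's gm_mean differ (the function is repeated so the sketch runs standalone):

```r
# The answer's gm_mean, repeated so this sketch is self-contained
gm_mean <- function(x, na.rm = TRUE, zero.propagate = FALSE){
  if(any(x < 0, na.rm = TRUE)){
    return(NaN)
  }
  if(zero.propagate){
    if(any(x == 0, na.rm = TRUE)){
      return(0)
    }
    exp(mean(log(x), na.rm = na.rm))
  } else {
    exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
  }
}

gm_mean(c(2, 8))                            # 4
gm_mean(c(2, 8, 0))                         # ~2.52: zero dropped from the sum but kept in the denominator
gm_mean(c(2, 8, 0), zero.propagate = TRUE)  # 0: any zero forces the result to zero
gm_mean(c(2, 8, -1))                        # NaN: undefined for negative values
```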

We can use the psych package and call its geometric.mean function.
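A minimal sketch, assuming the psych package is installed (psych::geometric.mean also accepts a na.rm argument):

```r
# install.packages("psych")  # one-time install, if needed
library(psych)

geometric.mean(c(2, 8))  # 4, i.e. sqrt(2 * 8)
```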

zx8754
AliCivil

The

exp(mean(log(x)))

approach will work unless there is a 0 in x. If so, the log will produce -Inf (negative infinity), which always results in a geometric mean of 0.

One solution is to remove the -Inf value before calculating the mean:

geo_mean <- function(data) {
    log_data <- log(data)
    gm <- exp(mean(log_data[is.finite(log_data)]))
    return(gm)
}

You can use a one-liner to do this, but it means calculating the log twice, which is inefficient:

exp(mean(log(i[is.finite(log(i))])))
James Salmoni
  • why calculate the log twice when you can do: exp(mean(x[x!=0])) – zzk Jul 25 '14 at 20:54
  • both approaches get the mean wrong, because the denominator for the mean, `sum(x) / length(x)`, is wrong if you filter x and then pass it to `mean`. – Paul 'Joey' McMurdie Aug 28 '14 at 17:46
  • I think filtering is a bad idea unless you explicitly mean to do it (e.g. if I were writing a *general-purpose* function I would not make filtering the default) -- OK if this is a one-off piece of code and you've thought very carefully about what filtering zeroes out actually means in the context of your problem (!) – Ben Bolker Aug 28 '14 at 20:18
  • By definition a geometric mean of a set of numbers containing zero should be zero! http://math.stackexchange.com/a/91445/221143 – Chris Aug 26 '16 at 04:00

I use exactly what Mark says. This way, even with tapply, you can use the built-in mean function, no need to define your own! For example, to compute per-group geometric means of data$value:

exp(tapply(log(data$value), data$group, mean))
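For example, with a made-up data frame whose columns match the data$value and data$group names above:

```r
data <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(2, 8, 4, 4)
)

# One geometric mean per group: exp((log 2 + log 8)/2) = 4 for "a", 4 for "b"
exp(tapply(log(data$value), data$group, mean))
# a b
# 4 4
```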
Tomas

The EnvStats package has geoMean and geoSd functions.
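A minimal sketch, assuming EnvStats is installed (both functions also accept a na.rm argument):

```r
# install.packages("EnvStats")  # one-time install, if needed
library(EnvStats)

geoMean(c(2, 8))  # 4
geoSd(c(2, 8))    # geometric standard deviation
```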

zx8754
PrinzvonK

In case there are missing values in your data (which is not a rare case), you need to add one more argument.

You may try the following code:

exp(mean(log(i[ is.finite(log(i)) ]), na.rm = TRUE))
zx8754
Tian Yi

This version provides more options than the other answers.

  • It allows the user to distinguish between results that are not (real) numbers and those that are not available. If negative numbers are present, then the answer won't be a real number, so NaN is returned. If it's all NA values then the function will return NA_real_ instead to reflect that a real value is literally not available. This is a subtle difference, but one that might yield (slightly) more robust results.

  • The first optional parameter zero.rm is intended to allow the user to have zeros affect the output without making it zero. If zero.rm is set to FALSE and eta is set to NA_real_ (its default value), zeros have the effect of shrinking the result towards one. I don't have any theoretical justification for this - it just seems to make more sense to not ignore the zeros but to "do something" that doesn't involve automatically making the result zero.

  • eta is a way of handling zeros that was inspired by the following discussion: https://support.bioconductor.org/p/64014/

geomean <- function(x,
                    zero.rm = TRUE,
                    na.rm = TRUE,
                    nan.rm = TRUE,
                    eta = NA_real_) {
    nan.count <- sum(is.nan(x))
     na.count <- sum(is.na(x))
  value.count <- if(zero.rm) sum(x[!is.na(x)] > 0) else sum(!is.na(x))

  #Handle cases when there are negative values, all values are missing, or
  #missing values are not tolerated.
  if ((nan.count > 0 & !nan.rm) | any(x < 0, na.rm = TRUE)) {
    return(NaN)
  }
  if ((na.count > 0 & !na.rm) | value.count == 0) {
    return(NA_real_)
  }

  #Handle cases when non-missing values are either all positive or all zero.
  #In these cases the eta parameter is irrelevant and therefore ignored.
  if (all(x > 0, na.rm = TRUE)) {
    return(exp(mean(log(x), na.rm = TRUE)))
  }
  if (all(x == 0, na.rm = TRUE)) {
    return(0)
  }

  #All remaining cases are cases when there are a mix of positive and zero
  #values.
  #By default, we do not use an artificial constant or propagate zeros.
  if (is.na(eta)) {
    return(exp(sum(log(x[x > 0]), na.rm = TRUE) / value.count))
  }
  if (eta > 0) {
    return(exp(mean(log(x + eta), na.rm = TRUE)) - eta)
  }
  return(0) #only propagate zeroes when eta is set to 0 (or less than 0)
}
  • Can you add some details explaining how this differs from/improves on existing solutions? (I personally wouldn't want to add a heavy dependency like `dplyr` for such a utility unless necessary ...) – Ben Bolker Jan 08 '20 at 00:17
  • I agree, the `case_when`s were a little silly, so I removed them and the dependency in favor of `if`s. I also provided some elaboration. – Chris Coffee Jan 08 '20 at 21:41
  • I went with your latter idea and changed the default of `nan.rm` to `TRUE` to align all three `.rm` parameters. – Chris Coffee Jan 08 '20 at 22:19
  • One other stylistic nitpick. `ifelse` is designed for vectorization. With a single condition to check, it would be more idiomatic to use `value.count <- if(zero.rm) sum(x[!is.na(x)] > 0) else sum(!is.na(x))` – Gregor Thomas Jan 09 '20 at 04:23
  • It looks nicer than `ifelse`, too. Changed. Thanks! – Chris Coffee Jan 09 '20 at 22:01
exp(mean(log(x1))) == prod(x1)^(1/length(x1))
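The two sides of that identity agree only up to floating-point rounding, so an exact `==` test can return FALSE for some inputs; `all.equal` is the safer comparison:

```r
x1 <- c(2, 8, 32)  # product is 512, so the geometric mean is 512^(1/3) = 8

exp(mean(log(x1)))       # 8
prod(x1)^(1/length(x1))  # 8

# Compare with a tolerance rather than exact equality
isTRUE(all.equal(exp(mean(log(x1))), prod(x1)^(1/length(x1))))  # TRUE
```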
zx8754