4

I have a data table where I want to calculate the geometric mean for each row over several columns. Some of the values will have zeros so I need these to be excluded.

The geometric mean from Wiki is: "The geometric mean is defined as the nth root of the product of n numbers" so for 2 numbers its simply the square root of their product.

The nth root in my case will vary on each row depending on how many of the values are non-zeros in them.

In my example below the top 2 rows of the results column was worked out as follows:

1: (a * c)^(1/2)

2: (a * b * c)^(1/3)

so I need the formula to look at columns a:c, take the product of the non-zero values and then take the nth root of how many non-zero values there were.

library(data.table)
dt <- data.table(a = c(0.5, 0.3,0,0.6), b = c(0,0.4,0.1,0), 
c = c(0.9,0.5,0.1,0), Result = c(0.67, 0.39, 0.1, 0.6))
Vertexwahn
  • 7,709
  • 6
  • 64
  • 90
MidnightDataGeek
  • 938
  • 12
  • 21

4 Answers4

4

We can try using data.table methods

dt[, v1 := Reduce(`+`, lapply(.SD, function(x) x!=0)), .SDcols = 1:3]
dt[, result2 := round((Reduce(`*`, lapply(.SD, function(x) 
    replace(x, x==0, 1))))^(1/v1), 2), .SDcols = 1:3][, v1 := NULL][]
#    a   b   c Result result2
#1: 0.5 0.0 0.9   0.67    0.67
#2: 0.3 0.4 0.5   0.39    0.39
#3: 0.0 0.1 0.1   0.10    0.10
#4: 0.6 0.0 0.0   0.60    0.60

Or another less efficient option is to group by sequence of rows and then do it on each row

dt[, result2 := {
           u1 <- unlist(.SD)
           round(prod(u1[u1!=0])^(1/sum(u1!=0)), 2)} , 1:nrow(dt), .SDcols = 1:3]
dt
#     a   b   c Result result2
#1: 0.5 0.0 0.9   0.67    0.67
#2: 0.3 0.4 0.5   0.39    0.39
#3: 0.0 0.1 0.1   0.10    0.10
#4: 0.6 0.0 0.0   0.60    0.60

NOTE: Both of these are data.table methods.

Or another option contributed by @DavidArenburg

dt[, Result := round(Reduce(`*`, replace(.SD, .SD == 0, 1))^(1/rowSums(.SD != 0)), 2)]

Another vectorized option is to convert to matrix

library(matrixStats)
m1 <- as.matrix(setDF(dt)[1:3])
round(rowProds(replace(m1, !m1, 1))^(1/rowSums(m1!=0)), 2)
#[1] 0.67 0.39 0.10 0.60
akrun
  • 874,273
  • 37
  • 540
  • 662
  • the code block `dt[, result2....' returns an error...Error in eval(expr, envir, enclos) : object 'v1' not found – MidnightDataGeek Jan 19 '17 at 12:35
  • @MidnightDataGeek Which version of `data.table` you have? I am using `1.10.0` – akrun Jan 19 '17 at 12:35
  • I'm using 1.9.6. I've modified the answer below, is it still a 'data.table' method strictly speaking? Speed is key for me so I try and use DT for everything, – MidnightDataGeek Jan 19 '17 at 12:43
  • @MidnightDataGeek Okay, that is the reason, WIth 1.10.0, the other columns can also be accessed when using `.SDcols`. Yes, the first two are data.table methods. But, if you are using `apply`, it could convert to `matrix` and there is no gain in using `data.table` – akrun Jan 19 '17 at 12:44
  • 2
    Maybe a bit simpler ```dt[, Result := round(Reduce(`*`, replace(.SD, .SD == 0, 1))^(1/rowSums(.SD != 0)), 2)]``` – David Arenburg Jan 19 '17 at 12:51
  • @DavidArenburg That is nice, but are we allowed to `==` in `.SD` and `rowSums` with `data.table` – akrun Jan 19 '17 at 12:52
  • It's less recommended than `lapply` but your solution is pretty unreadable – David Arenburg Jan 19 '17 at 12:53
  • @DavidArenburg Thanks. Your solution is quick and easy to read. Amazing how you guys can come up with solutions so fast! – MidnightDataGeek Jan 19 '17 at 13:06
1

This will also work, assuming all non-negative values.

dt$Result <- apply(dt, 1, function(x) (prod(x[x!=0]))^(1/sum(x!=0)))
dt
#     a   b   c    Result
#1: 0.5 0.0 0.9 0.6708204
#2: 0.3 0.4 0.5 0.3914868
#3: 0.0 0.1 0.1 0.1000000
#4: 0.6 0.0 0.0 0.6000000
Sandipan Dey
  • 21,482
  • 2
  • 51
  • 63
  • 1
    Thanks. I have modified it for my needs... `dt[, Result := apply(.SD, 1, function(x) (prod(x[x != 0])) ^ (1/sum(x!=0))), .SDcols = 1:3]` – MidnightDataGeek Jan 19 '17 at 12:36
1

prod(a)^(1/length(a)) gives geometric mean for vector a

Ajay Ohri
  • 3,382
  • 3
  • 30
  • 60
  • As stated here, the equation for geomean above that is being up voted is wrong! It is to the power 1/n. No “sum” should be there. I’m surprised no one previously upvoted this comment! – RichardBJ Jul 10 '21 at 16:12
-1

Other option:

m1 <- as.matrix(setDF(dt)[1:3])
exp(rowMeans(log(m1)))
Tom Wenseleers
  • 7,535
  • 7
  • 63
  • 103