Conditional calculation of means of different columns in data.table with R

Question

The question linked below discusses how to calculate the means and medians of vector t for each value of vector y (from 1 to 4) where x=1 and z=1, using the aggregate function in R.

Multiple aggregation in R with 4 parameters

But how can I calculate (mean(y)+mean(z))/(mean(z)-mean(t)) for each value (from 1 to 5) of vector x? The calculation should skip the values 0 and NA in any vector. For example, in vector y the 3rd value is 0, so the 3rd number in every vector (y, z, t) should not be used. And in the result, the third row (for x=3) should be NA.

Here is the code for calculating the means of y, z and t; it needs to be extended with the formula (mean(y)+mean(z))/(mean(z)-mean(t)):

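library(data.table)

data <- data.table(dataframe)   # dataframe holds the input data
bar <- data[, .N, by = x]       # number of rows for each value of x
foo <- data[, list(mean.y = mean(y, na.rm = TRUE),
                   mean.z = mean(z, na.rm = TRUE),
                   mean.t = mean(t, na.rm = TRUE)),
            by = x]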

This code uses all rows to calculate the means, but for calculating (mean(y)+mean(z))/(mean(z)-mean(t)), any row where y, z or t is zero or NA should not be used.

Does it need to be a `data.table` solution specifically? Also what do you mean by "in the result, the third row (for x=3) should be NA"? – Kara Woo, Jun 19 '14 at 02:09
@KaraWoo No, it is not necessary to use data.table. But I use a csv file for the data. So the result should be a vector with 5 values (one for each value of x). For example, for x=1 we have two rows of numbers, but for calculating (mean(y)+mean(z))/(mean(z)-mean(t)) for x=1 we should use only the first row of every vector, since in the second row y=0. For x=5 we have 4 rows of data, and one of them contains zeros. So, for calculating (mean(y)+mean(z))/(mean(z)-mean(t)) we should use only the last 3 rows of every vector. Thank you. – user3742663, Jun 19 '14 at 02:34

Arun · Accepted Answer · 2014-07-03T08:38:59.327

Update:

Oh, this can be simplified further: when the logical expression in i evaluates to NA, data.table treats it as FALSE and drops that row (designed with such cases in mind, similar to base::subset). So you just have to do:

dt[y != 0 & z != 0 & t != 0, 
    list(ans = (mean(y) + mean(z))/(mean(z) - mean(t))), by = x]
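
The comparison with base::subset can be tried without data.table. The following is only a base R sketch, assuming the sample data from Kara Woo's answer on this page; the names `clean` and `ratio_by_x` are made up for illustration.

```r
# Sample data copied from the other answer on this page
df <- read.table(text = "x y z t
1 1 1 10
1 0 1 15
2 NA 1 14
2 3 0 15
2 2 1 17
2 1 NA 19
3 4 2 18
3 0 2 NA
3 2 2 45
4 3 2 NA
4 1 3 59
5 0 3 0
5 4 3 45
5 4 4 74
5 1 4 86", header = TRUE)

# subset() keeps only rows where the condition is TRUE, so rows where
# it evaluates to NA (because y, z or t is NA) are dropped as well
clean <- subset(df, y != 0 & z != 0 & t != 0)

# Apply the formula within each value of x
ratio_by_x <- sapply(split(clean, clean$x), function(g) {
  (mean(g$y) + mean(g$z)) / (mean(g$z) - mean(g$t))
})
print(ratio_by_x)
```

The values should match the `ans` column shown further down in this answer.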

FWIW, here's how I'd do it in data.table:

dt[(y | NA) & (z | NA) & (t | NA), 
        list(ans=(mean(y)+mean(z))/(mean(z)-mean(t))), by=x]
#    x         ans
# 1: 1 -0.22222222
# 2: 2 -0.18750000
# 3: 3 -0.16949153
# 4: 4 -0.07142857
# 5: 5 -0.10309278

Let's break it down with the general syntax: dt[i, j, by]:

In i, we filter on your conditions using a nice little hack: TRUE | NA = TRUE, FALSE | NA = NA and NA | NA = NA (you can test these in your R session).

Since you say you need only the non-zero, non-NA values, it's just a matter of |ing each column with NA, which returns TRUE only for your condition: a zero becomes FALSE | NA = NA, an NA stays NA, and data.table drops the rows where i is NA. That settles the subset by condition part.
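
The three identities can be checked directly in base R. This is a small sketch; the vector `y` and the name `mask` are illustrative only.

```r
# Logical OR with NA: the result is known only when the other operand is TRUE
print(TRUE | NA)    # TRUE
print(FALSE | NA)   # NA
print(NA | NA)      # NA

# Numbers are coerced to logical first: 0 is FALSE, anything else is TRUE
y <- c(1, 0, NA, 3)
mask <- y | NA
print(mask)         # TRUE NA NA TRUE

# which() returns only the positions that are TRUE, skipping the NAs
print(which(mask))  # 1 4
```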

Then for each group in by, we aggregate according to your function, in j, to get the result.

HTH

Hi Arun. How can I do the same calculations by groups x, y and z, but with z split into subgroups: group I if 0 < z <= 2, group II if 2 < z <= 3 and group III if 3 < z <= 4? – user3742663, Jul 11 '14 at 18:48
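
One possible approach to this follow-up, sketched in base R: `cut` can turn z into the three bands, and the resulting factor can then serve as an extra grouping variable. The name `z_group`, the labels and the example values are assumptions based on the wording of the comment.

```r
# Example z values, including a zero and an NA
z <- c(1, 2, 2.5, 3, 4, 0, NA)

# cut() uses right-closed intervals by default: (0,2], (2,3], (3,4]
z_group <- cut(z, breaks = c(0, 2, 3, 4), labels = c("I", "II", "III"))
print(z_group)
```

Values outside the breaks, such as 0, come out as NA, so they can be filtered out together with the missing values.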

Kara Woo · Answer 2 · 2014-06-25T21:04:09.703

Here's one solution:

# create your sample data frame
df <- read.table(text = " x y z  t
 1 1 1 10
 1 0 1 15
 2 NA 1 14
 2 3 0 15
 2 2 1 17
 2 1 NA 19
 3 4 2 18
 3 0 2 NA
 3 2 2 45
 4 3 2 NA
 4 1 3 59
 5 0 3 0
 5 4 3 45
 5 4 4 74
 5 1 4 86", header = TRUE)

library('dplyr')

dfmeans <- df %>%
  filter(!is.na(y) & !is.na(z) & !is.na(t)) %>% # remove rows with NAs
  filter(y != 0 & z != 0 & t != 0) %>% # remove rows with zeroes
  group_by(x) %>%
  summarize(xmeans = (mean(y) + mean(z)) / (mean(z) - mean(t)))

I'm sure there is a simpler way to remove the rows with NAs and zeroes, but it's not coming to me. Anyway, dfmeans looks like this:
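
One way to collapse the two filtering steps, sketched in base R: build a single mask that flags every cell that is NA or zero, then keep the rows with no flagged cell. The names `cols`, `bad` and `keep` are illustrative, and the small data frame is a cut-down stand-in for the sample data.

```r
# A few rows in the shape of the sample data
df <- data.frame(x = c(1, 1, 2, 2),
                 y = c(1, 0, NA, 2),
                 z = c(1, 1, 1, 1),
                 t = c(10, 15, 14, 17))

cols <- c("y", "z", "t")

# TRUE where a cell is NA or zero; the is.na() term turns the NA
# produced by `== 0` into TRUE, so the mask itself contains no NA
bad <- is.na(df[cols]) | df[cols] == 0

# Keep rows in which no cell was flagged
keep <- rowSums(bad) == 0
print(df[keep, ])
```

This scales to more columns by extending `cols`. As far as I know, dplyr's `filter` also drops rows where the condition is NA, so the second `filter` line alone may already be enough.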

#   x      xmeans
# 1 1 -0.22222222
# 2 2 -0.18750000
# 3 3 -0.16949153
# 4 4 -0.07142857
# 5 5 -0.10309278

And if you just want the values from xmeans, use dfmeans$xmeans.

edited Jun 25 '14 at 21:04

answered Jun 19 '14 at 02:39


Thank you. But what if we have one more column W, which contains only numbers from 1 to 4, and we need to make the same calculations for W=1 only? – user3742663 Jun 19 '14 at 03:31
You can add another `filter` line: `filter(W == 1)` – Kara Woo Jun 19 '14 at 03:50
Thank you. This is an interesting solution; I have never worked with the `filter` function. – user3742663 Jun 19 '14 at 03:56
The whole `dplyr` package is incredibly useful, I highly recommend checking it out. If my solution worked for you, you can accept the answer. :) – Kara Woo Jun 19 '14 at 04:07
I have one last question, please. For example, we have vector x (from 1 to 5), y and w (from 1 to 4). How can I calculate the number of values in y for each x, for w=1? In the example above, if x=2, the number of values in y is 3 (don't count NA). – user3742663 Jun 19 '14 at 04:40
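
A possible base R sketch for this last question: keep the rows with w equal to 1, then count the non-NA values of y within each x. The `w` column here is hypothetical, since the sample data has no such column, and the names `sel` and `n_y` are illustrative.

```r
# The x = 2 rows of the sample data plus one extra row;
# the w values are invented for this example
df <- data.frame(x = c(2, 2, 2, 2, 3),
                 y = c(NA, 3, 2, 1, 4),
                 w = c(1, 1, 1, 1, 2))

# Keep only the rows where w is 1
sel <- df[df$w == 1, ]

# Summing a logical vector counts its TRUE entries, i.e. the non-NA y values
n_y <- tapply(!is.na(sel$y), sel$x, sum)
print(n_y)
```

For x=2 this gives 3, as in the example from the comment.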
