R average over ages when some ages missing

Question

I have a data.table with columns for Age, food category, and the kcal consumed. I'm trying to get the average kcal for each category, but for some of the categories there is no consumption in that category. So I can't take a simple average, because there are zeroes that aren't in the data.table.

So for the example data:

dtp2 <- data.table(age = c(4,4,4,5,6,18), category = c("chips","vegetables","pizza","chips","pizza","beer"), kcal = c(100,5,100,120,100,150))

just doing dtp2[,mean(kcal),by=category] gives the wrong answer because only the 18 year olds are consuming beer, and the 4-17 year olds aren't.

The actual data set is 4:18 year olds with many, many categories. I've tried populating the datatable with zeroes for omitted ages with a nested for loop, which is very slow, then taking the means as above.

Is there a sensible R way of taking the mean kcal where missing values are assumed to be zero, without nested for loops putting in the zeroes?

Apologies stack overflowers. I've put 4 spaces at the start of the code segments, but it doesn't seem to have formatted it as code. Help on that would be appreciated. — jimr1603, Jun 23 '17 at 16:26
Possible duplicate of [Fastest way to replace NAs in a large data.table](https://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table) — C8H10N4O2, Jun 23 '17 at 16:29
No, because the NAs are not in the table. There are only records where there is actual consumption of that category. — jimr1603, Jun 23 '17 at 16:34

Mr. Bugle · Accepted Answer · 2017-06-23T16:38:49.333

1

I take it you want to include missing or 0 kcal values in the calculation. Instead of taking the average, you could just sum by category and divide by the total n for each category.

edited Jun 23 '17 at 16:38

answered Jun 23 '17 at 16:33

Mr. Bugle

325
2
12

You're absolutely right and I should have thought of that! – jimr1603 Jun 23 '17 at 17:27

score 0 · Answer 2 · answered Jun 25 '17 at 10:32

The suggestion by Mr. Bugle is rather generic and doesn't show any code. Picking this up, the code of the OP needs to be modified as follows:

library(data.table)
dtp2[, sum(kcal) / uniqueN(dtp2$category), by = category]

which returns

     category    V1
1:      chips 55.00
2: vegetables  1.25
3:      pizza 50.00
4:       beer 37.50

Note that uniqueN(dtp2$category) is used not just uniqueN(category) as this is always 1 when grouped by category.

However, there are situations where missing values are assumed to be zero, without nested for loops putting in the zeroes as the OP has asked.

One situation could arise when data is reshaped from long to wide format for presentation of the data:

reshape2::dcast(dtp2, age ~ category, fun = mean, value.var = "kcal", margins = TRUE)

    age beer chips pizza vegetables     (all)
1     4  NaN   100   100          5  68.33333
2     5  NaN   120   NaN        NaN 120.00000
3     6  NaN   NaN   100        NaN 100.00000
4    18  150   NaN   NaN        NaN 150.00000
5 (all)  150   110   100          5  95.83333

Here, the margin means are computed only from the available data which is not what the OP askd for. (Note that the parameter fill = 0 has no effect on the computation of the margins.)

So, the missing values need to be filled up before reshaping. In base R, expand.grid() can be used for this purpose, in data.table it's the cross join function CJ():

expanded <- dtp2[CJ(age, category, unique = TRUE), on = .(age = V1, category = V2)
                 ][is.na(kcal), kcal := 0][]
expanded

    age   category kcal
 1:   4       beer    0
 2:   4      chips  100
 3:   4      pizza  100
 4:   4 vegetables    5
 5:   5       beer    0
 6:   5      chips  120
 7:   5      pizza    0
 8:   5 vegetables    0
 9:   6       beer    0
10:   6      chips    0
11:   6      pizza  100
12:   6 vegetables    0
13:  18       beer  150
14:  18      chips    0
15:  18      pizza    0
16:  18 vegetables    0

Now, reshaping from long to wide returns the expected results:

reshape2::dcast(expanded, age ~ category, fun = mean, value.var = "kcal", margins = TRUE)

    age  beer chips pizza vegetables   (all)
1     4   0.0   100   100       5.00 51.2500
2     5   0.0   120     0       0.00 30.0000
3     6   0.0     0   100       0.00 25.0000
4    18 150.0     0     0       0.00 37.5000
5 (all)  37.5    55    50       1.25 35.9375

R average over ages when some ages missing

2 Answers2