How to calculate population prevalence in R using data.table

Question

I hoping someone can point me in the right direction of how I can calculate a population prevalence proportion without having to do it in excel.

I´m currently working on a project that requires me to find the annual prevalence for eating fruit stratified by sex, agegr. To complicate matters unique id's can eat fruit from different kinds of fruitgroups but two of the four fruits belong to the same fruitgroup and therefor can only be counted once per id.

To further complicate the issue I need to report the prevalence by

total fruits per year per 1000 in the population,
per individual fruits per year per 1000,
per the fruitgroup that has the two fruit in,
by age
by sex

Here is my dataset - you might have figured out that due to restrictions on my database I had to change the actual values - special thanks to this reproduce code here and to this answer on how to make your data anonymous

    #### Data ####
require(data.table) 
anonDT <- data.table(structure(list(id = c("E1998", "E2308", "E1421", "E676", "E5061","E4225", "E2600", "E658", "E2331", "E982", "E4790", "E408", "E1048","E3937", "E4554", "E3357", "E2637", "E178", "E3734", "E1217","E3771", "E1954", "E4928", "E3566", "E1106", "E3835", "E1505","E668", "E4083", "E3066", "E3356", "E4910", "E2801", "E1074","E5097", "E610", "E995", "E1001", "E3824", "E3427", "E3885","E648", "E1986", "E4777", "E2546", "E909", "E1954", "E634", "E2602","E531", "E67", "E2418", "E3863", "E4266", "E196", "E657", "E1516","E4722", "E3077", "E3732", "E1556", "E112", "E924", "E2801","E2742", "E3362", "E1880", "E3645", "E3357", "E2519", "E2450","E5162", "E1483", "E3846", "E4539", "E2452", "E282", "E4604","E226", "E5043", "E3909", "E88", "E51", "E1925", "E2776", "E3835","E4746", "E1631", "E4052", "E1128", "E220", "E1390", "E4908","E1385", "E1003", "E5181", "E3835", "E4910", "E3240", "E4380","E3357", "E963", "E706", "E5142", "E2869", "E3839", "E5271","E2584", "E194", "E4366", "E2621", "E932", "E1104", "E1964","E928", "E4377", "E1418", "E2940", "E3420", "E3958", "E4130","E790", "E3667", "E934", "E3356", "E5203", "E3835", "E3356","E3297", "E5203", "E4380", "E668", "E2856", "E4502", "E1054","E3644", "E4641", "E5204", "E2597", "E4432", "E2716", "E2422","E1964", "E1327", "E2028", "E2727", "E1868", "E638", "E88", "E4892","E706", "E5147", "E3130", "E4099", "E4239", "E341", "E593", "E4746","E2291", "E2240", "E2481", "E3846", "E2602", "E1673", "E4772","E2140", "E5024", "E1137", "E2182", "E4366", "E2386", "E648","E3118", "E8", "E2813", "E3422", "E3982", "E2", "E2940", "E2035","E4746", "E5134", "E4380", "E4615", "E1372", "E2249", "E1954","E2418", "E1122", "E3485", "E934", "E3611", "E2665", "E2961","E2108", "E4432", "E2447", "E3413", "E531", "E1751"),
                                sex = structure(c(1L,2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L,2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L,1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L,1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L,1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L,1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L,1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L,2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L,1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L,1L, 2L, 1L, 2L, 1L, 1L, 1L), .Label = c("male", "female"), class = "factor"),group = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L,2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L,1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 3L, 1L,2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L,2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 3L,2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,2L, 2L, 2L, 3L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 1L, 3L, 2L, 1L, 2L, 2L,2L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 3L, 2L, 1L, 1L, 2L, 2L, 2L,2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 3L,2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L), .Label = c("fruitgr1","fruitgr2", "fruitgr3"), class = "factor"),
                                subgroup = structure(c(4L,3L, 3L, 3L, 3L, 4L, 4L, 4L, 2L, 4L, 3L, 3L, 4L, 3L, 3L, 3L,4L, 1L, 3L, 4L, 3L, 1L, 3L, 3L, 1L, 1L, 1L, 4L, 3L, 4L, 4L,3L, 4L, 1L, 4L, 3L, 3L, 4L, 2L, 1L, 4L, 3L, 3L, 4L, 3L, 3L,1L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 3L, 3L, 3L,4L, 3L, 4L, 4L, 3L, 4L, 3L, 3L, 4L, 3L, 1L, 3L, 1L, 4L, 2L,1L, 3L, 1L, 4L, 4L, 4L, 4L, 4L, 1L, 4L, 4L, 3L, 3L, 4L, 3L,4L, 3L, 4L, 3L, 2L, 1L, 3L, 4L, 2L, 3L, 4L, 3L, 4L, 3L, 4L,1L, 3L, 4L, 4L, 4L, 4L, 3L, 1L, 4L, 3L, 3L, 3L, 2L, 2L, 1L,4L, 4L, 3L, 4L, 4L, 4L, 4L, 3L, 4L, 3L, 3L, 2L, 3L, 2L, 4L,2L, 3L, 4L, 1L, 2L, 4L, 1L, 4L, 3L, 4L, 3L, 4L, 4L, 3L, 4L,2L, 1L, 2L, 3L, 1L, 1L, 4L, 3L, 3L, 4L, 1L, 3L, 4L, 4L, 4L,1L, 3L, 3L, 4L, 4L, 3L, 3L, 3L, 4L, 4L, 3L, 1L, 3L, 4L, 4L,4L, 3L, 4L, 4L, 1L, 1L, 4L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 1L,4L, 1L, 4L, 4L), .Label = c("apple", "orange", "banana","kiwi"), class = "factor"),
                                agegr = structure(c(2L, 3L, 2L,2L, 3L, 2L, 2L, 2L, 2L, 3L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 3L,1L, 2L, 2L, 3L, 2L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 1L, 3L, 3L, 3L, 2L,2L, 1L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,3L, 2L, 2L, 3L, 2L, 2L, 2L, 3L, 3L, 1L, 3L, 3L, 1L, 3L, 2L,3L, 3L, 2L, 1L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L,2L, 2L, 2L, 3L, 2L, 2L, 3L, 2L, 1L, 2L, 3L, 2L, 1L, 3L, 1L,1L, 2L, 1L, 2L, 2L, 3L, 2L, 1L, 2L, 1L, 2L, 2L, 3L, 2L, 2L,1L, 2L, 2L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 2L, 2L,2L, 2L, 1L, 2L, 3L, 3L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L,2L, 2L, 3L, 3L, 3L, 2L, 1L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 2L,2L, 2L, 3L, 2L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 3L,3L, 2L, 3L, 3L, 3L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,1L, 2L), .Label = c("19-24", "25-49", "50+"), class = "factor"),year = c(2004, 2007, 2008, 2006, 2008, 2007, 2008, 2007,2007, 2007, 2005, 2005, 2006, 2006, 2006, 2006, 2006, 2004,2008, 2006, 2006, 2007, 2006, 2006, 2005, 2006, 2005, 2005,2006, 2007, 2006, 2008, 2008, 2006, 2004, 2004, 2007, 2005,2008, 2008, 2005, 2007, 2008, 2008, 2008, 2008, 2005, 2008,2008, 2005, 2005, 2006, 2007, 2007, 2006, 2006, 2007, 2007,2008, 2008, 2005, 2007, 2007, 2005, 2008, 2007, 2004, 2008,2007, 2008, 2005, 2005, 2008, 2005, 2007, 2008, 2008, 2005,2004, 2008, 2004, 2007, 2005, 2008, 2004, 2008, 2008, 2006,2008, 2006, 2007, 2008, 2005, 2007, 2007, 2007, 2006, 2007,2007, 2008, 2006, 2005, 2008, 2004, 2008, 2008, 2006, 2005,2004, 2007, 2006, 2004, 2005, 2004, 2006, 2005, 2008, 2004,2007, 2004, 2006, 2008, 2006, 2007, 2008, 2005, 2007, 2007,2006, 2005, 2008, 2004, 2008, 2007, 2008, 2004, 2008, 2007,2007, 2004, 2004, 2008, 2004, 2007, 2008, 2005, 2007, 2005,2008, 2006, 2006, 2007, 2005, 2005, 2006, 2006, 2005, 2007,2006, 2008, 2005, 2006, 2008, 2007, 2005, 2006, 2006, 2007,2007, 2006, 2008, 2007, 2007, 2008, 2005, 2008, 2007, 2004,2005, 2006, 2007, 2006, 2008, 2006, 2008, 2004, 2006, 2008,2007, 2004, 2008, 2008, 2004, 2008, 2007, 2006, 2005, 2006,2004, 2004)), .Names = c("id", "sex", "group", "subgroup","agegr", "year"), class = c("data.table", "data.frame"), row.names = c(NA,-200L)))

And the population data is here:

#### Population data ####
populDT <- data.table(structure(list(year = 2004:2008, total = c(210696L, 216192L,223472L, 230629L, 233625L), men = c(104770L, 108390L, 113597L,117629L, 118275L), women = c(105926L, 107802L, 109875L, 113000L,115350L), agegrp1 = c(25721L, 25558L, 25933L, 27457L, 28083L), agegrp2 = c(104933L, 107935L, 111796L, 114852L, 115102L),agegrp3 = c(80042L, 82699L, 85743L, 88320L, 90440L)), .Names = c("year","total", "men", "women", "agegrp1", "agegrp2", "agegrp3"), sorted = "year", class = c("data.table","data.frame"), row.names = c(NA, -5L), key='year'))

I have managed to get all the relevant prevalence counts into a single data.table

allcount <- anonDT[,.N, keyby=list(year,agegr,sex,group,subgroup,id)][,.N,by=list(year,agegr,sex,group,subgroup)]
allcount

as well as making several subsets of data.tables including the counts I need.

As for my questions,

is there a simple way of calculating the prevalence proportion (formula is 1000*those who eat fruit/total population) inside the data.table for all the relevant stratifications and groups.
Would I need to merge the populDT with the allcount?
If so what would be the best way forward
And as a bonus conundrum what would be the best way to create publication table/graph from the data.table?

Thank you all in advance... as this is my first time posting I hope I've done everything correctly :)

Here is a data.table of the almost desired outcome. I only managed to get it in this state by using data.frame not table. To be more precise I created count variables for all and then used aggregate to get it into shape

### first using max ### 
prevAN <- aggregate(anon[,7:17], 
              by = list(anon$year,anon$id,anon$group, anon$subgroup), 
              FUN = max) 
### and then by sum ###
prevAN1 <- aggregate(prevAN[,5:15], 
               by = list(prevAN$year), 
               FUN = sum)


### the count dataset ###
DT2 <- data.table(structure(list(year = c(2004, 2005, 2006, 2007, 2008), count = c(865,1095, 1355, 1602, 1749), men = c(470, 616, 748, 863, 946), women = c(395,479, 607, 739, 803), agegr1 = c(141, 220, 272, 316, 385), agegr2 = c(552,657, 826, 1001, 1040), agegr3 = c(172, 218, 257, 285, 324), c_fruitgr2 = c(703,910, 1130, 1304, 1451), c_banana = c(153, 397, 618, 798, 950),c_kiwi = c(550, 513, 512, 506, 501), c_apple = c(121, 114,97, 110, 112), c_orange = c(41, 71, 128, 188, 186), total = c(210696L,216192L, 223472L, 230629L, 233625L), men.1 = c(104770L, 108390L,113597L, 117629L, 118275L),
                             women.1 = c(105926L, 107802L,109875L, 113000L, 115350L), agegrp1 = c(25721L, 25558L, 25933L,27457L, 28083L), agegrp2 = c(104933L, 107935L, 111796L, 114852L,115102L), agegrp3 = c(80042L, 82699L, 85743L, 88320L, 90440L)), .Names = c("year", "count", "men", "women", "agegr1","agegr2", "agegr3", "c_fruitgr2", "c_banana", "c_kiwi", "c_apple","c_orange", "total", "men.1", "women.1", "agegrp1", "agegrp2","agegrp3"), sorted = "year", class = c("data.table", "data.frame"), row.names = c(NA, -5L), key='year'))

When I get to this stage I need to calculate the prevalence from the figures in each year. I would do this using a data.frame by simply doing the following:

total.prev <- DF2$count*1000/DF2$total

Back to the questions at hand, firstly I love how fast data.table works and even though I am able to roughly get where I need to go with my data using aggregate I would like to learn how I can do this in data.table as I have a feeling that it is quicker and more productive.

Secondly there is the issue of the calculation, calculating within a data.table or between data.tables.

And thirdly there is the issue of graphs and tables for publication but that might be a second post.

UPDATE I have figured out a way to calculate the prevelance within the data.table by doing the following

DT2[,':='(p1= count*1000/total,
        pM= men*1000/men.1,
        pW= women*1000/women.1,
        pA1= agegr1*1000/agegrp1,
        pA2= agegr2*1000/agegrp2,
        pA3= agegr3*1000/agegrp3,
        pFgr2= c_fruitgr2*1000/total,
        pB= c_banana*1000/total,
        pK= c_kiwi*1000/total,
        pA= c_apple*1000/total,
        pO= c_orange*1000/total),]

DT2<-round(DT2, 2)  # to round the prevalence numbers

Again if someone has anything simpler to offer :)

separately, I recommend creating example data that's simple enough that you can show the desired outcome - personally, I don't understand what exactly you want — eddi, Mar 07 '14 at 23:36
Dataset should be working now. As for creating a simpler example data, I'll get on that now. — DrifaP, Mar 07 '14 at 23:46
@eddi sorry it took so long but I´ve managed to create a clearer example - I hope. — DrifaP, Mar 08 '14 at 13:30
The data in the question is still very complex, and you're asking four questions. Please simplify it (you can generally communicate what you're trying to do with built-in datasets) and split this out into separate SO questions. — Max Ghenis, Jun 23 '14 at 02:48

How to calculate population prevalence in R using data.table

0 Answers0