Grouping, sorting, and counting large amounts of data efficiently in R?

Question

I'm an R novice currently working with a pretty large national health survey database of 17925 observations in R with around 350 columns. This database contains information of 32 states, and contains information on the respondent's gender.

As usual, both gender and state are coded numerically, the former being typed down as 0s and 1s, and the latter ranging from 1 through 32. Now, I'm used to working with large numbers, however I'm having difficulty with sorting through respondent's answers while taking into account their gender and state in a tidy manner.

For example, let's say I'm currently working with the the next 3 columns in my database

head(data)
      ENT SEX var
  1  15    1    1
  2   8    1    4
  3   2    1    6
  4   3    2    9
  5   6    1    3
  6   15   2    12

And I need to know: a) The total amount of people who answered "x" in ENT b) The total amount of people who both answered "x" in ENT, "y" in SEX, and 1 through 12 in var

I need to replicate this for every state and every sex.

I've done this for only one state (ENT = "1") but in a very roundabout and untidy way.For counting the amount of people in State 1 who answered 1 through 12 I did the following.

data_1 <- data %>% filter(data$var > 0 & data$var < 13 & ENT == 1) %>% 
summarise(counted = n())

Afterwards, I count the number of people who answered from 1 through 12 using the same lines of code, except I simply add the condition of SEX == 1 or 2 and then add it to a data frame.

As readers might glean from this, it adds up to be a quite unclean and messy code, but I'm struggling to come up with a shorter and more efficient syntax. As of now, just calculating two states takes me around 120 lines of code. I'm certain there must be a way of doing this much more efficiently, but I can't see it.

Ideally, I would group all observations that range from var = 1-12 for the different instances of ENT into something that could look like this (this is just an example):

ENT  Observations1-12  TotalObservationsofvar  Womenwansw1-12  Menansw1-12 TotalMen TotalWom    
 1            4555            9000                    533           133      1000      3333
 2            3222            9000                    233           455      9888      1111
 3            5678            9000                    544           544      1444      7794
 4            2367            9000                    322           222      5555      5555   
 5            7864            9000                    122           677      6666      1111

I'm stumped and I would greatly appreciate some pointers, as it would help me infinitely in working this problem and many more future ones out.

(1) (Almost) "Never" use `data$` in a dplyr pipe based off of `data`. There are few exceptions, this is not one of them. (It's not hurting you _here_, _yet_, but I urge you to drop that habit.) (2) This sounds like tabulation, which in dplyr is easily handled with calls to `count` with the columns of interest. See https://stackoverflow.com/q/11562656/3358272. — r2evans, Mar 16 '23 at 19:26
(a) `count(data, ENT)` will return counts for all possible values of `ENT`, you can follow-up with `%>% filter(ENT == "x")`; in base R, you can do `table(data$ENT)["x"]` for a simple number reporting; (b) `count(data, ENT, SEX, var) %>% filter(ENT == "x", SEX %in% 1:2, between(var, 1, 12))` should produce a good start. It's going to be difficult to combine those two types of comparison into one table, since they have different numbers of columns. There are some ways to work around it, but none of them clearly "awesome" in my book. — r2evans, Mar 16 '23 at 20:49

Grouping, sorting, and counting large amounts of data efficiently in R?

0 Answers0