I'm new to R and would very much appreciate if you could share your thoughts. Having found Applying calculation per groups within R dataframe, I decided to use data.table to calculate % of respondents who returned 'b' out of all respondents who returned either 'a' or 'b'.
My example data can be obtained from below (with some warning messages):
library(data.table)
df = data.table(rep(c("I", "II", "III", "IV")), rep(c("A", "B", "C")),
rep(c("a", "a", "b", "b", "b"), 10))
colnames(df) = c("grp1", "grp2", "Q1")
Below are my current codes to obtain a desired outcome;
#1. Obtain the number of rows on Q1 with 'a' and 'b' broken down by grp1 and grp2
ab = df[Q1 %in% c("a", "b"), .N, by = .(grp1, grp2)][order(grp1, grp2)]
#2. Obtain the equivalent number of rows on Q1 with 'b' only
b = df[Q1 %in% "b", .N, by = .(grp1, grp2)][order(grp1, grp2)]
# 3. Merge tables 'ab' and 'b' into 'a'
a = merge(b, ab, by = c("grp1", "grp2"))
# 4. Calculate %b in table a
a$agree = a[, 3]/a[, 4]*100
grp1 grp2 N.x N.y agree
1: I A 3 5 60
2: I B 2 4 50
3: I C 3 4 75
4: II A 2 4 50
5: II B 3 5 60
6: II C 2 4 50
7: III A 2 4 50
8: III B 2 4 50
9: III C 3 4 75
10: IV A 3 4 75
11: IV B 3 4 75
12: IV C 2 4 50
# 5. Remove unnecessary columns
a[, 3:4] = NULL
grp1 grp2 agree
1: I A 60
2: I B 50
3: I C 75
4: II A 50
5: II B 60
6: II C 50
7: III A 50
8: III B 50
9: III C 75
10: IV A 75
11: IV B 75
12: IV C 50
Above is the output I need. Now I have 50 questionnaire items (Q1 to Q50), and if I want to obtain the same outputs for those 50 questions, I would have to repeat the same process (1 to 5) 50 times, and this is not ideal.
I would like to find a way to avoid this repetitive process.
I'm sure there must be an elegant way to automate this process, but as a r-newbie, I'm yet to find out how. Rather than driven by my limited knowledge, I thought I should be guided by good practice you may have.