My goal is quite simple: take a data set from a survey and compute, for each target group of interest, the percentage of respondents who gave each possible answer. My code works, but it's very clunky and therefore error-prone. I'd like to fix both problems, but despite thorough research I haven't managed to.
The data looks something like this. Note that the Var* columns contain zeros, which are not of interest, and that each column can either be binary (0 and 1 only) or have several answer categories (e.g. 0 to 4), which I need to handle later:
head(my_data)
ID Gender AgeGroup Var1 Var2 Var3 Var4
 1      1        1    1    1    2    3
 2      1        2    0    0    1    2
 3      2        1    1    1    2    1
 4      1        2    1    1    1    2
 5      2        1    0    1    3    1
 6      1        2    0    1    2    1
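To make the example reproducible, the six rows shown above can be rebuilt as below. This covers only the sample rows, not the full survey data.

```r
# Rebuilds only the six sample rows shown above; the real data set has more rows.
my_data <- data.frame(
  ID       = 1:6,
  Gender   = c(1, 1, 2, 1, 2, 1),
  AgeGroup = c(1, 2, 1, 2, 1, 2),
  Var1     = c(1, 0, 1, 1, 0, 0),
  Var2     = c(1, 0, 1, 1, 1, 1),
  Var3     = c(2, 1, 2, 1, 3, 2),
  Var4     = c(3, 2, 1, 2, 1, 1)
)
```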
My final output should ideally look something like this:
         TG1     TG2     TG3
Var11 60.49%  56.67%  64.17%
Var21  67.3%  56.67%  77.54%
Var31 40.87%  39.44%  42.25%
Var32 27.27%  55.56%  21.23%
Var33 31.86%    5.0%  36.52%
My current script:
I first create subsets of the data containing the target groups of interest and an empty data frame to hold the results later on:
TG1 <- subset(my_data, Gender == 1)
TG2 <- subset(my_data, Gender == 2)
TG3 <- subset(my_data, Var3 %in% c(1, 2))
Results <- data.frame(TG1 = numeric(0), TG2 = numeric(0), TG3 = numeric(0))
Now comes a massive loop:
rownames <- c() # Vector collecting the row names of the results
ColCounter <- 4 # Index of the column currently being processed (the Var* columns start at 4)
while (ColCounter <= ncol(my_data)) {
  ColCat <- max(my_data[, ColCounter]) # Highest answer category in the current column
  Cat <- 1 # Start at 1 because zeros are not of interest
  while (Cat <= ColCat) {
    # Percentage of each target group that gave answer Cat
    t1 <- paste(round(sum(TG1[, ColCounter] == Cat) / nrow(TG1) * 100, digits = 2), "%", sep = "")
    t2 <- paste(round(sum(TG2[, ColCounter] == Cat) / nrow(TG2) * 100, digits = 2), "%", sep = "")
    t3 <- paste(round(sum(TG3[, ColCounter] == Cat) / nrow(TG3) * 100, digits = 2), "%", sep = "")
    Results[nrow(Results) + 1, ] <- c(t1, t2, t3)
    # Row name = column name (truncated to 30 characters) followed by the category
    rownames <- c(rownames, paste(strtrim(names(my_data[ColCounter]), 30), Cat, sep = ""))
    Cat <- Cat + 1
  }
  ColCounter <- ColCounter + 1
}
row.names(Results) <- make.names(rownames, unique = TRUE)
I feel that this could be done much more simply by writing a function to do the calculation (and potentially another to get the maximum number of categories for each column) and using one of the apply functions to cycle through the data frames containing the target groups (which are held in a list). Written in a very raw way:
TargetGroups <- mget(ls(pattern = "^TG[1-9]$")) # Named list of the TG* data frames
Calc_Perc <- function (...) {
...
}
Results <- lapply(TargetGroups, Calc_Perc)
However, all of my attempts so far have failed, despite reading many posts here and elsewhere on using the apply functions on lists and data frames. Is there a good way to achieve this?
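For concreteness, the structure described above could look roughly like the sketch below. It runs on the six sample rows only and has not been checked against the full survey data. The names `var_cols`, `max_cat` and `calc_perc` are made up for illustration, and it assumes that answer categories run from 1 up to each column's maximum in the full data set.

```r
# Six sample rows only (assumption: the real data has the same column layout).
my_data <- data.frame(
  ID       = 1:6,
  Gender   = c(1, 1, 2, 1, 2, 1),
  AgeGroup = c(1, 2, 1, 2, 1, 2),
  Var1     = c(1, 0, 1, 1, 0, 0),
  Var2     = c(1, 0, 1, 1, 1, 1),
  Var3     = c(2, 1, 2, 1, 3, 2),
  Var4     = c(3, 2, 1, 2, 1, 1)
)

# Target groups held in a named list of data frames.
TargetGroups <- list(
  TG1 = subset(my_data, Gender == 1),
  TG2 = subset(my_data, Gender == 2),
  TG3 = subset(my_data, Var3 %in% c(1, 2))
)

# Var* columns start at column 4; the category count comes from the full data,
# so every target group yields the same set of rows.
var_cols <- names(my_data)[4:ncol(my_data)]
max_cat  <- sapply(my_data[var_cols], max)

# Hypothetical helper: for one group, the percentage of rows giving each
# answer 1..max in every Var* column, named like "Var31", "Var32", ...
calc_perc <- function(group) {
  unlist(lapply(var_cols, function(v) {
    cats <- seq_len(max_cat[[v]])
    pct  <- sapply(cats, function(k) mean(group[[v]] == k) * 100)
    setNames(pct, paste0(v, cats))
  }))
}

# One column per target group, one row per variable/category combination.
Results <- sapply(TargetGroups, calc_perc)
print(round(Results, 2))
```

Keeping `Results` numeric until the very end means the "%" formatting can be applied as a final display step instead of inside the calculation.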