0

My goal is quite simple - taking a data set from a survey and analysing the fraction how often each potential answer was given by each target group of interest. My code works, but its very chunky and therefore error-prone. I'd like to get rid of both, but despite thorough research seem to be incapable of doing so.

The data looks something like this (note that Var* columns contain zeros, which are not of interest, and can either be binary answers (0 and 1 only) or have multiple answers (e.g. 0 to 4), which I need to take care of later):

head(my_data)
ID Gender AgeGroup Var1 Var2 Var3 Var4
 1      1        1    1    1    2    3
 2      1        2    0    0    1    2
 3      2        1    1    1    2    1
 4      1        2    1    1    1    2
 5      2        1    0    1    3    1
 6      1        2    0    1    2    1

My final output should ideally look something like this:

          TG1     TG2     TG3
Var11  60.49%  56.67%  64.17%
Var21  67.3%   56.67%  77.54%
Var31  40.87%  39.44%  42.25%
Var32  27.27%  55.56%  21.23%
Var33  31.86%  5.0%    36.52%

My current script:
I first create subsets of the data containing the target groups of interest and an empty data frame to hold the results later on:

TG1 <- subset (my_data, my_data$Gender == 1)
TG2 <- subset (my_data, my_data$Gender == 2)
TG3 <- subset (my_data, my_data$Var3 == 1 | my_data$Var3 == 2)

Results <- data.frame (TG1=numeric(0), TG2=numeric(0), TG3=numeric(0))

Now comes a massive loop:

rownames <- c() #Vector to hold the results temporarily
ColCounter <- 4 #Variable containing the column of the variable currently being calculated

while (ColCounter <= ncol(my_data)) {
  ColCat <- max(my_data[,ColCounter]) #what is the maximum value in the current column?
  Cat <- 1
  while (Cat <= ColCat) {
    t1 <- paste(round(sum(TG1[,ColCounter] == Cat)/nrow(TG1)*100, digits=2), "%", sep="")
    t2 <- paste(round(sum(TG2[,ColCounter] == Cat)/nrow(TG2)*100, digits=2), "%", sep="")
    t3 <- paste(round(sum(TG3[,ColCounter] == Cat)/nrow(TG3)*100, digits=2), "%", sep="")
    Results[nrow(Results)+1,] <- c(t1,t2,t3)
    rownames <- c(rownames, paste (strtrim(names(my_data[ColCounter]), 30), Cat, sep=""))
    Cat <- Cat + 1
    }
  ColCounter <- ColCounter + 1
  }
row.names(Results) <- make.names (rownames, unique=TRUE)

I feel that this should be much easier achieved by writing a function to do the calculation (and potentially another to get the maximum number of categories for each column) and using one of the apply functions to cycle through the various data frames containing the target groups (which are held in a list). Written in a very raw way:

TargetGroups <- lapply(ls(pattern = "TG[1-9]"), get)
names(TargetGroups) <- c("TG1", "TG2", "TG3")

Calc_Perc <- function (...) {
  ...
  }

Results <- lapply(TargetGroups, Calc_Perc)

However, so far all of my approaches have failed, despite reading up on masses of entries here and elsewhere on using apply on lists and dataframes. Is there a good way to achieve this?

  • What are `Var 11` through `Var 33`? – r2evans Jul 18 '18 at 16:46
  • Each Var# represents a question asked in the survey. The second number represents the 'answer option', so Var31 may correspond to "very likely", Var32 to "likely", Var33 to "not likely" and so forth. – Richard Wallace Jul 19 '18 at 14:13

1 Answers1

0

Here is a tidyverse approach. Your data is as above:

my_data <- read.table(text = "ID Gender AgeGroup Var1 Var2 Var3 Var4
 1      1        1    1    1    2    3
 2      1        2    0    0    1    2
 3      2        1    1    1    2    1
 4      1        2    1    1    1    2
 5      2        1    0    1    3    1
 6      1        2    0    1    2    1", header = TRUE)

Start out by putting your groups above into a list for ease of use:

groups_list <- list(
  TG1 = subset(my_data, my_data$Gender == 1),
  TG2 = subset(my_data, my_data$Gender == 2),
  TG3 = subset(my_data, my_data$Var3 == 1 | my_data$Var3 == 2)
)

Now use lapply to apply a function that first converts each subset data frame into long format, gets proportions of each answer per group, and outputs those proportions for each data frame in groups_list (more details in comments):

list_proportion_dfs <- lapply(names(groups_list), function(x) {
  ## Convert to long format
  long = groups_list[[x]] %>%
    gather(key = var, value = val, -c(ID, Gender, AgeGroup))

  proportions = long %>%
    ### Group by variable and value
    group_by(var, val) %>%
    ### Assign the length of each grouping
    ### to the new variable 'n'
    summarize(n = n()) %>%
    ### Convert to a proportion by dividing
    ### n by the sum of n for the current 
    ### 'var' grouping
    mutate(
      var.combo = paste(var, val, sep = ""),
      x = n / sum(n) * 100
    ) %>% 
    ungroup() %>%
    ### Keep only the important rows
    select(var.combo, x)
  names(proportions) <- c("var.combo", x)
  return(proportions)
})

Output looks like:

> list_proportion_dfs
[[1]]
# A tibble: 9 x 2
  var.combo   TG1
  <chr>     <dbl>
1 Var10        50
2 Var11        50
3 Var20        25
4 Var21        75
5 Var31        50
6 Var32        50
7 Var41        25
8 Var42        50
9 Var43        25

[[2]]
# A tibble: 6 x 2
  var.combo   TG2
  <chr>     <dbl>
1 Var10        50
2 Var11        50
...
...

Now you can use Reduce and merge (as per this answer) to get close to your desired outcome:

output <- Reduce(function(x, y) merge(x, y, all = TRUE), list_proportion_dfs)

Convert the NA values to zeroes:

output[is.na(output)] <- 0

And your (unformatted) outcome looks like:

> output
   var.combo TG1 TG2 TG3
1      Var10  50  50  40
2      Var11  50  50  60
3      Var20  25   0  20
4      Var21  75 100  80
5      Var31  50   0  40
6      Var32  50  50  60
7      Var33   0  50   0
8      Var41  25 100  40
9      Var42  50   0  40
10     Var43  25   0  20

I'm pretty sure the math is right, as all proportions for a given group/var combo add to 100 % (which is not the case for your example output). You may have to fiddle with the grouping order/levels if I've misunderstood what percentage you're actually looking for.

For output a little closer to what you showed (while avoiding multple "%" symbols, you could do:

rownames(output) <- output$var.combo

output <- select(output, -var.combo)

names(output) <- c(paste(names(output), "(%)"))

> output
      TG1 (%) TG2 (%) TG3 (%)
Var10      50      50      40
Var11      50      50      60
Var20      25       0      20
Var21      75     100      80
Var31      50       0      40
Var32      50      50      60
Var33       0      50       0
Var41      25     100      40
Var42      50       0      40
Var43      25       0      20

But I'd guess you're probably formatting in RMarkdown or Excel anyway.

Luke C
  • 10,081
  • 1
  • 14
  • 21