0

I have created a survey form and I collect the results in R. There are many questions to the survey and they are grouped into sections. For instance the first three questions are under the category: Behaviors. The user is able to go back into the survey after their initial submission to update their answers (all questions are True/False). I am struggling to develop a cumulative score over time given that users can go back in and update their answers. Below is an example of the survey data results for the Behaviors category:

SurveyData:

SurveyData

I would like to develop a score that will take the most recent survey result for a user, but I need to keep the previous submission data to show progress. For example, this is the result I would like:

Result:

Result

The formula for the Score column would be (#Trues for that day/(number_of_submissions for that day/number of questions for that category). The number of questions for the Behaviors category is 3. However, I am struggling how to calculate the score when users re-submit the survey with new answers. Any idea how I can do this in R?

This is my current approach, which calculates the cumulative score, but does not take into account if a user submitted a survey more than once, as it does not override and take the most recent survey result.

Reproducible Example:

user <- c('User1', 'User1', 'User1', 'User2', 'User3', 'User3')
submission_date <- as.Date(c('2016-09-04', '2016-09-15', '2016-09-29', '2016-09-04', '2016-09-05', '2016-09-20'))
Trues_Behavior_Category <- c(1,2,3,3,2,3)
survey_data <- data.frame(user, submission_date, Trues_Behavior_Category)
checklist_data_weekly <- aggregate(survey_data[c(3)], list(Sub_Date = as.Date(survey_data$submission_date)), sum)
checklist_data_cum <- cbind(checklist_data_weekly[c(1)], cumsum(checklist_data_weekly[c(2)]))
sub_count <- aggregate(survey_data[c(1)], list(Sub_Date = as.Date(survey_data$submission_date)), length)
sub_count <- cbind(sub_count[c(1)], Cum_Dates=cumsum(sub_count$user))
checklist_data_cum$Behaviors_Score <- (checklist_data_cum$Trues_Behavior_Category/(sub_count$Cum_Dates*3)*100)
CPak
  • 13,260
  • 3
  • 30
  • 48
John
  • 315
  • 1
  • 2
  • 12
  • 2
    Please provide a reproducible example. See this post to learn how to do it: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – www Sep 04 '17 at 17:44
  • I just added in an example. Apologies, I am relatively new to SO submission guidelines. – John Sep 04 '17 at 18:26
  • At least for me, a little more clarity is needed on what you are trying to accomplish. First, are you wanting to keep track of scores by user? Because your example result does not reflect that. Also, the example result does not match the formula you gave for the score column, at least in part because I think the operation within the inner parentheses was meant to be multiplication rather than division, but even that still wouldn't match for all values you say you want the result to be. – duckmayr Sep 04 '17 at 19:59
  • I just want to calculate the total score at a weekly interval, not per user. However, in order to do that I need to grab the most recent survey data for a particular user in the event they submitted the survey more than once. The following shows how I'd like the calculation logic to work: 09-04: (1+3)/(2submissions*3questions) | 09-05: (1+3+2)/(3*3) | 09-15: first instance in which a date occurs where someone submitted a survey more than once (take survey result from 09-15) (2+3+2)/(3*3) | 09-20: Again, someone revised an existing survey (2+3+3)/(3*3) | Again, 09-29: (3+3+3)/(3*3). Thank you! – John Sep 04 '17 at 20:57

1 Answers1

0

One way to do it is to create a function that gets the number of users at date x, a function that gets each user's latest response at date x, and then a function that applies those functions across the users and dates in your data:

user <- c('User1', 'User1', 'User1', 'User2', 'User3', 'User3')
submission_date <- as.Date(c('2016-09-04', '2016-09-15', '2016-09-29',
                             '2016-09-04', '2016-09-05', '2016-09-20'))
Trues_Behavior_Category <- c(1, 2, 3, 3, 2, 3)
survey_data <- data.frame(user, submission_date, Trues_Behavior_Category,
                          stringsAsFactors=FALSE)

get_last_submission <- function(user, df, date, col_name){
    df <- df[df$user == user & df$submission_date <= date, ]
    return(df[order(df$submission_date, decreasing=TRUE)[1], col_name])
}

get_current_user_n <- function(df, date){
    return(length(unique(df$user[df$submission_date <= date])))
}

get_score <- function(df, col_name, num_questions){
    users <- unique(df$user)
    return(sapply(sort(unique(df$submission_date)), function(x){
        (sum(as.numeric(sapply(users, get_last_submission, df, x, col_name)),
             na.rm=TRUE) / (get_current_user_n(df, x) * num_questions))
    }))
}

behavior_scores <- get_score(survey_data, 'Trues_Behavior_Category', 3)
behavior_scores
[1] 0.6666667 0.6666667 0.7777778 0.8888889 1.0000000

final_result <- data.frame(Sub_Date=sort(unique(survey_data$submission_date)),
                           Behaviors_Score=behavior_scores)
final_result
    Sub_Date Behaviors_Score
1 2016-09-04       0.6666667
2 2016-09-05       0.6666667
3 2016-09-15       0.7777778
4 2016-09-20       0.8888889
5 2016-09-29       1.0000000

The result is, as you can see, scores of 2/3, 2/3, 7/9, 8/9, and 1 as requested.

EDIT:

I realize I didn't explain the function's arguments; df is the data frame with your survey data, col_name is the name of the column you want a score for, and num_questions is the number of questions for that category.

duckmayr
  • 16,303
  • 3
  • 35
  • 53
  • At first glance this looks absolutely perfect! Thank you!! Exactly what I needed and very elegant approach! I will let you know if I run into any issues. Thanks again!!! – John Sep 06 '17 at 00:46