I have data from a questionnaire-based survey, and I would like to plot the cumulative number of answers collected over time, to visualize how the response count evolved and on which dates most people answered.

An example dataset would look like this. Each respondent has an ID, an answer date, and their replies to questions.

d <- data.frame(
    id = c(1, 2, 3, 4),
    date = c("2023-01-01", "2023-01-01", "2023-01-02", "2023-01-03"),
    age = c(22, 26, 19, 31),
    another_question = c("answer 1", "answer 2", "answer 1", "answer 3")
)

This plot (coming from here) would be a good example of what I'd like to do.

Example of cumulative answers plot.

Michael Marx

1 Answer

So I did a little tinkering and came up with a pretty satisfying result.

As there may be no answers on some dates, and I wanted a continuous line plot, I relied on this post to add the missing dates, and on this one to build the cumulative plot.

So here is the code I ended up with. I used min(d$date) and max(d$date) to set the date range to pad automatically, but this may be unsatisfying. Dates can easily be set manually with ymd("your-date"), as done for the plot limits. I also added some annotations for the key moments in the distribution of the questionnaire.

library(padr)
library(tidyverse)
library(lubridate)

d <- data.frame(
    id = c(1, 2, 3, 4),
    date = c("2023-01-01", "2023-01-01", "2023-01-02", "2023-01-03"),
    age = c(22, 26, 19, 31),
    another_question = c("answer 1", "answer 2", "answer 1", "answer 3")
)
d$date <- ymd(d$date)

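# Count answers per day, including days without any answer
answers_by_date <- d %>%
    select(date, id) %>%
    mutate(id = 1) %>%  # each row counts as one answer
    pad(start_val = min(d$date), end_val = max(d$date), interval = "day") %>%  # add missing dates
    fill_by_value(id, value = 0) %>%  # padded dates have zero answers
    group_by(date) %>%
    summarize(Freq=sum(id)) %>%
    mutate(cumulative = cumsum(Freq))  # running total of answers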

answers_by_date %>%
    ggplot(aes(date, cumulative)) +
    geom_line() +
    scale_x_date(
        date_labels = "%d %b",
        date_breaks = "1 week",
        date_minor_breaks = "1 day",
        limits = c(ymd("2022-12-01"), ymd("2023-01-23"))
    ) +
    # Label heights (y) are chosen for this 4-answer example; adjust them to your data
    geom_vline(xintercept = ymd("2023-01-01"), linetype = 4) +
    annotate("label", x = ymd("2023-01-01"), y = 3.5, size = 2, label = "Questionnaire sending") +
    geom_vline(xintercept = ymd("2023-01-03"), linetype = 4) +
    annotate("label", x = ymd("2023-01-03"), y = 2.5, size = 2, label = "Questionnaire closure") +
    labs(title = "Cumulative answers by date", x = "Date", y = "Total answers")
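
If hard-coded plot limits are not wanted, they can probably be derived from the data instead. The following is only a sketch: the small `answers_by_date` data frame is made up for illustration, and the one-day margin on each side is an arbitrary choice, not something from the code above.

```r
library(ggplot2)

# Hypothetical cumulative counts, standing in for the real answers_by_date
answers_by_date <- data.frame(
    date = as.Date("2023-01-01") + 0:3,
    cumulative = c(2, 3, 3, 4)
)

# Limits taken from the data, widened by an assumed margin of one day on each side
x_limits <- range(answers_by_date$date) + c(-1, 1)

p <- ggplot(answers_by_date, aes(date, cumulative)) +
    geom_line() +
    scale_x_date(date_labels = "%d %b", date_breaks = "1 day", limits = x_limits) +
    labs(title = "Cumulative answers by date", x = "Date", y = "Total answers")
```

Printing `p` draws the plot. The annotation heights could be derived the same way, for example from `max(answers_by_date$cumulative)`.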

The trick with mutate(id = 1) and summarize(Freq=sum(id)) is a bit ugly, but it works. (I would be really interested if you have a cleaner solution for this.)
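
One possibly cleaner way to get the same table, sketched here with `count()` from dplyr and `complete()` from tidyr instead of padr: count the rows per date first, then fill in the missing dates with a frequency of 0. The example data below is hypothetical and has a gap on 2023-01-03, so that the filling step actually does something.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with no answer on 2023-01-03
d <- data.frame(
    id = c(1, 2, 3, 4),
    date = as.Date(c("2023-01-01", "2023-01-01", "2023-01-02", "2023-01-04"))
)

answers_by_date <- d %>%
    count(date, name = "Freq") %>%  # number of answers per date
    complete(
        date = seq(min(date), max(date), by = "day"),  # every day in the range
        fill = list(Freq = 0L)                         # days without answers get 0
    ) %>%
    mutate(cumulative = cumsum(Freq))  # running total of answers
```

The result has one row per day, with the same `date`, `Freq` and `cumulative` columns, so it should be usable with the same ggplot code.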

Here is the rendering with some real data (with French labels).

Cumulative answers by date in a real survey with this code.

This kind of plot is common in social science survey reports, and I had a hard time figuring out how to do it, so I hope this saves someone else some time. ^^

Michael Marx