How to visualize the data if one participant has multiple entries in different rows?

Question

I am currently working on a dataset that contains multiple participants. Some participants took part in all follow-ups, whereas others skipped some.

For example, in the dataset below, participant 2 took part only in the 3rd follow-up, and participant 3 only in the 2nd and 3rd follow-ups. You can also see that some participants have more than one row because they attended several follow-ups.

The original dataset only has the 1st and 2nd columns. Since I am aiming to create a progress chart of the visits per participant, I have tried to create an extra column for each visit using the code below:

participant <- c(1,1,1,2,3,3,4,5,5,5 )
visit <- c(1,2,3,3,2,3,1,1,2,3)

df <- data.frame(participant, visit)
df[,3] <- as.integer(df$visit=="1")
df[,4] <- as.integer(df$visit=="2")
df[,5] <- as.integer(df$visit=="3")

colnames(df)[colnames(df) %in% c("V3","V4","V5")] <- c(
  "Visit1","Visit2","Visit3")

However, I am still having a hard time combining the rows of the same participant, so I could not proceed to making the chart (which I also have no clue how to do). I tried the `reshape` function, but it did not work. `group_by` did not work either and still returned the original dataset:

df1 <- df[,-2]

df1 %>%
  group_by(participant)

What functions should I use in this case to:

- combine the rows of the same participant?
- produce the progress chart?

Thank you in advance for your help!

Please don't post data as images. Take a look at how to make a [great reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for ways of showing data. The gold standard for providing data is using `dput(head(NameOfYourData))`, *editing* your question and putting the `structure()` output into the question. — Martin Gal, Sep 02 '21 at 22:40
`group_by` just groups the data in an abstract way. You have to apply a function (usually inside `mutate()` or `summarise()`) to change the data.frame. — Martin Gal, Sep 02 '21 at 23:15
Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. — Community, Sep 02 '21 at 23:30

Martin Gal · Accepted Answer · 2021-09-02T23:14:18.577

Based on your df you could produce the chart with

library(ggplot2)
library(dplyr)

df %>% 
  ggplot(aes(x = as.factor(visit), 
             y = as.factor(participant), 
             fill = as.factor(visit))) +
  geom_tile(aes(width = 0.7, height = 0.7), color = "black") + 
  scale_fill_grey() +
  xlab("Visit") + 
  ylab("Participants") +
  guides(fill = "none")

If you need your data.frame in a wide format (similar to the image shown but with only one row per participant), use

library(tidyr)
library(dplyr)

df %>% 
  mutate(value = 1) %>% 
  pivot_wider(
    names_from = visit,
    values_from = value,
    names_glue = "Visit{visit}",
    values_fill = 0)

to get

# A tibble: 5 x 4
  participant Visit1 Visit2 Visit3
        <dbl>  <dbl>  <dbl>  <dbl>
1           1      1      1      1
2           2      0      0      1
3           3      0      1      1
4           4      1      0      0
5           5      1      1      1
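The same wide table can also be sketched in base R with a cross-tabulation, in case the tidyverse is not available. This is only an illustration under the question's sample data; the names `tab`, `mat`, and `wide` are made up for this example.

```r
# Sample data from the question
participant <- c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5)
visit <- c(1, 2, 3, 3, 2, 3, 1, 1, 2, 3)
df <- data.frame(participant, visit)

# Cross-tabulate participants (rows) against visits (columns)
tab <- table(df$participant, df$visit)

# Turn the counts into a 0/1 attendance matrix with VisitN column names
mat <- matrix(as.integer(tab > 0),
              nrow = nrow(tab),
              dimnames = list(NULL, paste0("Visit", colnames(tab))))

# One row per participant; the row names of tab hold the participant ids
wide <- data.frame(participant = as.numeric(rownames(tab)), mat)
wide
```

Because the counts are compared with `tab > 0`, a participant/visit pair that happens to appear more than once would still be recorded as 1.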

Oh my god, thank you so much for this! I think I overcomplicated the problem by creating an extra column for each separate visit. Thank you so much! I will go ahead and read more about the ggplot2 package. — Helen Andrews, Sep 03 '21 at 10:18
@HelenAndrews Take a look at http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html for nice examples of `ggplot2`-plots. :-) — Martin Gal, Sep 03 '21 at 11:12

GuedesBF · Answer 2 · 2021-09-02T23:34:26.117

I think you are looking for a way to dummify a variable. There are several ways to do that.

I like the fastDummies package. You can use dummy_cols, with remove_selected_columns=TRUE.

df %>% fastDummies::dummy_cols(select_columns = 'visit',
                                remove_selected_columns = TRUE)

   participant visit_1 visit_2 visit_3
1            1       1       0       0
2            1       0       1       0
3            1       0       0       1
4            2       0       0       1
5            3       0       1       0
6            3       0       0       1
7            4       1       0       0
8            5       1       0       0
9            5       0       1       0
10           5       0       0       1

You may want to pipe in a `summarise` operation to make the table even cleaner, as in:

df %>%
  fastDummies::dummy_cols(select_columns = 'visit',
                          remove_selected_columns = TRUE) %>%
  group_by(participant) %>%
  summarise(across(starts_with('visit'), max))

# A tibble: 5 x 4
  participant visit_1 visit_2 visit_3
        <dbl>   <int>   <int>   <int>
1           1       1       1       1
2           2       0       0       1
3           3       0       1       1
4           4       1       0       0
5           5       1       1       1
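For comparison, the collapse to one row per participant can be sketched with `dplyr` alone, which also shows why `group_by` on its own changed nothing in the question: the grouping only takes effect once `summarise` is applied. This is a minimal sketch that assumes exactly three visits, as in the sample data; the column names `Visit1` to `Visit3` are chosen for illustration.

```r
library(dplyr)

# Sample data from the question
participant <- c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5)
visit <- c(1, 2, 3, 3, 2, 3, 1, 1, 2, 3)
df <- data.frame(participant, visit)

# One row per participant: 1 if any row of that participant has the visit
wide <- df %>%
  group_by(participant) %>%
  summarise(
    Visit1 = as.integer(any(visit == 1)),
    Visit2 = as.integer(any(visit == 2)),
    Visit3 = as.integer(any(visit == 3))
  )
wide
```

Hard-coding the visit numbers keeps the example short, but it does not scale to many visits; the dummy-column or pivoting approaches handle that case more conveniently.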

In a way, this looks a bit like a pivoting operation too. You may be interested in using `tidyr::pivot_wider` here as well.

EDIT: @MartinGal had just given a similar answer, so I removed my own very similar `pivot_wider` version.

Thank you so much GuedesBF, I have indeed tried the function and it worked! Now I can work with my actual dataset, which consists of thousands of data points. This really helps! Thank you and please have a nice weekend. — Helen Andrews, Sep 03 '21 at 10:19
I am glad I could help. Please check this: https://stackoverflow.com/help/someone-answers — GuedesBF, Sep 03 '21 at 16:54
