Restructuring dataset with seveal identical IDs in different rows

Question

I have a dataset with IDs listed in column 1 and then diagnosis in column 2, cause in column 3 and so on. For IDs with several diagnosis these are listed as one row per diagnosis so the same ID with have several rows. I would like to have everything as one row per ID and the several diagnosis as columns. Is this possible?

My data look somethign like this:

ID diagnosis cause_of_diagnosis
1   A  A
1   B  B
1   C  C
2   A  A
3   A  A
3   B  B
3   C  C

I would like the data to end up looking something like:

ID diagnosis_1 diagnosis_2 diagnosis_3 cause_of_diagnosis_1 cause_of_diagnosis_2 cause_of_diagnosis_3
 1  A  B  C  A  B  C
 2  A  -  -  A  -  -
 3  A  B  C  A  B  C

Ronak Shah · Answer 1 · 2019-09-11T00:55:15.357

0

We can use dplyr and tidyr to reshape the data. Use gather to get data in long format, group_by ID and key to create a new column name for every "diagnosis" and "cause_of_diagnosis" and spread the data in wide format.

library(dplyr)
library(tidyr)

df %>%
  gather(key, value, -ID) %>%
  group_by(ID, key) %>%
  mutate(key1 = paste(key, row_number(), sep = "_")) %>%
  ungroup() %>%
  select(-key) %>%
  spread(key1, value)

#     ID cause_of_diagnosis_1 cause_of_diagnosis_2 cause_of_diagnosis_3 diagnosis_1 diagnosis_2 diagnosis_3
#  <int> <chr>                <chr>                <chr>                <chr>       <chr>       <chr>      
#1     1 A                    B                    C                    A           B           C          
#2     2 A                    NA                   NA                   A           NA          NA         
#3     3 A                    B                    C                    A           B           C

data

df <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 3L),diagnosis = structure(c(1L, 
2L, 3L, 1L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), 
cause_of_diagnosis = structure(c(1L, 2L, 3L, 1L, 1L, 2L, 
3L), .Label = c("A", "B", "C"), class = "factor")), 
class = "data.frame", row.names = c(NA, -7L))

edited Sep 11 '19 at 00:55

answered Sep 11 '19 at 00:34

Ronak Shah

377,200
20
156
213

Ok, thanks! Do you put any values in for key, key1 and value or do you just run this as is? – Gunn-Helen Moen Sep 11 '19 at 00:53
@Gunn-HelenMoen `key` and `key1` are generated on the fly. Run this as is. – Ronak Shah Sep 11 '19 at 00:54
OK - I'm not sure this is doing exactly what I want then. I'm currently not getting any diagnosis columns - but about 700 different cause of diagnosis columns – Gunn-Helen Moen Sep 11 '19 at 01:01
@Gunn-HelenMoen are you sure there is no diagnosis column in the data? The "diagnosis" column would be at the end since the columns are sorted alphabetically. Store the result of above in `df1` and run `grep("^diagnosis", names(df1))` what do you get ? – Ronak Shah Sep 11 '19 at 01:12
yes, there are diagnosis columns as well, but it looks like there are a few hundred of them - which I'm not quite sure as of why that is – Gunn-Helen Moen Sep 11 '19 at 01:23
How many "diagnosis" and "cause_of_diagnosis" columns do you expect ? how many rows are present in your data ? – Ronak Shah Sep 11 '19 at 01:42

Restructuring dataset with seveal identical IDs in different rows

1 Answers1