spreading a binary variable by a grouping variable r

Question

I have a dataset (DF) that looks like what I have below:

   ID DOB      Age Outcome    
   1  1/01/80  18     1
   1  1/01/80  18     0
   2  1/02/81  17     1
   2  1/02/81  17     0
   3  1/03/70  28     1

I want to change my database to wide format, so that I have one row per ID. However, given that DOB and Age are the same for each ID, I want these variables to be a single column in the new database and simply have multiple columns for the Outcome variable, as below:

   ID DOB      Age Outcome.1 Outcome.2    
   1  1/01/80  18     1         0
   2  1/02/81  17     1         0
   3  1/03/70  28     1         NA

I have tried using tidyr and reshape, but I can't seem to get the database into this format. For example when I use the code:

spread(DF, key=ID, value = Outcome)

I get an error that indicates that I have duplicate identifiers for rows. Is there a way to get the database into the format I would like?

Thanks.

Possible duplicate of https://stackoverflow.com/questions/45898614/how-to-spread-columns-with-duplicate-identifiers or https://stackoverflow.com/questions/43259380/spread-with-duplicate-identifiers-using-tidyverse-and or https://stackoverflow.com/questions/25960394/spread-with-data-frame-tibble-with-duplicate-identifiers maybe. — thelatemail, Feb 07 '18 at 22:19

MKR · Answer 1 · 2018-02-07T22:18:07.193

One solution could be achieved by following steps using tidyverse. The idea is to add row number to a column to provide a unique ID for each row. Afterwards there are different ways to apply spread.

df <- read.table(text = "ID DOB      Age Outcome    
1  1/01/80  18     1
1  1/01/80  18     0
2  1/02/81  17     1
2  1/02/81  17     0
3  1/03/70  28     1", header = T, stringsAsFactors = F)

library(tidyverse)

df %>% mutate(rownum = row_number(), Outcome = paste("Outcome",Outcome,sep=".")) %>%
  spread(Outcome, rownum) %>%
  mutate(Outcome.0 = ifelse(!is.na(Outcome.0),0, NA )) %>%
  mutate(Outcome.1 = ifelse(!is.na(Outcome.1),1, NA ))

# Result:
#  ID     DOB Age Outcome.0 Outcome.1
#1  1 1/01/80  18         0         1
#2  2 1/02/81  17         0         1
#3  3 1/03/70  28        NA         1

score 1 · Accepted Answer · answered Feb 07 '18 at 21:53

1

The dcast function is used for things like this.

dcast(data, ID + DOB + Age ~ Outcome)

answered Feb 07 '18 at 21:53

C-x C-c

1,261
8
20

Thanks. That worked well. What happens if I want to spread more than one outcome variable? – user183974 Feb 08 '18 at 02:21
You would add another value to the right hand side of the equation. So: dcast(data, ID + DOB + Age ~ Outcome + var2) – C-x C-c Feb 08 '18 at 21:01
This doesn't work, it spreads the values of Outcome and var2 such that I get values corresponding to something like a cross-tabulation of the two variables. So, let's say my values for Outcome are 1 and 0 and my values for var2 are "yes" and "no". I get the following col names in the resulting dataset "1yes", "0yes","1no","0no". I would like to have each value separate. Something like "1","0","yes","no". – user183974 Feb 10 '18 at 04:51

score 1 · Answer 3 · answered Feb 07 '18 at 21:58

1

You could use tidyr and dplyr:

   DF %>%
      group_by(ID) %>%
      mutate(OutcomeID = paste0('Outcome.', row_number())) %>%
      spread(OutcomeID, Outcome)

answered Feb 07 '18 at 21:58

Martin C. Arnold

9,483
1
14
22

spreading a binary variable by a grouping variable r

3 Answers3