1

I have a dataset (DF) that looks like what I have below:

   ID DOB      Age Outcome    
   1  1/01/80  18     1
   1  1/01/80  18     0
   2  1/02/81  17     1
   2  1/02/81  17     0
   3  1/03/70  28     1

I want to change my database to wide format, so that I have one row per ID. However, given that DOB and Age are the same for each ID, I want these variables to be a single column in the new database and simply have multiple columns for the Outcome variable, as below:

   ID DOB      Age Outcome.1 Outcome.2    
   1  1/01/80  18     1         0
   2  1/02/81  17     1         0
   3  1/03/70  28     1         NA

I have tried using tidyr and reshape, but I can't seem to get the database into this format. For example when I use the code:

spread(DF, key=ID, value = Outcome)

I get an error that indicates that I have duplicate identifiers for rows. Is there a way to get the database into the format I would like?

Thanks.

user183974
  • 173
  • 9
  • Possible duplicate of https://stackoverflow.com/questions/45898614/how-to-spread-columns-with-duplicate-identifiers or https://stackoverflow.com/questions/43259380/spread-with-duplicate-identifiers-using-tidyverse-and or https://stackoverflow.com/questions/25960394/spread-with-data-frame-tibble-with-duplicate-identifiers maybe. – thelatemail Feb 07 '18 at 22:19

3 Answers3

2

One solution could be achieved by following steps using tidyverse. The idea is to add row number to a column to provide a unique ID for each row. Afterwards there are different ways to apply spread.

df <- read.table(text = "ID DOB      Age Outcome    
1  1/01/80  18     1
1  1/01/80  18     0
2  1/02/81  17     1
2  1/02/81  17     0
3  1/03/70  28     1", header = T, stringsAsFactors = F)

library(tidyverse)

df %>% mutate(rownum = row_number(), Outcome = paste("Outcome",Outcome,sep=".")) %>%
  spread(Outcome, rownum) %>%
  mutate(Outcome.0 = ifelse(!is.na(Outcome.0),0, NA )) %>%
  mutate(Outcome.1 = ifelse(!is.na(Outcome.1),1, NA ))

# Result:
#  ID     DOB Age Outcome.0 Outcome.1
#1  1 1/01/80  18         0         1
#2  2 1/02/81  17         0         1
#3  3 1/03/70  28        NA         1
MKR
  • 19,739
  • 4
  • 23
  • 33
1

The dcast function is used for things like this.

dcast(data, ID + DOB + Age ~ Outcome)
C-x C-c
  • 1,261
  • 8
  • 20
  • Thanks. That worked well. What happens if I want to spread more than one outcome variable? – user183974 Feb 08 '18 at 02:21
  • You would add another value to the right hand side of the equation. So: dcast(data, ID + DOB + Age ~ Outcome + var2) – C-x C-c Feb 08 '18 at 21:01
  • This doesn't work, it spreads the values of Outcome and var2 such that I get values corresponding to something like a cross-tabulation of the two variables. So, let's say my values for Outcome are 1 and 0 and my values for var2 are "yes" and "no". I get the following col names in the resulting dataset "1yes", "0yes","1no","0no". I would like to have each value separate. Something like "1","0","yes","no". – user183974 Feb 10 '18 at 04:51
1

You could use tidyr and dplyr:

   DF %>%
      group_by(ID) %>%
      mutate(OutcomeID = paste0('Outcome.', row_number())) %>%
      spread(OutcomeID, Outcome)
Martin C. Arnold
  • 9,483
  • 1
  • 14
  • 22