0

I have wide dataset relating to cases and their contacts. (This is a made up example; the real dataset is much larger).

structure(list(record_id = structure(1:4, .Label = c("01-001", 
"01-002", "01-003", "01-004"), class = "factor"), place = structure(c(1L, 
2L, 1L, 1L), .Label = c("a", "b"), class = "factor"), sex = structure(c(2L, 
2L, 1L, 2L), .Label = c("F", "M"), class = "factor"), age = c(4L, 
13L, 28L, 44L), d02_1 = c(2L, 2L, NA, 2L), d02_2 = structure(c(3L, 
2L, 1L, 3L), .Label = c("", "F", "M"), class = "factor"), d02_3 = c(27L, 
16L, NA, 66L), d03_1 = c(3L, 3L, NA, 3L), d03_2 = structure(c(3L, 
3L, 1L, 2L), .Label = c("", "F", "M"), class = "factor"), d03_3 = c(14L, 
55L, NA, 12L), d04_1 = c(4L, NA, NA, NA), d04_2 = structure(c(2L, 
1L, 1L, 1L), .Label = c("", "M"), class = "factor"), d04_3 = c(7L, 
NA, NA, NA)), .Names = c("record_id", "place", "sex", "age", 
"d02_1", "d02_2", "d02_3", "d03_1", "d03_2", "d03_3", "d04_1", 
"d04_2", "d04_3"), row.names = c(NA, -4L), class = "data.frame")

Where:

  • record_id is the unique identifier of the case
  • place is the place where the case lives
  • age is case's age
  • sex is case's sex

  • d02_1, d03_1, d04_1 ... d0j_1 are contact's ids

  • d02_2, d03_2, d04_2 ... d0j_2 are contact's sex
  • d02_3, d03_3, d04_3 ... d0j_3 are contact's age

In the real dataset, there are potentially many contacts per case, and many more variables relating to contact's characteristics. Not all cases will have contacts.

I want to reshape the dataset to a tidy format, with one row per case/contact, ie:

         id case place sex age
1    01-001    1     a   M   4
2  01-001-2    0     a   M  27
3  01-001-3    0     a   M  14
4  01-001-4    0     a   M   7
5    01-002    1     b   M  13
6  01-002-2    0     b   F  16
7  01-002-3    0     b   M  55
8    01-003    1     a   F  28
9    01-004    1     a   M  44
10 01-004-2    0     a   M  66
11 01-004-3    0     a   F  12

I am thinking that I will need to create vectors of columns names relating to each contact (potentially using character-matching on column names), select these columns sequentially, and append them to each other (as well as concatenating the case/contact ids), but really struggling to without lots and lots of copying of lines of code. Must be a more efficient method?

Peter MacPherson
  • 683
  • 1
  • 7
  • 17
  • 1
    Does this not help: http://stackoverflow.com/questions/40229114/tidyrgather-multiple-columns-of-varying-types?rq=1. Seems like the same thing basically. – MrFlick Mar 07 '17 at 16:16
  • You should probably set `na.strings=''` when reading in. It doesn't make much sense / makes everything harder to have blanks there... – Frank Mar 07 '17 at 16:26

1 Answers1

0

Is this what you are looking for?

It is a dplyr solution that is ugly for a number of reasons, but I think it gets the job done.

DF <- DF %>%
  rename_(.dots=setNames(names(.), gsub('_1','_ContactID',names(.)))) %>%
  rename_(.dots=setNames(names(.), gsub('_2','_sex',names(.)))) %>%
  rename_(.dots=setNames(names(.), gsub('_3','_age',names(.)))) %>%
  rename(d00_sex=sex,d00_age=age) %>%
  mutate(d00_ContactID=1) %>%
  gather(Var,Val,-record_id,-place) %>%
  mutate(Val =ifelse(Val=='',NA,Val)) %>%
  separate(Var,c('ContactLevel','Var'),sep='_') %>%
  spread(Var,Val) %>%
  arrange(record_id,ContactLevel) %>%
  filter(!is.na(age),!is.na(ContactID),!is.na(sex)) %>%
  mutate(age = as.numeric(age))

I start off by renaming your variables for clarity. (rename_ lines)

Next, I put your case info variables into a consistent pattern where the case info is ContactID=1. (ename and mutate lines)

Gather turns the data from wide to long, but leaves us with one very ugly column and converts all your data to character. (This is the ugly part where the warning is triggered.)

separate splits the old column names into Contact ID and the data column.

spread then opens up the age, sex and ID into columns again. At this line these data are what you want, but can still be cleaned up a bit.

arrange is not necessary, but it puts all of the record IDs together.

filter is also not necessary, it just removes the rows with no contract information.

Finally, I use mutate to turn age from character to numeric. If you wish you can also turn sex into a factor here, and possibly contact ID as well.

Bishops_Guest
  • 1,422
  • 13
  • 13