Select multiple columns and reshape wide to long

Question

I have wide dataset relating to cases and their contacts. (This is a made up example; the real dataset is much larger).

structure(list(record_id = structure(1:4, .Label = c("01-001", 
"01-002", "01-003", "01-004"), class = "factor"), place = structure(c(1L, 
2L, 1L, 1L), .Label = c("a", "b"), class = "factor"), sex = structure(c(2L, 
2L, 1L, 2L), .Label = c("F", "M"), class = "factor"), age = c(4L, 
13L, 28L, 44L), d02_1 = c(2L, 2L, NA, 2L), d02_2 = structure(c(3L, 
2L, 1L, 3L), .Label = c("", "F", "M"), class = "factor"), d02_3 = c(27L, 
16L, NA, 66L), d03_1 = c(3L, 3L, NA, 3L), d03_2 = structure(c(3L, 
3L, 1L, 2L), .Label = c("", "F", "M"), class = "factor"), d03_3 = c(14L, 
55L, NA, 12L), d04_1 = c(4L, NA, NA, NA), d04_2 = structure(c(2L, 
1L, 1L, 1L), .Label = c("", "M"), class = "factor"), d04_3 = c(7L, 
NA, NA, NA)), .Names = c("record_id", "place", "sex", "age", 
"d02_1", "d02_2", "d02_3", "d03_1", "d03_2", "d03_3", "d04_1", 
"d04_2", "d04_3"), row.names = c(NA, -4L), class = "data.frame")

Where:

record_id is the unique identifier of the case
place is the place where the case lives
age is case's age
sex is case's sex
d02_1, d03_1, d04_1 ... d0j_1 are contact's ids
d02_2, d03_2, d04_2 ... d0j_2 are contact's sex
d02_3, d03_3, d04_3 ... d0j_3 are contact's age

In the real dataset, there are potentially many contacts per case, and many more variables relating to contact's characteristics. Not all cases will have contacts.

I want to reshape the dataset to a tidy format, with one row per case/contact, ie:

         id case place sex age
1    01-001    1     a   M   4
2  01-001-2    0     a   M  27
3  01-001-3    0     a   M  14
4  01-001-4    0     a   M   7
5    01-002    1     b   M  13
6  01-002-2    0     b   F  16
7  01-002-3    0     b   M  55
8    01-003    1     a   F  28
9    01-004    1     a   M  44
10 01-004-2    0     a   M  66
11 01-004-3    0     a   F  12

I am thinking that I will need to create vectors of columns names relating to each contact (potentially using character-matching on column names), select these columns sequentially, and append them to each other (as well as concatenating the case/contact ids), but really struggling to without lots and lots of copying of lines of code. Must be a more efficient method?

Does this not help: http://stackoverflow.com/questions/40229114/tidyrgather-multiple-columns-of-varying-types?rq=1. Seems like the same thing basically. — MrFlick, Mar 07 '17 at 16:16
You should probably set `na.strings=''` when reading in. It doesn't make much sense / makes everything harder to have blanks there... — Frank, Mar 07 '17 at 16:26

score 0 · Accepted Answer · answered Mar 07 '17 at 17:13

Is this what you are looking for?

It is a dplyr solution that is ugly for a number of reasons, but I think it gets the job done.

DF <- DF %>%
  rename_(.dots=setNames(names(.), gsub('_1','_ContactID',names(.)))) %>%
  rename_(.dots=setNames(names(.), gsub('_2','_sex',names(.)))) %>%
  rename_(.dots=setNames(names(.), gsub('_3','_age',names(.)))) %>%
  rename(d00_sex=sex,d00_age=age) %>%
  mutate(d00_ContactID=1) %>%
  gather(Var,Val,-record_id,-place) %>%
  mutate(Val =ifelse(Val=='',NA,Val)) %>%
  separate(Var,c('ContactLevel','Var'),sep='_') %>%
  spread(Var,Val) %>%
  arrange(record_id,ContactLevel) %>%
  filter(!is.na(age),!is.na(ContactID),!is.na(sex)) %>%
  mutate(age = as.numeric(age))

I start off by renaming your variables for clarity. (rename_ lines)

Next, I put your case info variables into a consistent pattern where the case info is ContactID=1. (ename and mutate lines)

Gather turns the data from wide to long, but leaves us with one very ugly column and converts all your data to character. (This is the ugly part where the warning is triggered.)

separate splits the old column names into Contact ID and the data column.

spread then opens up the age, sex and ID into columns again. At this line these data are what you want, but can still be cleaned up a bit.

arrange is not necessary, but it puts all of the record IDs together.

filter is also not necessary, it just removes the rows with no contract information.

Finally, I use mutate to turn age from character to numeric. If you wish you can also turn sex into a factor here, and possibly contact ID as well.

Yes - this does the job. Nice use of setNames and gsub – Peter MacPherson Mar 07 '17 at 17:48 — Peter MacPherson, Mar 07 '17 at 17:48

Select multiple columns and reshape wide to long

1 Answers1