Tidy up and reshape messy dataset (reshape/gather/unite function)?

Question

Following my earlier question: R: reshape/gather function to create dataset ready for multilevel analysis

I discovered it is a bit more complicated. My dataset is actually 'messier' than I hoped. So here's the full story: I have a big dataset, 240 cases. Each row is a case (breast cancer patient). Somewhere at the end of the dataset(say from column 417 onwards) I have partner data of the patients, that also filled in a questionnaire. In the beginning, there are demographic variables for both patients and partners, followed by test outcomes only of patients, thus followed by partner data.

I want to create a dataset, where I 'split' the patient and partner data, but keep it coupled. Thus: I want to duplicate the subject ID and create new column with 1s and 2s (1 corresponding to patient and 2 to partner). Then, I want my data actually as it is now, but some variables can be matched though (for example, I know have "date of birth" for patient [pgebdat] and for partner [prgebdat] separate. Ofcourse, I can turn this into 'gebdat' with the two birth dates below each other.

This code worked for me for a small subset of my data:

mydf_long <- mydf4 %>% 
unite(bb1:bb50rec, col = `1`, sep = ";") %>% # Combine responses of 'p1' through 'p3'
unite(pbb1:pbb50recM, col = `2`, sep = ";") %>% # Combine responses of 'pr1' through 'pr3'
gather(couple, value, `1`:`2`) %>% # Form into long data
separate(value, sep = ";", into = c(paste0("bb", seq(1:104),"", sep = ','))) %>% # Separate and retrieve original answers
arrange(id)

results in:

   id groep_MNC zkhs fbeh    pgebdat couple bb1,
1  3         1    1    1 1955-12-01      1    4
2  3         1    1    1 1955-12-01      2    5
3  5         1    1    1 1943-04-09      1    2
4  5         1    1    1 1943-04-09      2    2

But now it copies and pastes the date of birth of the patient also to 'partner' row.

I'm stuck, and don't even quite know what data you would need to be able to answer my question, so please do ask. I'll provide something of an example below:

Example of data

     id groep_MNC zkhs fbeh    pgebdat    p_age   pgesl   prgebdat   pr_age prgesl relpnst
1     3         1    1    1 1955-12-01 42.50000       1       <NA>       NA      2       1
2     5         1    1    1 1943-04-09 55.16667       1 1962-04-18 36.50000      1       2
3     7         1    1    1 1958-04-10 40.25000       1       <NA>       NA      2       1
4    10         1    1    1 1958-04-17 40.25000       1 1957-07-31 41.33333      2       1
5    12         1    1    2 1947-11-01 50.66667       1 1944-06-08 54.58333      2       1

And then, after couple of hundred variables for only patients, this partner data comes along:

     pbb1 pbb2 pbb3 pbb4 pbb5 pbb6 pbb7 pbb8 pbb9
 1      5    5    5    5    2    5    4    2    3
 2      2    1    4    1    3    4    3    3    4
 3      5    3    4    4    4    3    5    3    4
 4      5    3    5    5    5    5    4    4    4
 5      5    5    5    5    5    4    4    3    4

note, I didn't create this dataset myself - I'm just here to tidy up the mess :)

Edit: The dataset is in dutch. Pgesl = gender for patient, prgesl = gender for partner... etc.

It seems like you might be overcomplicating the task by trying to keep everything in one table. Would it simplify the task to separate the partner data from the patient data, do the necessary manipulations for each of them separately, and then join together the data you need just before the analysis? — Benjamin, Aug 30 '17 at 14:02
That might be a good idea, but I'm not completely sure how to add the extra rows and keep the data coupled. — Hannie, Aug 30 '17 at 14:06

Jaap · Accepted Answer · 2017-08-30T14:26:02.590

Using the melt function from the data.table-package you can use multiple measures by patterns and as a result create more than one value column:

library(data.table)
melt(setDT(df), measure.vars = patterns('_age','gesl','gebdat'),
     value.name = c('age','geslacht','geboortedatum')
     )[, variable := c('patient','partner')[variable]][]

you get:

    id groep_MNC zkhs fbeh relpnst pbb1 pbb2 variable      age geslacht geboortedatum
 1:  3         1    1    1       1    5    5  patient 42.50000        1    1955-12-01
 2:  5         1    1    1       2    2    1  patient 55.16667        1    1943-04-09
 3:  7         1    1    1       1    5    3  patient 40.25000        1    1958-04-10
 4: 10         1    1    1       1    5    3  patient 40.25000        1    1958-04-17
 5: 12         1    1    2       1    5    5  patient 50.66667        1    1947-11-01
 6:  3         1    1    1       1    5    5  partner       NA        2          <NA>
 7:  5         1    1    1       2    2    1  partner 36.50000        1    1962-04-18
 8:  7         1    1    1       1    5    3  partner       NA        2          <NA>
 9: 10         1    1    1       1    5    3  partner 41.33333        2    1957-07-31
10: 12         1    1    2       1    5    5  partner 54.58333        2    1944-06-08

Instead of patterns you could also use a list of column indexes or columnnames.

HTH

Used data:

df <- structure(list(id = c(3L, 5L, 7L, 10L, 12L), 
                     groep_MNC = c(1L, 1L, 1L, 1L, 1L),
                     zkhs = c(1L, 1L, 1L, 1L, 1L),
                     fbeh = c(1L, 1L, 1L, 1L, 2L),
                     pgebdat = c("1955-12-01", "1943-04-09", "1958-04-10", "1958-04-17", "1947-11-01"),
                     p_age = c(42.5, 55.16667, 40.25, 40.25, 50.66667),
                     pgesl = c(1L, 1L, 1L, 1L, 1L),
                     prgebdat = c("<NA>", "1962-04-18", "<NA>", "1957-07-31", "1944-06-08"),
                     pr_age = c(NA, 36.5, NA, 41.33333, 54.58333),
                     prgesl = c(2L, 1L, 2L, 2L, 2L),
                     relpnst = c(1L, 2L, 1L, 1L, 1L),
                     pbb1 = c(5L, 2L, 5L, 5L, 5L),
                     pbb2 = c(5L, 1L, 3L, 3L, 5L)), 
                .Names = c("id", "groep_MNC", "zkhs", "fbeh", "pgebdat", "p_age", "pgesl", "prgebdat", "pr_age", "prgesl", "relpnst", "pbb1", "pbb2"),
                class = "data.frame", row.names = c("1", "2", "3", "4", "5"))

Tidy up and reshape messy dataset (reshape/gather/unite function)?

1 Answers1

Linked