2

I have a big dataset, with 240 cases representing 240 patients. They all have undergone neuropsychological tests and filled in questionnaires. Additionally, their significant others (hereafter: proxies) have also filled in questionnaires. Since 'patient' and 'proxy' are nested in 'couples', I want to conduct a multilevel analysis in R. For this, I need to reshape my dataset to run those kind of analysis.

Simply said, I want to 'duplicate' my rows. For the double subject IDs add a new variable with 1s and 2s, where 1 stands for patient data and 2 stands for proxy data. Then I want the rows to be filled with 1. all the patient data and the columns that contain the proxy data to be NA or empty or whatever, and 2. all the proxy data, and all the patient data NA or empty.

Let's say this is my data:

id <- c(1:5)
names <- c('id', 'p1', 'p2', 'p3', 'pr1', 'pr2', 'pr3')
p1 <- c(sample(1:10, 5))
p2 <- c(sample(10:20, 5))
p3 <- c(sample(20:30, 5))
pr1 <- c(sample(1:10, 5))
pr2 <- c(sample(10:20, 5))
pr3 <- c(sample(20:30, 5))

mydf <- as.data.frame(matrix(c(id, p1, p2, p3, pr1, pr2, pr3), nrow = 5))
colnames(mydf) <- names

>mydf

   id p1 p2 p3 pr1 pr2 pr3
1  1  6  20 22 1   10  24
2  2  8  11 24 2   18  29
3  3  7  10 25 6   20  26
4  4  3  14 20 10  15  20
5  5  5  19 29 7   14  22

I want my data finally to look like this:

id2 <- rep(c(1:5), each = 2)
names2 <- c('id', 'couple', 'q1', 'q2', 'q3')
couple <- rep(1:2, 5)
p1 <- c(sample(1:10, 5))
p2 <- c(sample(10:20, 5))
p3 <- c(sample(20:30, 5))
pr1 <- c(sample(1:10, 5))
pr2 <- c(sample(10:20, 5))
pr3 <- c(sample(20:30, 5))

mydf <- as.data.frame(matrix(c(id2, couple, p1, p2, p3, pr1, pr2, pr3), nrow = 10, ncol = 5))
colnames(mydf) <- names2

>mydf
   id couple q1 q2 q3
1   1      1  6 23 16
2   1      2 10 28 10
3   2      1  1 27 14
4   2      2  7 21 20
5   3      1  5 30 18
6   3      2 12  2 27
7   4      1 10  1 25
8   4      2 13  7 21
9   5      1 11  6 20
10  5      2 18  3 23

Or, if this is not possible, like this:

   id couple bb1 bb2 bb3 pbb1 pbb2 pbb3
1   1      1  6  23  16
2   1      2             10   28   10
3   2      1  1  27  14
4   2      2             7    21   20
5   3      1  5  30  18
6   3      2             12   2    27
7   4      1 10   1  25
8   4      2             13   7    21
9   5      1 11   6  20
10  5      2             18   3    23

Now, to get me there, i've tried the melt() function and the gather() function and it feels like i'm close but still it's not working the way I want it to work.

note, in my dataset the variable names are bb1:bb54 for the patient questionnaire and pbb1:pbb54 for the proxy questionnaire

Example of what I've tried

df_long <- df_reshape %>%
gather(testname, value, -(bb1:bb11), -(pbb1:pbb11), -id, -pgebdat, -p_age, na.rm=T) %>%
arrange(id)
Jaap
  • 81,064
  • 34
  • 182
  • 193
Hannie
  • 417
  • 5
  • 17

2 Answers2

2

If I understand what you want correctly, you can gather everything to a very long form and then reshape back to a slightly wider form:

library(tidyverse)
set.seed(47)    # for reproducibility

mydf <- data.frame(id = c(1:5),
                   p1 = c(sample(1:10, 5)),
                   p2 = c(sample(10:20, 5)),
                   p3 = c(sample(20:30, 5)),
                   pr1 = c(sample(1:10, 5)),
                   pr2 = c(sample(10:20, 5)),
                   pr3 = c(sample(20:30, 5)))

mydf_long <- mydf %>% 
    gather(var, val, -id) %>% 
    separate(var, c('couple', 'q'), -2) %>% 
    mutate(q = paste0('q', q)) %>% 
    spread(q, val)

mydf_long
#>    id couple q1 q2 q3
#> 1   1      p 10 17 21
#> 2   1     pr 10 11 24
#> 3   2      p  4 13 27
#> 4   2     pr  4 15 20
#> 5   3      p  7 14 30
#> 6   3     pr  1 14 29
#> 7   4      p  6 18 24
#> 8   4     pr  8 20 30
#> 9   5      p  9 16 23
#> 10  5     pr  3 18 25
alistaire
  • 42,459
  • 4
  • 77
  • 117
0

One approach would be to use unite and separate in tidyr, along with the gather function as well.

I'm using your mydf data frame since it was provided, but it should be pretty straightforward to make any changes:

mydf %>% 
  unite(p1:p3, col = `1`, sep = ";") %>% # Combine responses of 'p1' through 'p3'
  unite(pr1:pr3, col = `2`, sep = ";") %>% # Combine responses of 'pr1' through 'pr3'
  gather(couple, value, `1`:`2`) %>% # Form into long data
  separate(value, sep = ";", into = c("q1", "q2", "q3"), convert = TRUE) %>% # Separate and retrieve original answers
  arrange(id)

Which gives you:

   id couple q1 q2 q3
1   1      1  9 18 25
2   1      2 10 18 30
3   2      1  1 11 29
4   2      2  2 15 29
5   3      1 10 19 26
6   3      2  3 19 25
7   4      1  7 10 23
8   4      2  1 20 28
9   5      1  6 16 21
10  5      2  5 12 26

Our numbers are different since they were all randomly generated with sample.


Edited per @alistaire comment: add convert = TRUE to the separate call to make sure the responses are still of class integer.

Dave Gruenewald
  • 5,329
  • 1
  • 23
  • 35
  • I got this working for a small subset of my dataset, in which I also included variables like date of birth and which hospital they were treated in. Do you know if it's possible to write something in this piece of code so that these values don't 'duplicate'? Now, for the proxy, it receives the same date of birth as the patient. – Hannie Aug 30 '17 at 13:01
  • Hmm, the above should be flexible enough to handle additional columns. You may need to provide a sample of your working data frame (filled with dummy data) for me to replicate and better understand the issue. – Dave Gruenewald Aug 30 '17 at 16:06
  • 1
    Careful, you've turned everything but `id` into strings with `unite`. You can fix the issue by adding `convert = TRUE` to the `gather` and `separate` calls. – alistaire Aug 31 '17 at 02:00
  • Nice call! And definite oversight on my part. I've edited now to include this. – Dave Gruenewald Aug 31 '17 at 14:41