1

Sorry to bother you. I have a longitudinal dataset but it is in wide format.

There are 380 variables and over 3000 id's.

I am trying to turn the data into a long format so there is a time variable column.

The time variable is indicated by a suffix infront of each column variable so V0 is the Time at baseline and V01 is the time at 1 year. V02 is the time at second year. V03 is the time at 3 years etc.

Most of the variables are categorised using this time variable but not all. So v0pase is the pase at baseline. V01pase is the pase at 1 year.

Not all the variables have a time variable. Some are categorical e.g gender, depression etc.

I am trying to reshape it into a long format in R however my code does not appear to work:

The database is called "TrialData"

TrialDataLong <- reshape(TrialData, varying = c((grepl("v",))), direction = "long", 
                                              timevar = "visit/year", idvar = 
                                             "id", sep = "v")

I have linked an image of a snapshot of the old data to what I am aiming for.

enter image description here I would be grateful if someone can be kind enough to assist in this. I am fairly newish to R.

Many thanks in advance.

Best wishes James

Henrik
  • 65,555
  • 14
  • 143
  • 159
  • 2
    Please don't post data as images. Take a look at how to make a [great reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for ways of showing data. – Martin Gal Aug 07 '21 at 23:03

3 Answers3

2

Here is another solution you could use. Here are some notes with regard to the solution:

  • Except for variables ID and sex other variables of the original data set has a distinct name pattern. It starts with a non-digit character followed by a digit and again a couple non-digit character
  • According to the names of the second data set we are only interested to capture the digit part of the names and the following couple non-digits, discarding the starting non-digit character
  • So I had to define 2 capturing groups, the first is only one digit represented by (\\d) in the regex and the second is at least 3 non-digit represented by ([a-z]{3,}). You may pay attention to {3,} that I left the other end open for example {3,4} means between 3 and 4 characters
  • In this regex "a-z([a-z]{3,})" the first part [a-z] represents any non-digit character corresponding the the starting names of the first data set but as you noticed it is not put in a parenthesis as we don't need to capture them
  • As for names_to argument there are also 2 groups defined corresponding to two capturing groups in names_pattern the first group will be put in Time variable and here is the more interesting .value values which indicates that part of the name (the second group) are actually the values we are trying to measure which are pase and dep

There are other ways of achieving this and that was one way to go about it.

library(dplyr)
library(tidyr)

TrialDataLong %>%
  pivot_longer(starts_with("v"), 
               names_to = c("Time", ".value"), 
               names_pattern = "[a-z](\\d)([A-Za-z]{3,})")

# A tibble: 8 x 5
     id sex   Time   pase dep  
  <dbl> <chr> <chr> <dbl> <chr>
1     1 M     0        10 Yes  
2     1 M     1        12 Yes  
3     1 M     2        14 Yes  
4     1 M     3        13 Yes  
5     2 F     0        15 No   
6     2 F     1        15 Yes  
7     2 F     2        16 Yes  
8     2 F     3        16 Yes  
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
  • 1
    This worked really nicely. Thank you very much. I was wondering if you could kindly explain the meaning behind the notations you used in "names_pattern"="[a-z](\\d)([a-z]{3,})". What does each part mean? I tried googling each bit but it was quite challenging. Presumably all the column variables in the (!c(id, sex) section implies "pivot longer except keep these variables attached to each ID". Hence I can just add more variables this part that I want to keep the same? – JamesLancaster Aug 10 '21 at 08:40
  • You're welcome, I added some notes, let me know if you have any question. By the way by using `!c(id, sex)` I excluded these two columns and use the rest. – Anoushiravan R Aug 10 '21 at 08:55
  • Really interesting and great explanation. When I go to adapt this with all my other column variables and all my rows, it doesnt capture some of the columns and for some reason provides more rows for each ID. – JamesLancaster Aug 10 '21 at 09:36
  • Check the column names, there might be a different pattern among other columns. – Anoushiravan R Aug 10 '21 at 09:38
  • 1
    id mostid sex age race v0bmi v0smoke v0Comorbidity_Score v0Depression v0wototr v0wototl v0menr v0menl v0pase v0KLGLeft v1KLGLeft v2KLGLeft v3KLGLeft v5KLGLeft v0KLGRight v1KLGRight v2KLGRight v3KLGRight v5KLGRight v1bmi v2bmi v5bmi v5menr v5menl v2wototr v2wototl v3wototr v3wototl v3bmi v5wototr v5wototl Right_A_Baseline Left_A_Baseline – JamesLancaster Aug 10 '21 at 09:39
  • These are the column names, from your code, the rules should still apply, I put all the variables that did not have a v: – JamesLancaster Aug 10 '21 at 09:41
  • pivot_longer(!c(id, sex, age, mostid, race, Right_A_Baseline, Left_A_Baseline) names_to = c("Time", ".value"), names_pattern = "[a-z](\\d)([a-z]{3,})" – JamesLancaster Aug 10 '21 at 09:41
  • I made a modification. There are some upper case letters among lower case ones. Check my solution I replace `[a-z]` with `[A-Za-z]`. – Anoushiravan R Aug 10 '21 at 09:42
  • i have managed to capture all the columns however it still creates 10 rows per ID when you'd expect only 4 or 5 rows for each participant. I have put all the columns that do not have a v into the !c(sex, etc) – JamesLancaster Aug 10 '21 at 09:50
  • 1
    %>%pivot_longer(!c(id, sex, age, mostid, race, Right_A_Baseline, Left_A_Baseline), names_to = c("Time", ".value"), names_pattern = "[a-z](\\d)([A-Za-z]{3,})") – JamesLancaster Aug 10 '21 at 09:52
  • I don't why it happens, as I have to check the data sample. Put a `dput(head(data))` with all variables in your question so that I can check. – Anoushiravan R Aug 10 '21 at 09:52
  • structure(list(id = c(1, 2, 3, 4, 5, 6), mostid = c("M0001", "M0002", "M0003", "M0004", "M0005", "M0006"), sex = c(0, 1, 1, 0, 0, 1), age = c(56, 79, 66, 51, 62, 53), race = c(1, 1, 1, 1, 1, 1), v0bmi = c(26.9, 27.99, 23.82, 29.19, 32.03, 24.79), v0smoke = c(0, 0, 0, 1, 1, 0), v0Comorbidity_Score = c(0, 1, 0, 0, 0, 0), v0Depression = c(0, 0, 0, 0, 0, 0), v0wototr = c(0, 26, 1, 0, 19.75, 8), v0wototl = c(0, 21, 0, 0, 17.5, 8), – JamesLancaster Aug 10 '21 at 09:57
  • v0menr = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), v0menl = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), v0pase = c(217.8, 190.4, 294.4, 147.1, 119.6, 68.4), v0KLGLeft = c(0, 3, 0, 0, 0, 0), v1KLGLeft = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), v2KLGLeft = c(0, 3, 0, 0, 0, 0), v3KLGLeft = c(0, NA, 0, 0, 0, NA), v5KLGLeft = c(0, NA, 0, 0, 0, NA), v0KLGRight = c(0, 3, 1, 0, 0, 0), – JamesLancaster Aug 10 '21 at 09:57
  • v1KLGRight = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), v2KLGRight = c(0, 4, 1, 0, 0, 0), v3KLGRight = c(0, NA, 1, 1.9, 0, 0), v5KLGRight = c(0, NA, 2, 1.9, 0, 0), v1bmi = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), v2bmi = c(27.45, 27.96, 23.59, 25.27, 30.84, 24.36), v5bmi = c(27.32, 23.9, 26.54, 33.37, 25.01, 25.16), v5menr = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), v5menl = c(NA_real_, NA_real_, NA_real_, – JamesLancaster Aug 10 '21 at 09:58
  • NA_real_, NA_real_, NA_real_), v2wototr = c(5, 0, 0, 14, 18, 6), v2wototl = c(5, 0, 0, 10, 18, 7), v3wototr = c(1, NA, 0, 20, 34, 2), v3wototl = c(1, NA, 0, 16, 33, 3), v3bmi = c(27.17, NA, 24.07, 26.69, 33.03, 24.84), v5wototr = c(1, NA, 0, 16.56, 27, 4), v5wototl = c(1, NA, 0, 12.56, 35, 6), Right_A_Baseline = c(2, 2, 2, 1, 0, 2), Left_A_Baseline = c(2, 2, 0, 0, 1, 0), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame")) – JamesLancaster Aug 10 '21 at 09:58
  • Sorry I had to truncate it because of the word limit. – JamesLancaster Aug 10 '21 at 09:58
  • What is wrong with the output? I think it's fine. Almost 5 rows per `ID` like the original answer you posted. There are just some NA values. – Anoushiravan R Aug 10 '21 at 10:05
  • 1
    That has now worked perfectly. I want to thank you very much. I cannot thank you enough, this has been brilliant learning. Presumably your new part of the code means [A-Za-z] that any value capital or no capital letters? – JamesLancaster Aug 10 '21 at 10:13
  • Oh that's my pleasure, glad it helped. Yes it matches both uppercase and lowercase letters. For more info read the following page carefully as regex is an extremely useful asset for any programmer: https://en.wikipedia.org/wiki/Regular_expression and in the end please feel free to accept any answer that helped you. – Anoushiravan R Aug 10 '21 at 10:18
1

For the sake of completeness, here is a solution using melt() and the new measure() function:

library(data.table) # development version 1.14.1 used here
TrialDataLong <- 
  melt(setDT(TrialData), measure.vars = measure(time, value.name, pattern = "v(\\d)(\\w+)"))
TrialDataLong 
   id sex time pase dep
1:  1   M    0   10 Yes
2:  2   F    0   15  No
3:  1   M    1   12 Yes
4:  2   F    1   15 Yes
5:  1   M    2   14 Yes
6:  2   F    2   16 Yes
7:  1   M    3   13 Yes
8:  2   F    3   16 Yes

Reordered for better comparison with OP's expected result:

TrialDataLong[order(id, time)]
   id sex time pase dep
1:  1   M    0   10 Yes
2:  1   M    1   12 Yes
3:  1   M    2   14 Yes
4:  1   M    3   13 Yes
5:  2   F    0   15  No
6:  2   F    1   15 Yes
7:  2   F    2   16 Yes
8:  2   F    3   16 Yes

Data

library(data.table)
TrialData <- fread("
id sex v0pase v1pase v0dep v1dep v2pase v3pase v2dep v3dep
 1   M     10     12   Yes   Yes     14     13   Yes   Yes
 2   F     15     15    No   Yes     16     16   Yes   Yes")
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • For some reason, when i treid this, it says the "measure" function does not exist. I have the latest version of data.table. – JamesLancaster Aug 10 '21 at 08:41
  • @JamesLancaster, Please, double check the version number. Have you installed the _development_ version 1.14.1 which is _not_ available on CRAN, yet? Please, see https://github.com/Rdatatable/data.table/wiki/Installation for instructions to install the _development_ version. Thank you. – Uwe Aug 10 '21 at 09:29
1

You can use tidyr::pivot_longer with names_to and names_pattern argument.

tidyr::pivot_longer(TrialData, 
                    cols = -c(id, sex),
                    names_to = c('Time', '.value'), 
                    names_pattern = 'v(\\d+)([a-z]+)')

#     id sex   Time   pase dep  
#  <dbl> <chr> <chr> <dbl> <chr>
#1     1 M     0        10 Yes  
#2     1 M     1        12 Yes  
#3     1 M     2        14 Yes  
#4     1 M     3        13 Yes  
#5     2 F     0        15 No   
#6     2 F     1        15 Yes  
#7     2 F     2        16 Yes  
#8     2 F     3        16 Yes  
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks very much for this. So i can learn for the future and adapt it, can you explain this part to me 'v(\\d+)([a-z]+)'? I could not find a clear explanation online for this. – JamesLancaster Aug 10 '21 at 08:41
  • There are two capture groups in this regex. `(\\d+)` is one or more digits followed after `v`. This value goes in the `Time` column. `[a-z]+` is one more character values which comes after the number. Since we are using `.value` here it has column of it's own. In this example which is `pase` and `dep`. – Ronak Shah Aug 10 '21 at 08:46