1

First up - I don´t think this is a duplicate post. I have found several great posts and webpages on transforming multiple columns from wide to long format, but none of them are similar to my data since they deal with several columns that are measures of the same thing (in my example that would be A1, A2, A3, A4), and also they do not already contain a long format variable (frame in my case).


Here is my question:

I am working with a dataset containing many variabels measured by two different motion capture systems. Currently my dataset is in wide format, but I´ve realized ggplot works much better with long format, and I therefore wish to transform my data.

Here is a very simplified version of my data:

id <- (rep(1:3, each = 3))
frame <- (rep(1:3, 3))
A1 <- runif(9, min =1, max =100)
B1 <- runif(9, min =1, max =10)
C1 <- runif(9, min =-10, max =10)
A2 <- rnorm(9, mean = A1, sd=1)
B2 <- rnorm(9, mean = B1, sd=1)
C2 <- rnorm(9, mean = C1, sd=1)
df_wide <- as.data.frame.matrix(cbind(id, frame, A1, B1, C1, A2, B2, C2))
rm(id, frame, A1, A2, B1, B2, C1, C2)

df_wide$id <- as.factor(df_wide$id)
df_wide$frame <- as.factor(df_wide$frame)

head(df_wide)

  id frame        A1       B1        C1        A2       B2         C2
1  1     1 50.940395 4.141713 -1.294736 51.324398 4.271260  0.6174782
2  1     2 33.117691 5.044080  1.820367 32.977860 5.506677  0.8811504
3  1     3 50.000625 8.584148 -1.294245 50.603195 8.099262  0.6418580
4  2     1 61.675927 5.269216 -6.002856 61.996378 6.186417 -6.5428624
5  2     2  5.514353 6.570010  5.199728  4.798275 4.955662  5.1502535
6  2     3 51.580086 5.683788  9.831663 50.717459 5.430070 10.9601541

A1 and A2, B1 and B2, are measures of the same type of movements (A and B) made by system 1 and system 2. Each patient has been measured several times as indicated by the frame variable.

I would like my data.frame to look like this:

  id frame system        A        B         C
1  1     1      1 
2  1     1      2 
3  1     2      1 
4  1     2      2 
5  1     3      1 
6  1     3      2 

I have two problems keeping me from solving this:

1) measurements between the two systems are not placed next to each other. Therefore I cant use code like this:

library(tidyr)
df_long <- gather(df_wide, System, A, A1:A2, factor_key=TRUE)

2) My dataset contains close to 120 variabels, so I would like a solution that wouldn't require me to write the code for each variable. I am considering making a loop to solve this, but any help in this regard will be much appreciated as well.

Steen Harsted
  • 1,802
  • 2
  • 21
  • 34

2 Answers2

2

The tidyr approach would be: 1) gather the measures columns, 2) separate the headers into movements (alpha) + system(numeric) using extract with regex, 3) spread movements to headers:

library(tidyr)

df_wide %>% 
    gather(keys, values, -id, -frame) %>% 
    extract(keys, c("movements", "system"), "([a-zA-Z]+)([0-9]+)") %>% 
    spread(movements, values)

#   id frame system         A        B         C
#1   1     1      1 62.175823 9.661748 -9.120404
#2   1     1      2 62.957358 9.229938 -8.814429
#3   1     2      1 22.463641 3.904546  4.059267
#4   1     2      2 22.798492 3.045190  4.663611
#5   1     3      1 13.897632 6.675986 -9.528184
#6   1     3      2 15.036539 6.964412 -8.920507
#7   2     1      1 38.765030 7.735174  8.373283
#8   2     1      2 40.124285 4.947368 10.143035
#9   2     2      1  5.924254 9.358200  9.866305
#10  2     2      2  5.197255 9.859347 10.088928
#11  2     3      1 29.961107 7.451472 -3.143658
#12  2     3      2 31.322740 8.328626 -2.050261
#13  3     1      1 71.010782 6.909414  7.128306
#14  3     1      2 69.860047 7.675693  7.817473
#15  3     2      1 64.985282 1.596932 -3.422237
#16  3     2      2 64.839996 2.828168 -3.826748
#17  3     3      1 70.631159 1.238806  5.398818
#18  3     3      2 70.963814 1.255340  3.728302
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • Thank you. This is a very elegant solution. I have a question about: extract(keys, c("movements", "system"), "([a-zA-Z]+)([0-9]+)") In my real data set the variabels are not called A1 and A2, but rather A and cA. Since I dont understand the last part of the expression it is difficult for me to transform this line into something that works in my dataset. Can you help me again there? – Steen Harsted Oct 22 '17 at 17:56
  • can you give more details about how columns look like? So how can you distinguish between movement and system? – Psidom Oct 22 '17 at 18:15
  • It is identical to the above, but instead of: A1, B1, C1, A2, B2, C2. I have: A, B, C, cA, cB, cC. Movement is: A, B C. System is: 1 if there is no letter in front movement, 2 if there is a c. – Steen Harsted Oct 22 '17 at 18:32
  • In your question you said you have 120+ variables, so I guess this is still a simplified version of what you have. But given columns *A, B, C, cA, cB, cC* you can replace the `extract` line with `extract(keys, c("system", "movements"), "([a-z]*)([A-Z].*)")` and recode `system` column later. – Psidom Oct 22 '17 at 18:47
  • 1
    Yes, you are right it is still a simplified version, but the naming convention is the same. You solution worked worked like a charm, and I now have tidy data. Thank you very much! – Steen Harsted Oct 22 '17 at 19:05
2

Run reshape and then sort the result.

The first 4 lines set up the arguments to reshape. In particular, varying is list(A = c("A1", "A2"), B = c("B1", "B2"), C = c("C1", "C2")) . The last line of code sorts the rows and that could be omitted if the row order is unimportant.

Here the A... columns have the same type as the B... columns and the C... columns but this solution would continue to work even if that were not the case.

No packages are used.

This SO question/answer is similar but there are small differences: Gather multiple date/value columns using tidyr

idvar <- 1:2
nms <- names(df_wide)[-idvar]   # names of non-id variables
varying <- split(nms, sub("\\d+$", "", nms))
v.names <- names(varying)
r <- reshape(df_wide, dir = "long", varying = varying, v.names = v.names, idvar = idvar)
r[order(r$id, r$frame), ]

giving:

      id frame time         A        B          C
1.1.1  1     1    1 50.940395 4.141713 -1.2947360
1.1.2  1     1    2 51.324398 4.271260  0.6174782
1.2.1  1     2    1 33.117691 5.044080  1.8203670
1.2.2  1     2    2 32.977860 5.506677  0.8811504
1.3.1  1     3    1 50.000625 8.584148 -1.2942450
1.3.2  1     3    2 50.603195 8.099262  0.6418580
2.1.1  2     1    1 61.675927 5.269216 -6.0028560
2.1.2  2     1    2 61.996378 6.186417 -6.5428624
2.2.1  2     2    1  5.514353 6.570010  5.1997280
2.2.2  2     2    2  4.798275 4.955662  5.1502535
2.3.1  2     3    1 51.580086 5.683788  9.8316630
2.3.2  2     3    2 50.717459 5.430070 10.9601541

Note: The input in reproducible form is the following -- the code generating the input in the question is not reproducible because random numbers were used without set.seed.

df_wide <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), frame = c(1L, 
2L, 3L, 1L, 2L, 3L), A1 = c(50.940395, 33.117691, 50.000625, 
61.675927, 5.514353, 51.580086), B1 = c(4.141713, 5.04408, 8.584148, 
5.269216, 6.57001, 5.683788), C1 = c(-1.294736, 1.820367, -1.294245, 
-6.002856, 5.199728, 9.831663), A2 = c(51.324398, 32.97786, 50.603195, 
61.996378, 4.798275, 50.717459), B2 = c(4.27126, 5.506677, 8.099262, 
6.186417, 4.955662, 5.43007), C2 = c(0.6174782, 0.8811504, 0.641858, 
-6.5428624, 5.1502535, 10.9601541)), .Names = c("id", "frame", 
"A1", "B1", "C1", "A2", "B2", "C2"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thank you for this thorough reply. I was not aware that (re)generation of the actual values could be of interest. I will remember to use set.seed in the furture. – Steen Harsted Oct 22 '17 at 18:02