2

I need to convert a wide dataset to long and there are 16 columns which must converge to 4. Each 4 columns contain information related to one another and that information must not be "lost" in the transformation.

I have data from a ranking task of four block which has essentially given me a data set where the information is divided into four groups in a wide format. I.e first_image, first_sex, first_score, second_image, second_sex, second_score...

I have tried various combinations of group_by and gather() but I'm nowhere close.

I've already read Reshaping multiple sets of measurement columns (wide format) into single columns (long format) but I'm none the wiser I'm afraid.

I've made some sample data of what one participant's data looks like and I've also made a sample of how I would like the data to look.


library(tidyverse)

sample_dat <- data.frame(subject_id = rep("sj1", 4),
                         first_pick = rep(1, 4),
                         first_image_pick = (c("a", "b", "c", "d")),
                         first_pick_neuro = rep("TD", 4),
                         first_pick_sex = rep("F", 4),
                         second_pick = rep(2, 4),
                         second_image_pick = (c("e", "f", "g", "h")),
                         second_pick_neuro = rep("TD", 4),
                         second_pick_sex = rep("M", 4),
                         third_pick = rep(3, 4),
                         third_image_pick = (c("i", "j", "k", "l")),
                         third_pick_neuro = rep("DS", 4),
                         third_pick_sex = rep("F", 4),
                         fourth_pick = rep(4, 4),
                         fourth_image_pick = (c("m", "n", "o", "p")),
                         fourth_pick_neuro = rep("DS", 4),
                         fourth_pick_sex = rep("M", 4))

Expected output:


final_data <- data.frame(subject_id = rep("sj1", 16),
                         image = c("a", "b", "c", "d",
                                   "e", "f", "g", "h",
                                   "i", "j", "k", "l",
                                   "m", "n", "o", "p"),
                         rank = rep(c(1, 2, 3, 4), each = 4), # from the numbers in the first_pick, second_pick etc. 
                         neuro = rep(c("TD", "DS"), each = 8),
                         sex = rep(c("F", "M", "F", "M"), each = 4))

So far I've tried this, however it only duplicate all the information:


sample_dat_long <- sample_dat %>%
  group_by(subject_id) %>%
  gather(Pick, Image,
         first_image_pick,
         second_image_pick,
         third_image_pick,
         fourth_image_pick)  

So essentially I don't want to lose the information for each image (pick, sex, neuro) when I gather my data.

Any help would be amazing!

jay.sf
  • 60,139
  • 8
  • 53
  • 110
Julia M
  • 133
  • 1
  • 9
  • Do you need `final_data` as output – akrun Jul 08 '19 at 15:51
  • Not necessarily, as long as my data frame looks the way it supposed to. – Julia M Jul 08 '19 at 15:52
  • So, the `final_data` is just to show the structure of expected output? – akrun Jul 08 '19 at 15:56
  • Yes precisely, I've updated the question. – Julia M Jul 08 '19 at 16:03
  • How exactly do the columns in the first data frame correspond to those in the second? Where does the rank come from? You also might have better luck with `data.table::dcast` than with `tidyr::gather`, since it has the ability to reshape with multiple columns built in – camille Jul 08 '19 at 16:19
  • My apologies, the rank comes from first_pick, second_pick, third_pick and fourth_pick. – Julia M Jul 08 '19 at 16:20
  • 1
    try with `melt` `melt(setDT(sample_dat), measure = patterns("image", "neuro", "sex"), value.name = c("image", "neuro", "sex"))` – akrun Jul 08 '19 at 16:26
  • What package is that from? – Julia M Jul 08 '19 at 16:29
  • It is from `data.table`. BTW, what is the 'rank' column? – akrun Jul 08 '19 at 16:37
  • Well it's just a score whether a picture was picked first (1), second (2), third (3) or fourth (4) in a task. In the final_data table it comes from the first_pick, second_pick etc. in the sample table. – Julia M Jul 08 '19 at 16:40
  • Can you check `melt(setDT(sample_dat), measure = patterns("image", "neuro", "sex"), value.name = c("image", "neuro", "sex"), variable.name = 'rank')[, .(subject_id, rank, image, neuro, sex)]` whether this s the expected – akrun Jul 08 '19 at 16:55

3 Answers3

2

We can do this with melt from data.table which can take multiple measure patterns to reshape from 'wide' to 'long' format. Here, the column names with substrings 'image', 'neuro', 'sex' are reshaped into separate columns to get the expected output

library(data.table)
melt(setDT(sample_dat), measure = patterns("image", "neuro", "sex"), 
   value.name = c("image", "neuro", "sex"), variable.name = 'rank')[, 
    .(subject_id, rank, image, neuro, sex)]
akrun
  • 874,273
  • 37
  • 540
  • 662
1

I guess you can do it column by column, as you want only 4 columns in the end. Getting the indices of the column that should be put together in the first one (if i understood correctly):

  ind1 = seq(2,length(sample_dat[1,]), 4) 
  column1 = gather( sample_dat[,ind1] )[2]

And then repeat for all 3 other columns:

  ind2 = seq(3,length(sample_dat[1,]), 4) 
  column2 = gather( sample_dat[,ind2] )[2]

And you could even do this 4 columns with a for loop to not do it "manually". And combine them back into a dataframe afterward

ZiGaelle
  • 744
  • 1
  • 9
  • 21
  • Hmm I tried this on the sample data and the subject is column is not there. I'm hesitant to use this method as I actually have 78 subjects and am worried that something's gonna go wrong when columns aren't grouped together. – Julia M Jul 08 '19 at 16:23
1

It would be worth to consider good column names (i.e. "<variable_chr>.<time_num>"). But we can fix it in a second.

pfx <- c("first", "second", "third", "fourth")

names(sample_dat)[-1] <- sapply(names(sample_dat)[-1], function(x) {
  x <- gsub("_pick", "", x)
  if (lengths(strsplit(x, "_")) == 2)
    sub("(^.*)_(.*)", paste("\\2", which(pfx == sub("(^.*)_.+", "\\1", x)), sep="."), x)
  else
    paste0("rank.", which(pfx == x))
})

names(sample_dat)  # good names now
# [1] "subject_id" "rank.1"     "image.1"    "neuro.1"    "sex.1"      "rank.2"    
# [7] "image.2"    "neuro.2"    "sex.2"      "rank.3"     "image.3"    "neuro.3"   
# [13] "sex.3"      "rank.4"     "image.4"    "neuro.4"    "sex.4" 

Thereafter we may easily use reshape.

reshape(sample_dat, idvar="subject_id", varying=2:17, direction="long", 
        new.row.names=seq(ncol(sample_dat) - 1))
#    subject_id time rank image neuro sex
# 1         sj1    1    1     a    TD   F
# 2         sj1    1    1     b    TD   F
# 3         sj1    1    1     c    TD   F
# 4         sj1    1    1     d    TD   F
# 5         sj1    2    2     e    TD   M
# 6         sj1    2    2     f    TD   M
# 7         sj1    2    2     g    TD   M
# 8         sj1    2    2     h    TD   M
# 9         sj1    3    3     i    DS   F
# 10        sj1    3    3     j    DS   F
# 11        sj1    3    3     k    DS   F
# 12        sj1    3    3     l    DS   F
# 13        sj1    4    4     m    DS   M
# 14        sj1    4    4     n    DS   M
# 15        sj1    4    4     o    DS   M
# 16        sj1    4    4     p    DS   M

Data

sample_dat <- structure(list(subject_id = structure(c(1L, 1L, 1L, 1L), .Label = "sj1", class = "factor"), 
    first_pick = c(1, 1, 1, 1), first_image_pick = structure(1:4, .Label = c("a", 
    "b", "c", "d"), class = "factor"), first_pick_neuro = structure(c(1L, 
    1L, 1L, 1L), .Label = "TD", class = "factor"), first_pick_sex = structure(c(1L, 
    1L, 1L, 1L), .Label = "F", class = "factor"), second_pick = c(2, 
    2, 2, 2), second_image_pick = structure(1:4, .Label = c("e", 
    "f", "g", "h"), class = "factor"), second_pick_neuro = structure(c(1L, 
    1L, 1L, 1L), .Label = "TD", class = "factor"), second_pick_sex = structure(c(1L, 
    1L, 1L, 1L), .Label = "M", class = "factor"), third_pick = c(3, 
    3, 3, 3), third_image_pick = structure(1:4, .Label = c("i", 
    "j", "k", "l"), class = "factor"), third_pick_neuro = structure(c(1L, 
    1L, 1L, 1L), .Label = "DS", class = "factor"), third_pick_sex = structure(c(1L, 
    1L, 1L, 1L), .Label = "F", class = "factor"), fourth_pick = c(4, 
    4, 4, 4), fourth_image_pick = structure(1:4, .Label = c("m", 
    "n", "o", "p"), class = "factor"), fourth_pick_neuro = structure(c(1L, 
    1L, 1L, 1L), .Label = "DS", class = "factor"), fourth_pick_sex = structure(c(1L, 
    1L, 1L, 1L), .Label = "M", class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))
jay.sf
  • 60,139
  • 8
  • 53
  • 110