0

There are already a number of questions about stats::reshape, but none I found are much more understandable than the fairly incomprehensible documentation for the function itself. One user talks of migraines, another tried to make a more sensible wrapper. I also have nightmares about stats::reshape, but sometimes there doesn't seem to be another smart way.

Here's my current situation. I'm trying to simulate some data:

#parameters
N = 5
A_trait_mean = 100
B_trait_mean = 90
C_trait_mean = 95
C_admix_A_mean = 50
C_admix_A_SD = 20
trait_SD = 10

#generate A & B
set.seed(123)
{
df = data.frame(trait_A = rnorm(N, A_trait_mean, trait_SD),
                trait_B = rnorm(N, B_trait_mean, trait_SD),
                admixA_C = rnorm(N, C_admix_A_mean, C_admix_A_SD))
}


#clip C values -- more than 100% admixture is impossible
df$admixA_C[df$admixA_C > 100] = 100
df$admixA_C[df$admixA_C < 0] = 0

#mate
for (row_i in 1:nrow(df)) {
  tmp_admix_A = df$admixA_C[row_i]/100
  df$trait_C[row_i] = df$trait_A[row_i] * tmp_admix_A + df$trait_B[row_i] * (1 - tmp_admix_A)
}

This gives:

> df
    trait_A   trait_B admixA_C  trait_C
1  94.39524 107.15065 74.48164 97.65021
2  97.69823  94.60916 57.19628 96.37599
3 115.58708  77.34939 58.01543 99.53315
4 100.70508  83.13147 52.21365 92.30730
5 101.29288  85.54338 38.88318 91.66729

So far so good. However, to plot data like this, ggplot2 wants it in long format. However, we need the current 4 columns to become 3 new columns, not just 2. We need: a group column, a column for trait and a column for admixA.

Here's what I came up with:

> df_long = melt(df)
Using  as id variables
> df_long
   variable     value
1   trait_A  94.39524
2   trait_A  97.69823
3   trait_A 115.58708
4   trait_A 100.70508
5   trait_A 101.29288
6   trait_B 107.15065
7   trait_B  94.60916
8   trait_B  77.34939
9   trait_B  83.13147
10  trait_B  85.54338
11 admixA_C  74.48164
12 admixA_C  57.19628
13 admixA_C  58.01543
14 admixA_C  52.21365
15 admixA_C  38.88318
16  trait_C  97.65021
17  trait_C  96.37599
18  trait_C  99.53315
19  trait_C  92.30730
20  trait_C  91.66729

This is very wrong. The admixA values are mixed in and there is no grouping variable. So I look to stats::reshape because I have some vague memories of it being able to handle this kind of problem.

So I read the documentation and make an attempt:

df_long = reshape(df, varying = 1:4, sep = "_", direction = "long")
# 'varying' arguments must be the same length

No luck and a fairly cryptic error. Same length... as what?

Then I Google and see that some are saying that one should use a list.

> df_long = reshape(df, varying = list(1:4), sep = "_", direction = "long")
> df_long
    time   trait_A id
1.1    1  94.39524  1
2.1    1  97.69823  2
3.1    1 115.58708  3
4.1    1 100.70508  4
5.1    1 101.29288  5
1.2    2 107.15065  1
2.2    2  94.60916  2
3.2    2  77.34939  3
4.2    2  83.13147  4
5.2    2  85.54338  5
1.3    3  74.48164  1
2.3    3  57.19628  2
3.3    3  58.01543  3
4.3    3  52.21365  4
5.3    3  38.88318  5
1.4    4  97.65021  1
2.4    4  96.37599  2
3.4    4  99.53315  3
4.4    4  92.30730  4
5.4    4  91.66729  5

No error, but the group variable is wrong and it has given a wrong name to the values var (should be just "trait") and included the values from admixA.

Maybe...

df_long = reshape(df, varying = list(1:4), sep = "_", direction = "long", v.names = c("trait", "admixA"))
Error in varying[[i]] : subscript out of bounds

Guess not, but perhaps without the list?

> df_long = reshape(df, varying = 1:4, sep = "_", direction = "long", v.names = c("trait", "admixA"))
> df_long
    time     trait    admixA id
1.1    1 107.15065  94.39524  1
2.1    1  94.60916  97.69823  2
3.1    1  77.34939 115.58708  3
4.1    1  83.13147 100.70508  4
5.1    1  85.54338 101.29288  5
1.2    2  97.65021  74.48164  1
2.2    2  96.37599  57.19628  2
3.2    2  99.53315  58.01543  3
4.2    2  92.30730  52.21365  4
5.2    2  91.66729  38.88318  5

We are getting closer. However, the data are wrong and mixed and there is no correct grouping variable.

So after trying some other stuff more or less blind I give up and use a manual, many-step approach:

#give up and use multiple rounds of melt() manually
df_long = melt(df[c(1, 2, 4)], variable_name = "group")
#fix group variable
df_long$group = str_match(df_long$group, "_(\\w)")[, 2]
#move "value" to "trait"
df_long$trait = df_long$value; df_long$value = NULL
#add admixA
df_long$admixA = c(rep(NA, 2 * nrow(df)), df$admixA_C)

Which gives:

> df_long
   group     trait   admixA
1      A  94.39524       NA
2      A  97.69823       NA
3      A 115.58708       NA
4      A 100.70508       NA
5      A 101.29288       NA
6      B 107.15065       NA
7      B  94.60916       NA
8      B  77.34939       NA
9      B  83.13147       NA
10     B  85.54338       NA
11     C  97.65021 74.48164
12     C  96.37599 57.19628
13     C  99.53315 58.01543
14     C  92.30730 52.21365
15     C  91.66729 38.88318

Hurray! But how does one actually use stats::reshape to get to that result?

And what kind of better, non-migraine inducing solution should one use to do this kind of transformation?

Community
  • 1
  • 1
CoderGuy123
  • 6,219
  • 5
  • 59
  • 89
  • 1
    Just use `stack` - `cbind(df[3],stack(df[c(1,2,4)]))` or `reshape(df, direction="long", varying=c(1,2,4), sep="_", timevar="group")` – thelatemail Nov 06 '15 at 06:31

2 Answers2

2

reshape can sort this out for you like:

reshape(df, direction="long", varying=c(1,2,4), sep="_", timevar="group")
#    admixA_C group     trait id
#1.A 74.48164     A  94.39524  1
#2.A 57.19628     A  97.69823  2
#3.A 58.01543     A 115.58708  3
#4.A 52.21365     A 100.70508  4
#5.A 38.88318     A 101.29288  5
#1.B 74.48164     B 107.15065  1
#2.B 57.19628     B  94.60916  2
#...

You don't have an idvar, so don't specify one.

varying only lists the 3 columns which need to be made "long"

Specifying sep="_" means that reshape will guess the categories for a time variable and that your 3 related "trait_A|B|C" variables all represent a "trait".

timevar="group" just gives an appropriate label to your grouping variable.

Since admixA_C doesn't vary, it just gets repeated in the remaining column.

You can throw out the id column since you don't need it now.

thelatemail
  • 91,185
  • 12
  • 128
  • 188
1

Same thing using melt(...), although because of the way you've named your columns, the reshape(...) solution might be a little shorter.

library(reshape2)
df.melt <- melt(df, measure.vars=c(1,2,4), variable.name="group", value.name="trait")
df.melt <- transform(df.melt, group=gsub("^.*_","",group))
df.melt <- transform(df.melt, admixA_C=ifelse(group=="C",admixA_C,NA))
df.melt
#    admixA_C group     trait
# 1        NA     A  94.39524
# 2        NA     A  97.69823
# 3        NA     A 115.58708
# 4        NA     A 100.70508
# 5        NA     A 101.29288
# 6        NA     B 107.15065
# 7        NA     B  94.60916
# 8        NA     B  77.34939
# 9        NA     B  83.13147
# 10       NA     B  85.54338
# 11 74.48164     C  97.65021
# 12 57.19628     C  96.37599
# 13 58.01543     C  99.53315
# 14 52.21365     C  92.30730
# 15 38.88318     C  91.66729

As an aside, you can replace your for loop with:

df$trait_C <- with(df, trait_A*admixA_C/100 + trait_B*(1-admixA_C/100))
jlhoward
  • 58,004
  • 7
  • 97
  • 140