There are already a number of questions about stats::reshape
, but none I found are much more understandable than the fairly incomprehensible documentation for the function itself. One user talks of migraines, another tried to make a more sensible wrapper. I also have nightmares about stats::reshape
, but sometimes there doesn't seem to be another smart way.
Here's my current situation. I'm trying to simulate some data:
#parameters
N = 5
A_trait_mean = 100
B_trait_mean = 90
C_trait_mean = 95
C_admix_A_mean = 50
C_admix_A_SD = 20
trait_SD = 10
#generate A & B
set.seed(123)
{
df = data.frame(trait_A = rnorm(N, A_trait_mean, trait_SD),
trait_B = rnorm(N, B_trait_mean, trait_SD),
admixA_C = rnorm(N, C_admix_A_mean, C_admix_A_SD))
}
#clip C values -- more than 100% admixture is impossible
df$admixA_C[df$admixA_C > 100] = 100
df$admixA_C[df$admixA_C < 0] = 0
#mate
for (row_i in 1:nrow(df)) {
tmp_admix_A = df$admixA_C[row_i]/100
df$trait_C[row_i] = df$trait_A[row_i] * tmp_admix_A + df$trait_B[row_i] * (1 - tmp_admix_A)
}
This gives:
> df
trait_A trait_B admixA_C trait_C
1 94.39524 107.15065 74.48164 97.65021
2 97.69823 94.60916 57.19628 96.37599
3 115.58708 77.34939 58.01543 99.53315
4 100.70508 83.13147 52.21365 92.30730
5 101.29288 85.54338 38.88318 91.66729
So far so good. However, to plot data like this, ggplot2 wants it in long format. However, we need the current 4 columns to become 3 new columns, not just 2. We need: a group column, a column for trait and a column for admixA.
Here's what I came up with:
> df_long = melt(df)
Using as id variables
> df_long
variable value
1 trait_A 94.39524
2 trait_A 97.69823
3 trait_A 115.58708
4 trait_A 100.70508
5 trait_A 101.29288
6 trait_B 107.15065
7 trait_B 94.60916
8 trait_B 77.34939
9 trait_B 83.13147
10 trait_B 85.54338
11 admixA_C 74.48164
12 admixA_C 57.19628
13 admixA_C 58.01543
14 admixA_C 52.21365
15 admixA_C 38.88318
16 trait_C 97.65021
17 trait_C 96.37599
18 trait_C 99.53315
19 trait_C 92.30730
20 trait_C 91.66729
This is very wrong. The admixA values are mixed in and there is no grouping variable. So I look to stats::reshape
because I have some vague memories of it being able to handle this kind of problem.
So I read the documentation and make an attempt:
df_long = reshape(df, varying = 1:4, sep = "_", direction = "long")
# 'varying' arguments must be the same length
No luck and a fairly cryptic error. Same length... as what?
Then I Google and see that some are saying that one should use a list
.
> df_long = reshape(df, varying = list(1:4), sep = "_", direction = "long")
> df_long
time trait_A id
1.1 1 94.39524 1
2.1 1 97.69823 2
3.1 1 115.58708 3
4.1 1 100.70508 4
5.1 1 101.29288 5
1.2 2 107.15065 1
2.2 2 94.60916 2
3.2 2 77.34939 3
4.2 2 83.13147 4
5.2 2 85.54338 5
1.3 3 74.48164 1
2.3 3 57.19628 2
3.3 3 58.01543 3
4.3 3 52.21365 4
5.3 3 38.88318 5
1.4 4 97.65021 1
2.4 4 96.37599 2
3.4 4 99.53315 3
4.4 4 92.30730 4
5.4 4 91.66729 5
No error, but the group variable is wrong and it has given a wrong name to the values var (should be just "trait") and included the values from admixA.
Maybe...
df_long = reshape(df, varying = list(1:4), sep = "_", direction = "long", v.names = c("trait", "admixA"))
Error in varying[[i]] : subscript out of bounds
Guess not, but perhaps without the list
?
> df_long = reshape(df, varying = 1:4, sep = "_", direction = "long", v.names = c("trait", "admixA"))
> df_long
time trait admixA id
1.1 1 107.15065 94.39524 1
2.1 1 94.60916 97.69823 2
3.1 1 77.34939 115.58708 3
4.1 1 83.13147 100.70508 4
5.1 1 85.54338 101.29288 5
1.2 2 97.65021 74.48164 1
2.2 2 96.37599 57.19628 2
3.2 2 99.53315 58.01543 3
4.2 2 92.30730 52.21365 4
5.2 2 91.66729 38.88318 5
We are getting closer. However, the data are wrong and mixed and there is no correct grouping variable.
So after trying some other stuff more or less blind I give up and use a manual, many-step approach:
#give up and use multiple rounds of melt() manually
df_long = melt(df[c(1, 2, 4)], variable_name = "group")
#fix group variable
df_long$group = str_match(df_long$group, "_(\\w)")[, 2]
#move "value" to "trait"
df_long$trait = df_long$value; df_long$value = NULL
#add admixA
df_long$admixA = c(rep(NA, 2 * nrow(df)), df$admixA_C)
Which gives:
> df_long
group trait admixA
1 A 94.39524 NA
2 A 97.69823 NA
3 A 115.58708 NA
4 A 100.70508 NA
5 A 101.29288 NA
6 B 107.15065 NA
7 B 94.60916 NA
8 B 77.34939 NA
9 B 83.13147 NA
10 B 85.54338 NA
11 C 97.65021 74.48164
12 C 96.37599 57.19628
13 C 99.53315 58.01543
14 C 92.30730 52.21365
15 C 91.66729 38.88318
Hurray! But how does one actually use stats::reshape
to get to that result?
And what kind of better, non-migraine inducing solution should one use to do this kind of transformation?