I'm asking this question because, even though there are many similar questions on this website (like this, this, and this), none of them matches my situation exactly. Actually, this link asks the same question as mine, but the answer there is unclear to me and raises the question that I am about to ask.
I have a dataset from which I am constructing a stacked barplot, and I want to know how I can arrange the stacked barplot so that "similar" individuals cluster together. I work in bioinformatics, and my dataset is a d-by-n matrix. In this toy dataset, there are d = 10 ancestral populations and n = 5 individuals:
> a
V1 V2 V3 V4 V5
1 0.534410243 0.009358740 0.011295181 0.2141751740 0.0030129254
2 0.026653603 0.372426720 0.447847534 0.0179177507 0.4072904477
3 0.193317915 0.003605024 0.003186611 0.4832114736 0.0007095471
4 0.111881585 0.000000000 0.000000000 0.2296213741 0.0119233461
5 0.089696570 0.591163629 0.509774416 0.0032542030 0.5535847030
6 0.007543558 0.000000000 0.000000000 0.0364907757 0.0013148362
7 0.004862942 0.000000000 0.002123909 0.0146682272 0.0004053690
8 0.009276195 0.011710457 0.014367894 0.0000000000 0.0000000000
9 0.006903171 0.004314528 0.011404455 0.0000000000 0.0126889937
10 0.015454219 0.007420903 0.000000000 0.0006610215 0.0090698319
All columns add up to 1. I create a stacked barplot like so:
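library(dplyr)    # %>%, mutate(), group_by(), arrange()
library(tidyr)    # gather()
library(ggplot2)  # ggplot(), geom_bar(), scale_fill_manual()
pop <- rownames(a)
a <- a %>% mutate(pop = rownames(a))
# reshape to long format: one row per (population, individual) pair
a_long <- gather(a, key, value, -pop)
# trying to create a similarity index
a_long <- a_long %>% group_by(key) %>%
  mutate(mean = mean(value)) %>%
  arrange(desc(mean))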
# looking at some of the expanded dataset
> a_long[1:20,]
# A tibble: 20 x 4
# Groups: key [2]
pop key value mean
<chr> <chr> <dbl> <dbl>
1 1 V2 0.00936 0.1
2 2 V2 0.372 0.1
3 3 V2 0.00361 0.1
4 4 V2 0 0.1
5 5 V2 0.591 0.1
6 6 V2 0 0.1
7 7 V2 0 0.1
8 8 V2 0.0117 0.1
9 9 V2 0.00431 0.1
10 10 V2 0.00742 0.1
11 1 V4 0.214 0.1
12 2 V4 0.0179 0.1
13 3 V4 0.483 0.1
14 4 V4 0.230 0.1
15 5 V4 0.00325 0.1
16 6 V4 0.0365 0.1
17 7 V4 0.0147 0.1
18 8 V4 0 0.1
19 9 V4 0 0.1
20 10 V4 0.000661 0.1
# colors
v_colors <- c("#440154FF", "#443B84FF", "#34618DFF", "#404588FF", "#1FA088FF", "#40BC72FF",
"#67CC5CFF", "#A9DB33FF", "#DDE318FF", "#FDE725FF")
plot <- ggplot(a_long, aes(x = key, y = value, fill = pop))
plot + geom_bar(position="stack", stat="identity") + scale_fill_manual(values = v_colors)
How can I make the output look neater, e.g. with the individuals that have a higher proportion of population 5 ancestry next to each other on the x-axis? So far, I have tried to compute the "mean" of value for each individual, but it didn't work since it's not a good measure: every column sums to 1, so the mean is 0.1 for every individual. How can I create a similarity index that tells me how similar individual 1 is to individual 2, and then how do I order them on the x-axis so that they look well-clustered (e.g. like the barplots in this figure)?
Thanks!
One last thing: if you want to re-create the dataset a, here is the code:
v1 = c(0.534410243, 0.026653603, 0.193317915, 0.111881585, 0.089696570, 0.007543558, 0.004862942, 0.009276195, 0.006903171, 0.015454219)
v2 = c(0.009358740, 0.372426720, 0.003605024, 0.000000000, 0.591163629, 0.000000000, 0.000000000, 0.011710457, 0.004314528, 0.007420903)
v3 = c(0.011295181, 0.447847534, 0.003186611, 0.000000000, 0.509774416, 0.000000000, 0.002123909, 0.014367894, 0.011404455, 0.000000000)
v4 = c(0.2141751740, 0.0179177507, 0.4832114736, 0.2296213741, 0.0032542030, 0.0364907757, 0.0146682272, 0.0000000000, 0.0000000000, 0.0006610215)
v5 = c(0.0030129254, 0.4072904477, 0.0007095471, 0.0119233461, 0.5535847030, 0.0013148362, 0.0004053690, 0.0000000000, 0.0126889937, 0.0090698319)
a = data.frame(V1 = v1, V2 = v2, V3 = v3, V4 = v4, V5 = v5)
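To make the ordering question concrete, here is a rough sketch of the kind of thing I have in mind, using only base R. It assumes that Euclidean distance between the ancestry columns is an acceptable dissimilarity and that complete-linkage hierarchical clustering gives a sensible left-to-right order. Both choices are guesses on my part, and the names `d`, `hc`, `ind_order`, and `groups` are just illustrative. The data are repeated so the snippet runs on its own.

```r
# Toy data from above: rows = ancestral populations, columns = individuals.
a <- data.frame(
  V1 = c(0.534410243, 0.026653603, 0.193317915, 0.111881585, 0.089696570,
         0.007543558, 0.004862942, 0.009276195, 0.006903171, 0.015454219),
  V2 = c(0.009358740, 0.372426720, 0.003605024, 0.000000000, 0.591163629,
         0.000000000, 0.000000000, 0.011710457, 0.004314528, 0.007420903),
  V3 = c(0.011295181, 0.447847534, 0.003186611, 0.000000000, 0.509774416,
         0.000000000, 0.002123909, 0.014367894, 0.011404455, 0.000000000),
  V4 = c(0.2141751740, 0.0179177507, 0.4832114736, 0.2296213741, 0.0032542030,
         0.0364907757, 0.0146682272, 0.0000000000, 0.0000000000, 0.0006610215),
  V5 = c(0.0030129254, 0.4072904477, 0.0007095471, 0.0119233461, 0.5535847030,
         0.0013148362, 0.0004053690, 0.0000000000, 0.0126889937, 0.0090698319)
)

# Assumption: Euclidean distance between ancestry profiles is a reasonable
# dissimilarity. dist() compares rows, so transpose to put individuals in rows.
d <- dist(t(a), method = "euclidean")

# Assumption: complete-linkage hierarchical clustering (the hclust() default).
hc <- hclust(d, method = "complete")

# Left-to-right leaf order of the dendrogram, as individual names.
ind_order <- hc$labels[hc$order]
print(ind_order)

# Optional check: cut the tree into two groups to see who clusters with whom.
groups <- cutree(hc, k = 2)
print(groups)
```

If this is a sensible direction, I would presumably carry the order into the plot with `a_long$key <- factor(a_long$key, levels = ind_order)` before calling `ggplot()`, since ggplot2 orders a discrete axis by factor levels. I am not sure whether Euclidean distance is the right measure for proportions like these.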