
I am an ecologist, mainly using the vegan R package.

I have 2 matrices (samples x species abundances) (see the data below):

matrix 1: nrow = 6 replicates * 24 sites, ncol = 15 species abundances (fish)
matrix 2: nrow = 3 replicates * 24 sites, ncol = 10 species abundances (invertebrates)

The sites are the same in both matrices. I want to get the overall Bray-Curtis dissimilarity (considering both matrices) among pairs of sites. I see 2 options:

option 1: averaging fish and macro-invertebrate abundances over replicates (at the site scale), cbind-ing the two mean abundance matrices (nrow = 24 sites, ncol = 15 + 10 mean abundances) and computing Bray-Curtis.

option 2: for each assemblage, computing the Bray-Curtis dissimilarity among pairs of samples and then the distances among site centroids, and finally summing the 2 distance matrices.

In case I am not clear, I carried out these 2 operations in the R code below.

Please, could you tell me whether option 2 is correct and more appropriate than option 1?

thank you in advance.

Pierre

Below are the R code examples.

GENERATING DATA

library(plyr);library(vegan)

#assemblage 1: 15 fish species, 6 replicates per site
a1.env=data.frame(
  Habitat=paste("H",gl(2,12*6),sep=""),
  Site=paste("S",gl(24,6),sep=""),
  Replicate=rep(paste("R",1:6,sep=""),24))

summary(a1.env)

a1.bio=as.data.frame(replicate(15,rpois(144,sample(1:10,1))))

names(a1.bio)=paste("F",1:15,sep="")

a1.bio[1:72,]=2*a1.bio[1:72,] # double fish abundances in habitat H1 (first 72 rows) to create a habitat effect

#assemblage 2: 10 taxa of macro-invertebrates, 3 replicates per site

a2.env=a1.env[a1.env$Replicate%in%c("R1","R2","R3"),]

summary(a2.env)

a2.bio=as.data.frame(replicate(10,rpois(72,sample(10:100,1))))

names(a2.bio)=paste("I",1:10,sep="")

a2.bio[1:36,]=0.5*a2.bio[1:36,] # halve invertebrate abundances in habitat H1 (first 36 rows)


#environmental data at the site scale

env=unique(a1.env[,c("Habitat","Site")])

env=env[order(env$Site),]

OPTION 1, averaging abundances and cbind

a1.bio.mean=ddply(cbind(a1.bio,a1.env),.(Habitat,Site),numcolwise(mean))

a1.bio.mean=a1.bio.mean[order(a1.bio.mean$Site),]

a2.bio.mean=ddply(cbind(a2.bio,a2.env),.(Habitat,Site),numcolwise(mean))

a2.bio.mean=a2.bio.mean[order(a2.bio.mean$Site),]

bio.mean=cbind(a1.bio.mean[,-c(1:2)],a2.bio.mean[,-c(1:2)])

dist.mean=vegdist(sqrt(bio.mean),"bray")

OPTION 2, computing for each assemblage the distances among centroids and summing the 2 distance matrices

a1.dist=vegdist(sqrt(a1.bio),"bray")

a1.coord.centroid=betadisper(a1.dist,a1.env$Site)$centroids

a1.dist.centroid=vegdist(a1.coord.centroid,"eucl")

a2.dist=vegdist(sqrt(a2.bio),"bray")

a2.coord.centroid=betadisper(a2.dist,a2.env$Site)$centroids

a2.dist.centroid=vegdist(a2.coord.centroid,"eucl")

summing up the two distance matrices using Gavin Simpson's fuse()

dist.centroid=fuse(a1.dist.centroid,a2.dist.centroid,weights=c(15/25,10/25))

summing up the two Euclidean distance matrices (thanks to Jari Oksanen's correction)

dist.centroid=sqrt(a1.dist.centroid^2 + a2.dist.centroid^2)

and the 'coord.centroid' below for further distance-based analyses (is it correct?)

coord.centroid=cmdscale(dist.centroid,k=23,add=TRUE)$points # with add=TRUE, cmdscale() returns a list; the coordinates are in $points

COMPARING OPTIONS 1 AND 2

pco.mean=cmdscale(vegdist(sqrt(bio.mean),"bray"))

pco.centroid=cmdscale(dist.centroid)

comparison=procrustes(pco.centroid,pco.mean)

protest(pco.centroid,pco.mean)
Pierre

3 Answers


An easier solution is just to combine the two dissimilarity matrices flexibly, by weighting each matrix. The weights need to sum to 1. For two dissimilarity matrices the fused dissimilarity matrix is

d.fused = (w * d.x) + ((1 - w) * d.y)

where w is a numeric scalar (length 1 vector) weight. If you have no reason to weight one of the sets of dissimilarities more than the other, just use w = 0.5.
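
For illustration, a minimal hand-rolled sketch of this formula (the community matrices x and y below are made-up placeholders; note that fuse(), shown next, additionally rescales each dissimilarity matrix to a common maximum, as Jari Oksanen's answer explains):

library(vegan)
## two made-up community matrices sharing the same 10 samples
x <- matrix(rpois(100, 5), nrow = 10)
y <- matrix(rpois(100, 5), nrow = 10)
d.x <- vegdist(x, method = "bray")
d.y <- vegdist(y, method = "bray")
w <- 0.5                               ## equal weight for each matrix
d.fused <- (w * d.x) + ((1 - w) * d.y) ## arithmetic on "dist" objects keeps the "dist" class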

I have a function to do this for you in my analogue package: fuse(). The example from ?fuse (with the needed library() calls added) is

 library(vegan)     ## for vegdist
 library(analogue)  ## for fuse
 train1 <- data.frame(matrix(abs(runif(100)), ncol = 10))
 train2 <- data.frame(matrix(sample(c(0,1), 100, replace = TRUE),
                      ncol = 10))
 rownames(train1) <- rownames(train2) <- LETTERS[1:10]
 colnames(train1) <- colnames(train2) <- as.character(1:10)

 d1 <- vegdist(train1, method = "bray")
 d2 <- vegdist(train2, method = "jaccard")

 dd <- fuse(d1, d2, weights = c(0.6, 0.4))
 dd
 str(dd)

This idea is used in supervised Kohonen networks (supervised SOMs) to bring multiple layers of data into a single analysis.

analogue works closely with vegan so there won't be any issues running the two packages side by side.

Gavin Simpson
  • Thank you @gavin-simpson, you answered one of my (not well specified) questions: how to sum up distance matrices. I then modified my code from "dist.centroid=a1.dist.centroid+a2.dist.centroid" to "dist.centroid=fuse(a1.dist.centroid,a2.dist.centroid,weights=c(.5,.5))" – Pierre Jan 28 '14 at 09:55
  • Another part of my question was: when we want Bray-Curtis dissimilarities among pairs of sites, while we have a matrix whose rows are combinations of sites & replicates, what is the best solution: OPTION 1) first averaging abundances over replicates to get mean abundances per site, and then computing the distances, versus OPTION 2) first computing the distances on the raw abundances (site-replicates as samples) and then computing the distances among centroids – Pierre Jan 28 '14 at 10:12
  • The last parts of my question: do I use the combination of vegdist and betadisper correctly for getting distances among centroids (see my post)? And may cmdscale(a.distance.matrix, k = n-1) be fed, as coordinates of n sites, to an ordination method such as RDA in order to do a distance-based RDA? I know that vegan::capscale() is the most appropriate for this example, but I want to use a home-made function that uses Euclidean distances – Pierre Jan 28 '14 at 10:20
  • If you are doing this by site & have n replicates per site, then I suppose centroids are one way of doing this and you are computing the distance between them almost correctly. I think you want `method = "euclidean"` in the `vegdist()`, as the Euclidean distance between points in PCoA space (taking all PCoA dimensions) is equivalent to the Bray-Curtis between those points (if Bray-Curtis was the original dissimilarity matrix). There is a slight complexity in Bray-Curtis not being fully embeddable in a Euclidean space, but I think that issue is minor. – Gavin Simpson Jan 28 '14 at 14:53
  • To be honest though, I'm not convinced I'd throw away all the replicate information. Vegan (in combination with the permute pkg) can handle replicate data like this. If you include `Site` as a variable (which will be constant within sites) you get the between site variance explained, and using a permutation test, you should be able to get away with simple randomisation. If you are testing the effect of variables that vary at the Site level only, then you'll need to permute Sites and leave within-site ordering fixed. – Gavin Simpson Jan 28 '14 at 15:02
  • Thank you @Gavin Simpson for your reply. I can't work with replicates since, for one site, I have 6 replicates for fish and 3 replicates for macroinvertebrates (not among the 6 former ones; I used 2 distinct sampling methods). This is why, for merging the fish and invertebrate abundance data, I need to work at the site scale. Moreover, I will relate this fish-macroinvertebrate dissimilarity matrix to some macroalgae biomasses (a third matrix) that I estimated using a third method (3 other replicates per site). Hence, the resolution where all my data match is the site scale. – Pierre Jan 29 '14 at 18:35
  • @Pierre You need to comment on Jari's Answer for the @ notifications to work. He won't see them if you comment on my answer. – Gavin Simpson Jan 29 '14 at 18:36
  • One last element: the sampling area of a single replicate is quite small relative to the spatial variation of most of the sampled species (this was due to methodological constraints). Absence of a species in 4 out of 6 replicates within a site may therefore be due to low abundance (even though the species, not a 'rare' one, was present) and/or to fine-spatial-scale interactions among fish individuals (e.g. a prey fish avoids a predator's attack range). From this perspective, averaging abundances before computing Bray-Curtis (option 1) seems better?! – Pierre Jan 29 '14 at 18:49
  • Understood, I will move my reply. Sorry, I am new on this forum. Thank you for warning/teaching me – Pierre Jan 29 '14 at 18:51
  • Reading my last comment, I think I forgot to mention something essential: my aim. I want to compare fish-macroinvertebrate assemblages between habitats (groups of sites). I am not interested in fine-scale spatial variability. I am thinking of using adonis; however, I have a 2-way (1 random, 1 fixed) crossed design with one empty cell and some other cells not balanced (n between 4 and 6). I am afraid adonis is not appropriate for that?! – Pierre Jan 29 '14 at 19:23

The correctness of averaging distances depends on what you are doing with those distances. In some applications you may expect that they really are distances, that is, that they satisfy certain metric properties and have a defined relation to the original data. Combined dissimilarities may not satisfy these requirements.

This issue is related to the controversy of partial-Mantel-type analysis of dissimilarities vs. analysis of rectangular data, which is really hot (and I mean red hot) in studies of beta diversity. We in vegan provide tools for both, but I think that in most cases analysis of rectangular data is more robust and more powerful. By rectangular data I mean the normal sampling units x species matrix. The preferred dissimilarity-based methods in vegan map dissimilarities onto rectangular form. These methods in vegan include db-RDA (capscale), permutational MANOVA (adonis) and analysis of within-group dispersion (betadisper). Methods working with dissimilarities as such include mantel, anosim, mrpp and meandist.
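
As a sketch of this rectangular-data route (reusing the site-level bio.mean and env objects built in the question's code; adapt to your own data):

## db-RDA and permutational MANOVA on the rectangular site-level data
## (bio.mean and env come from the question's OPTION 1 code)
mod <- capscale(sqrt(bio.mean) ~ Habitat, data = env, distance = "bray")
anova(mod)   ## permutation test of the Habitat effect
adonis(sqrt(bio.mean) ~ Habitat, data = env, method = "bray")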

The mean of dissimilarities or distances usually has no clear correspondence to the original rectangular data; that is, the mean of the dissimilarities does not correspond to the mean of the data. I think that in general it is better to average or otherwise handle the data and then get dissimilarities from the transformed data.
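
A tiny made-up example of that point: two sites whose mean assemblages are identical can still have a clearly non-zero mean replicate-to-replicate dissimilarity.

library(vegan)
## two replicates per site, two species (invented numbers)
repl <- matrix(c(10,  0,   ## site A, replicate 1
                  0, 10,   ## site A, replicate 2
                  5,  5,   ## site B, replicate 1
                  5,  5),  ## site B, replicate 2
               ncol = 2, byrow = TRUE)
site <- rep(c("A", "B"), each = 2)
## Bray-Curtis between site means: both sites average to (5, 5), so the distance is 0
means <- apply(repl, 2, tapply, site, mean)
vegdist(means, "bray")
## mean of the between-site replicate dissimilarities: 0.5, not 0
d <- as.matrix(vegdist(repl, "bray"))
mean(d[site == "A", site == "B"])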

If you want to combine dissimilarities, the analogue::fuse() style approach is the most practical. However, you should understand that fuse() also scales the dissimilarity matrices to equal maxima. If you have dissimilarity measures on the scale 0..1, this is usually a minor issue, unless one of the data sets is more homogeneous and has a lower maximum dissimilarity than the others. In fuse() they are all equalized, so it is not a simple averaging but an averaging after range-equalization. Moreover, you must remember that averaging dissimilarities usually destroys the geometry, and this will matter if you use analysis methods for rectangularized data (adonis, betadisper, capscale in vegan).

Finally about geometry of combining dissimilarities. Dissimilarity indices in scale 0..1 are fractions of type A/B. Two fractions can be added (and then divided to get the average) directly only if the denominators are equal. If you ignore this and directly average the fractions, then the result will not be equal to the same fraction from averaged data. This is what I mean with destroying geometry. Some open-scaled indices are not fractions and may be additive. Manhattan distances are additive. Euclidean distances are square roots of squared differences, and their squares are additive but not the distances directly.

I demonstrate these things by showing the effect of adding together two dissimilarities (and averaging would mean dividing the result by two, or by suitable weights). I take the Barro Colorado Island data of vegan and divide it into two subsets of slightly unequal sizes. A geometry preserving addition of distances of subsets of the data will give the same result as the analysis of the complete data:

library(vegan) ## data and vegdist
library(analogue) ## fuse
data(BCI)
dim(BCI) ## [1]  50 225
x1 <- BCI[, 1:100]
x2 <- BCI[, 101:225]
## Bray-Curtis and fuse: not additive
plot(vegdist(BCI), fuse(vegdist(x1), vegdist(x2), weights = c(100/225, 125/225)))
## summing distances is straightforward (they are vectors), but preserving
## their attributes and keeping the dissimilarities needs fuse or some trick
## like below, where we make the dist structure dtmp to be replaced with the result
dtmp <- dist(BCI) ## dist skeleton with attributes
dtmp[] <- dist(x1, "manhattan") + dist(x2, "manhattan")
## manhattans are additive and can be averaged
plot(dist(BCI, "manhattan"), dtmp)
## Fuse rescales dissimilarities and they are no more additive
dfuse <- fuse(dist(x1, "man"), dist(x2, "man"), weights=c(100/225, 125/225))
plot(dist(BCI, "manhattan"), dfuse)
## Euclidean distances are not additive
dtmp[] <- dist(x1) + dist(x2)
plot(dist(BCI), dtmp)
## ... but squared Euclidean distances are additive
dtmp[] <- sqrt(dist(x1)^2 + dist(x2)^2)
plot(dist(BCI), dtmp)
## dfuse would rescale squared Euclidean distances like Manhattan (not shown)

I only considered addition above, but if you cannot add, you cannot average. It is a matter of taste whether this is important. Brave people will average things that cannot be averaged, but some people are more timid and want to follow the rules. I would rather go with the second group.

Jari Oksanen
  • Thank you @Jari Oksanen for your demonstration. Since my distances among centroids are Euclidean, I should better use "Jari's formula": sqrt(dist.a^2 + dist.b^2), the formula that I should know from high school; I am not proud of that :-). – Pierre Jan 29 '14 at 18:51
  • Dear @Jari Oksanen, leaving aside sums of matrices, could you please confirm I get your point: you think it is better to 1/ average abundances over replicates and then compute the Bray-Curtis dissimilarity among pairs of sites, than to 2/ compute the dissimilarity among pairs of replicates and then compute the distances among site centroids. Because with 2/, we average the distances from replicates to their site centroids (or something like that), and this is geometrically not correct. – Pierre Jan 29 '14 at 18:52
  • Yes, I think it is better. However, read what Gavin Simpson wrote in his comment: "To be honest though, I'm not convinced I'd throw away all the replicate information. If you include Site as a variable (which will be constant within sites) you get the between site variance explained..." This concerns both averaging distances and averaging data. Depends what you want to do with your dissimilarities, but say, `adonis(dis ~ Site + ` will be geometrically correct for all dissimilarities, remove the centroid, and retain the information on replicate variation. – Jari Oksanen Jan 30 '14 at 11:33
  • Also note that although unbounded dissimilarities can be added up (unlike bounded), they are not generally good for analysing community composition. Normally it is better to use Bray-Curtis than Euclidean distances (no spaces to expand here, but check Beals's classic on mathematical elegance and ecological naivete to get the spirit). However, there is no need to average Bray-Curtis since you **can** average data. It is even better if you do not need to average the data, but you can do like @Gavin Simpson suggested and use clustering as a variable, and take that into account in permutations. – Jari Oksanen Jan 30 '14 at 11:44

I like the simplicity of this answer, but it only applies to adding 2 distance matrices:

d.fused = (w * d.x) + ((1 - w) * d.y)

so I wrote my own snippet to combine a list of multiple distance matrices (not just 2), using standard R packages:

# generate a list of distance matrices
x <- matrix(rnorm(100), nrow = 5)
y <- matrix(rnorm(100), nrow = 5)
z <- matrix(rnorm(100), nrow = 5)
dst_array <- list(dist(x), dist(y), dist(z))

# initialise the combined matrix with the first element of the list
dst <- dst_array[[1]]

# loop over the remaining elements, adding each to the running sum
for (jj in 2:length(dst_array)){
  dst <- dst + dst_array[[jj]]
}

You could also use a vector of the same length as dst_array to define scaling factors, changing the line inside the loop to

dst <- dst + my_scale[[jj]] * dst_array[[jj]]
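
For completeness, a sketch of the full weighted version (my_scale is a hypothetical weight vector; weights summing to 1 give a weighted average, as in the formula above):

my_scale <- c(0.5, 0.3, 0.2)  # hypothetical weights, summing to 1
dst <- my_scale[[1]] * dst_array[[1]]
for (jj in 2:length(dst_array)){
  dst <- dst + my_scale[[jj]] * dst_array[[jj]]
}

# the same in one call, using base R functional idioms
dst <- Reduce(`+`, Map(`*`, my_scale, dst_array))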
philshem