How to import a distance matrix for clustering in R

Question

I have a text file containing 200 models, all compared to each other, with a molecular distance for each pair of models. It looks like this:

1    2    1.2323
1    3    6.4862
1    4    4.4789
1    5    3.6476 
.
.

All the way down to 200, where the first number is the first model, the second number is the second model, and the third number the corresponding molecular distance when these two models are compared.

I can't think of a way to import this into R and create a nice 200x200 matrix to perform some clustering analyses on. I am still new to Stack and R, but thanks in advance!

score 0 · Answer 1 · answered Jun 14 '17 at 10:44

Since your question does not say what format your file is in, I will assume the most general one, i.e., CSV.

In that case, look at the functions for reading files, such as read.csv or fread.

Example code:

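# The sample in the question has no header row, so name the columns explicitly
dt <- read.csv(file, sep = "", header = FALSE, col.names = c("col1", "col2", "distance"))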

I suggest using the data.table package. Then:

library(data.table)
setDT(dt)
dt[, id := paste0(as.character(col1), "-", as.character(col2))]

This creates a new variable out of the first and the second model and serves as a unique id.

What I then do is remove this id and scale the numerical input. After scaling, run the clustering algorithms.

Merge the result with the id to analyse your results.

Is that what you are looking for?

KenHBS · Accepted Answer · 2017-06-14T11:56:20.307

Since you don't have the distance between model1 and itself, you would need to insert that yourself, using the answer from this question:

(you can ignore the wrong numbering of the models compared to your input data, it doesn't serve a purpose, really)

# Create some dummy data that has the same shape as your data:
n_models    <- 200
df          <- expand.grid(model1 = 1:n_models, model2 = 2:n_models)
df$distance <- runif(n = nrow(df), min = 1, max = 10)
head(df)
# model1 model2 distance
# 1        2    7.958746
# 2        2    1.083700
# 3        2    9.211113
# 4        2    5.544380
# 5        2    5.498215
# 6        2    1.520450

# Positions after which a zero (the distance of a model to itself) is inserted
inds <- seq(0, n_models * (n_models - 1), by = n_models)
val  <- c(df$distance, rep(0, length(inds)))

# Sort the zeros into place so that each one ends up on the diagonal
inds <- c(seq_along(df$distance), inds + 0.5)
val  <- val[order(inds)]

Once that's in place, you can use matrix() with ncol and nrow to "reshape" your vector of distances in the appropriate way:

matrix(val, ncol = n_models, nrow = n_models)
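
Once a full symmetric matrix exists, it can be passed on to the clustering functions. The following is only a minimal sketch: the 4x4 matrix m is made up for illustration, and average linkage and k = 2 are arbitrary choices, not something the question prescribes.

```r
# Hypothetical 4x4 symmetric distance matrix with a zero diagonal
m <- matrix(c(0, 1, 6, 7,
              1, 0, 5, 8,
              6, 5, 0, 2,
              7, 8, 2, 0), nrow = 4, byrow = TRUE)

d   <- as.dist(m)                     # "dist" object, built from the lower triangle
hc  <- hclust(d, method = "average")  # hierarchical clustering on the distances
grp <- cutree(hc, k = 2)              # assign each model to one of 2 groups
```

With the real data, the 200x200 matrix would take the place of m; plot(hc) draws the dendrogram.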

Edit:

When your data only contains the distance for one direction, so only between e.g. model1 - model5 and not model5 - model1, you will have to fill one triangular part of a matrix and mirror it, like they do here. Forget about the data I generated in the first part of this answer. Also, forget about adding the zeros to your distance column. R fills matrices column by column, and your file is ordered 1-2, 1-3, ..., 1-200, 2-3, ..., so the values line up with the lower triangle:

dist_mat                      <- matrix(0, nrow = 200, ncol = 200)  # zero diagonal
dist_mat[lower.tri(dist_mat)] <- your_data$distance

To copy the lower-triangular entries to above the diagonal, use:

dist_mat[upper.tri(dist_mat)] <- t(dist_mat)[upper.tri(dist_mat)]
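
If the row order of the file is not guaranteed, an alternative is to place every value by its own (model1, model2) pair using matrix indexing. This is a sketch under assumptions: the data frame long and its column names are hypothetical stand-ins for the imported file, and n is kept at 5 so the result is easy to inspect.

```r
# Hypothetical stand-in for the imported file: one row per pair (i, j) with i < j.
# With a real file this could come from something like
# read.table("distances.txt", col.names = c("model1", "model2", "distance")),
# where the file name is an assumption.
n    <- 5
prs  <- t(combn(n, 2))
long <- data.frame(model1   = prs[, 1],
                   model2   = prs[, 2],
                   distance = seq_len(nrow(prs)))  # dummy distances 1, 2, 3, ...

dist_mat <- matrix(0, nrow = n, ncol = n)  # zero diagonal
idx      <- cbind(long$model1, long$model2)

dist_mat[idx]        <- long$distance  # fill entry (i, j)
dist_mat[idx[, 2:1]] <- long$distance  # mirror into entry (j, i)
```

Because each value is addressed by its row and column, this does not depend on how the lines of the file are sorted.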

I tried this and it works, but the problem is that for the matrix, both (for instance) 56-112 and 112-56 should be in the matrix, even though they are the same number. But all the duplicates are not in the file so the matrix generated does not seem to be correct :( — Matthijs van Kesteren, Jun 14 '17 at 11:36
@MatthijsvanKesteren I edited my answer. The part after "Edit" is now only relevant for your case — KenHBS, Jun 14 '17 at 11:57
