dist function in r (stats) for clustering: Should I put my ID variable in row.names?

Question

I have a data frame with several numeric columns and a character ID column. When I pass the whole data frame to dist, it computes a distance matrix, but when I remove the ID column first and pass the remaining columns to dist, I do not get the same result.
1) Why does this happen?
2) How should the "ID" column be handled when clustering in R? Should I drop it, or should I put it in row.names?

PS I usually use tibbles and the tools in the tidyverse.

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Dec 17 '18 at 17:58
Yes, **obviously** you should *only* pass those columns to the functions that you want to use... — Has QUIT--Anony-Mousse, Dec 17 '18 at 20:20

Julius Vainora · Accepted Answer · 2018-12-18T09:44:45.273

It is not obvious what is going to happen when we pass a data frame containing factor/character variables to dist.

First, if it is a character vector holding numeric data, such as c("1", "2"), it will be coerced back to numeric and treated like any other numeric column. In that case, unless differences between IDs have a meaning, you should clearly not include this variable.
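
As a minimal sketch of this first case (the data frame and its column names are made up for illustration), the following compares dist on a data frame whose ID is stored as digit strings with the same data where the ID is numeric, and with the ID dropped:

```r
# Hypothetical data: IDs are digits stored as character strings.
df_chr <- data.frame(id = c("1", "2", "9"),
                     x  = c(0.5, 1.5, 2.5),
                     stringsAsFactors = FALSE)

# The same data with the ID stored as a number.
df_num <- data.frame(id = c(1, 2, 9), x = c(0.5, 1.5, 2.5))

d_chr  <- dist(df_chr)        # ID is converted to numbers without a warning
d_num  <- dist(df_num)        # ID is numeric from the start
d_noid <- dist(df_chr["x"])   # ID dropped

# TRUE: the character ID was used as an ordinary coordinate
all.equal(as.vector(d_chr), as.vector(d_num))

# Not TRUE: dropping the ID changes the distances
isTRUE(all.equal(as.vector(d_chr), as.vector(d_noid)))
```

No warning is raised here, which is why this case is easy to miss.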

Now consider what happens when the factor/character column is not of that type, that is, when its values cannot be read as numbers. In the C source code we find some important lines:

static double R_euclidean(double *x, int nr, int nc, int i1, int i2)
{
    double dev, dist;
    int count, j;

    count= 0;
    dist = 0;
    for(j = 0 ; j < nc ; j++) {
        if(both_non_NA(x[i1], x[i2])) {
            dev = (x[i1] - x[i2]);
            if(!ISNAN(dev)) {
                dist += dev * dev;
                count++;
            }
        }
        i1 += nr;
        i2 += nr;
    }
    if(count == 0) return NA_REAL;
    if(count != nc) dist /= ((double)count/nc);
    return sqrt(dist);
}

First (not in this function), factor/character variables get coerced into NA when dist tries to convert them to numeric values. (The warning message also says that.) As a result, as we see in the code of R_euclidean, we have some rescaling:

if(count != nc) dist /= ((double)count/nc);
return sqrt(dist);

where nc is the total number of columns and count is the number of columns in which both rows are non-NA (here, the number of numeric columns). We may verify this:

k <- 20
df <- data.frame(a = sample(letters, k, replace = TRUE), 
                 b = sample(letters, k, replace = TRUE), 
                 c = rnorm(k), d = rnorm(k))

max(abs(as.matrix(dist(df)) * sqrt(2 / ncol(df)) - as.matrix(dist(df[, 3:4]))))
# [1] 7.467696e-09

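That is, we compared the distance matrix of df with the rescaling undone (by multiplying by sqrt(2 / ncol(df))) against the distance matrix computed without the two factor variables. There are some small numerical differences, most likely because the numeric columns pass through a text representation when the mixed data frame is converted to a matrix, but the matrices are essentially the same.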
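
The rescaling can also be checked by hand on a tiny example. This is only a sketch: the matrix m is made up, and its all-NA third column stands in for a character column that could not be converted.

```r
# Two rows, three columns; the third column is entirely NA.
m <- rbind(c(0, 0, NA),
           c(3, 4, NA))

nc    <- ncol(m)   # total number of columns: 3
count <- 2         # columns in which both rows are non-NA

raw     <- (0 - 3)^2 + (0 - 4)^2      # sum of squared differences: 25
by_hand <- sqrt(raw / (count / nc))   # mirrors dist /= ((double)count/nc)

by_hand                      # sqrt(37.5), about 6.12
as.numeric(dist(m))          # same value as by_hand
as.numeric(dist(m[, 1:2]))   # 5, the distance without the NA column
```

The ratio of the two results is sqrt(nc / count), the same factor that the check above undoes with sqrt(2 / ncol(df)).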

Hence, this explains why the results differ. If you are going to use a single distance matrix for, say, clustering, leaving the factor/character columns in seems to be fine, since the overall scale shouldn't matter. However, in cases where scale matters, you should drop the factor/character columns first. (Whether you keep your ID variable as row names or as a separate vector doesn't matter and is up to you.)
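
One possible way to keep the IDs attached while excluding them from the distance computation is to move them into the row names before calling dist. The sketch below uses base R and a made-up data frame; the column names and the choice of hclust with two clusters are assumptions for illustration only.

```r
# Hypothetical data: a character ID plus two numeric features.
df <- data.frame(id = c("a", "b", "c", "d"),
                 x  = c(0, 1, 10, 11),
                 y  = c(0, 1, 10, 11),
                 stringsAsFactors = FALSE)

# Keep only the numeric columns and carry the IDs as row names.
num <- df[, c("x", "y")]
rownames(num) <- df$id

d  <- dist(num)    # distances use x and y only
hc <- hclust(d)    # the row names become the labels of the tree

groups <- cutree(hc, k = 2)   # named vector: cluster number per ID
groups
```

Because the labels travel with the distance object, the cluster assignments come back named by ID, so no separate join is needed afterwards.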

@Julius: Thanks, but my character ID is not a factor. So it gets converted to a factor? — xhr489, Dec 17 '18 at 19:21
@Julius: Sorry for the late reply. I think the problem was that my ID was character but could be coerced into numeric, because the values were numbers read as character, and no message was printed about this. A true character gets coerced into NA... It is pretty neat to read the source code. — xhr489, Dec 18 '18 at 09:06
@David, yes, you are right, numbers in character format get coerced back into numbers. — Julius Vainora, Dec 18 '18 at 09:42
