It is not obvious what happens when we pass a data frame containing factor/character variables to dist.
First, if a character variable holds numeric data, such as c("1", "2"), it is coerced back to numeric data. In that case, unless differences between IDs have a meaning, you should clearly not include this variable.
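A small check of this case, using made-up data (the names df_id and d_id are only for illustration): the column id holds digits stored as text, so dist ends up treating it like any other numeric column.

```r
# Made-up example: `id` is character but contains only digits.
df_id <- data.frame(id = c("1", "2", "4"),
                    x = c(0, 0, 0),
                    stringsAsFactors = FALSE)

# `x` is constant, so every distance comes from the differences between the IDs.
d_id <- dist(df_id)
d_id
```

Here the pairwise distances are just the absolute differences between the IDs, which is rarely what you want.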
Now consider what happens with a factor or character variable that is not of this type. In the C source code we find some important lines:
static double R_euclidean(double *x, int nr, int nc, int i1, int i2)
{
    double dev, dist;
    int count, j;
    count= 0;
    dist = 0;
    for(j = 0 ; j < nc ; j++) {
        if(both_non_NA(x[i1], x[i2])) {
            dev = (x[i1] - x[i2]);
            if(!ISNAN(dev)) {
                dist += dev * dev;
                count++;
            }
        }
        i1 += nr;
        i2 += nr;
    }
    if(count == 0) return NA_REAL;
    if(count != nc) dist /= ((double)count/nc);
    return sqrt(dist);
}
First (not in this function), factor/character values that cannot be read as numbers are coerced to NA when dist converts the data to numeric. (The warning message says as much.) As a result, as we see in the code of R_euclidean, some rescaling takes place:
if(count != nc) dist /= ((double)count/nc);
return sqrt(dist);
where nc is the total number of columns and count is the number of columns in which both rows have non-missing values, which here is the number of numeric columns. We may verify this:
k <- 20
df <- data.frame(a = sample(letters, k, replace = TRUE),
                 b = sample(letters, k, replace = TRUE),
                 c = rnorm(k), d = rnorm(k))
max(abs(as.matrix(dist(df)) * sqrt(2 / ncol(df)) - as.matrix(dist(df[, 3:4]))))
# [1] 7.467696e-09
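That is, we compared the distance matrix of df with the rescaling undone (by multiplying by sqrt(2 / ncol(df))) against the distance matrix computed without the two factor variables. There are some small numerical differences, but the matrices are essentially the same. The differences probably arise because as.matrix converts the numeric columns of a mixed data frame to text with format, which keeps only about 7 significant digits by default.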
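To make the rescaling concrete, here is a minimal R sketch of the same logic for a single pair of rows. The helper name euclid_na is made up for illustration and is not part of R; it only mirrors the C function quoted earlier, with two NA columns standing in for the coerced factor/character columns.

```r
# Hypothetical helper mirroring R_euclidean for two numeric vectors u and v.
euclid_na <- function(u, v) {
  dev <- u - v
  ok <- !is.na(dev)                 # drops NA/NaN coordinates, like the two checks in C
  count <- sum(ok)
  if (count == 0) return(NA_real_)
  d <- sum(dev[ok]^2)
  nc <- length(u)
  if (count != nc) d <- d / (count / nc)  # scale up for the skipped columns
  sqrt(d)
}

# Two NA columns play the role of the factor/character columns.
u <- c(NA, NA, 0, 0)
v <- c(NA, NA, 3, 4)

euclid_na(u, v)                 # sqrt(25 * 4 / 2) = sqrt(50)
as.numeric(dist(rbind(u, v)))   # should agree with the line above
euclid_na(u[3:4], v[3:4])       # 5, the distance on the numeric columns alone
```

The ratio between the first and the last result is sqrt(4 / 2), that is sqrt(nc / count), which is the factor removed in the comparison above.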
Hence, this explains why the results differ. If you are going to use a single distance matrix for, say, clustering, leaving the factor/character columns in seems to be fine, since a common scale factor shouldn't matter. However, in cases where scale matters, you should drop the factor/character columns first. (Whether to use your ID variable as row names or as a separate vector doesn't matter and is up to you.)
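One way to drop the non-numeric columns before calling dist is sketched below. The data frame and the names df_small, num and d_num are made up for this example; adapt the column selection to your own data.

```r
# Small illustrative data frame (values made up for this example).
df_small <- data.frame(id = c("a", "b", "c"),
                       x = c(0, 3, 0),
                       y = c(0, 4, 1),
                       stringsAsFactors = FALSE)

# Keep only the numeric columns before computing distances.
num <- df_small[vapply(df_small, is.numeric, logical(1))]

# Optional: carry the IDs along as row names so they label the result.
rownames(num) <- as.character(df_small$id)

d_num <- dist(num)   # no coercion warning, no rescaling
d_num
```

Because every remaining column is numeric, count equals nc in R_euclidean and the distances are on their natural scale.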