
I am looking for an implementation that determines, for each record in one (say, test) data frame, the minimum Gower's distance to any record in a second (say, training) data frame. The result is a vector with one element for each row in test.

The data consist of unordered categorical attributes and can be generated, for example, like this:

set.seed(20130926L)
DIMS <- 12
CATS <- 2

create.data <- function(SPARSITY) {
  sparse.data <- rbinom(CATS ^ DIMS, 1, SPARSITY)
  sparse.array <- array(sparse.data, dim=rep(CATS, DIMS))
  sparse.table <- as.table(sparse.array)
  sparse.df <- as.data.frame(sparse.table)
  sparse.df <- subset(sparse.df, Freq > 0, select=-Freq)
  sparse.df
}

data.train <- create.data(0.001)
data.test <- create.data(0.01)

head(data.train, 3)

##      Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11 Var12
## 745     A    A    A    B    A    B    B    B    A     B     A     A
## 1156    B    B    A    A    A    A    A    B    A     A     B     A
## 1574    B    A    B    A    A    B    A    A    A     B     B     A

summary(data.test)

##  Var1   Var2   Var3   Var4   Var5   Var6   Var7   Var8   Var9   Var10 
##  A:24   A:31   A:23   A:20   A:30   A:27   A:22   A:20   A:26   A:23  
##  B:24   B:17   B:25   B:28   B:18   B:21   B:26   B:28   B:22   B:25  
##  Var11  Var12 
##  A:24   A:22  
##  B:24   B:26

How do I find, for all rows in data.test, the row in data.train where Gower's distance is minimal (or at least the distance to that particular row)? The code below works, but already needs too much memory for 20 attributes or for more than 2 categories:

nrow(data.test)

## [1] 48

library(StatMatch, quietly=TRUE, warn.conflicts=FALSE)
apply(gower.dist(data.train, data.test), 2, min)

##  [1] 0.3333 0.4167 0.2500 0.5000 0.3333 0.4167 0.2500 0.3333 0.2500 0.4167
## [11] 0.5000 0.3333 0.3333 0.3333 0.4167 0.4167 0.2500 0.4167 0.1667 0.3333
## [21] 0.4167 0.3333 0.4167 0.5000 0.3333 0.5000 0.5000 0.4167 0.3333 0.3333
## [31] 0.2500 0.4167 0.5000 0.4167 0.3333 0.5000 0.3333 0.4167 0.3333 0.3333
## [41] 0.5000 0.5833 0.5000 0.2500 0.3333 0.4167 0.3333 0.5000

The function cluster::daisy() also returns a matrix of distances.

Similar: How to calculate Euclidean distance (and save only summaries) for large data frames. There, it is suggested to call the distance function several times for subsets of data.train. I can do that, but the computation time is still prohibitive.
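For reference, the chunking idea might be sketched like this (the helper name `chunked.min` and the chunk size are mine; the distance function is passed in so the same helper works with `gower.dist`). It keeps a running per-test-row minimum with `pmin`, so only one block-sized distance matrix exists at a time:

```r
# Sketch: process 'train' in blocks of rows and keep a running minimum,
# so only one chunk.size x nrow(test) distance matrix is in memory at once.
chunked.min <- function(train, test, dist.fun, chunk.size = 1000) {
  mins <- rep(Inf, nrow(test))
  for (start in seq(1, nrow(train), by = chunk.size)) {
    rows <- start:min(start + chunk.size - 1, nrow(train))
    d <- dist.fun(train[rows, , drop = FALSE], test)  # block x test matrix
    mins <- pmin(mins, apply(d, 2, min))
  }
  mins
}

# e.g.: chunked.min(data.train, data.test, gower.dist, chunk.size = 500)
```

This caps memory usage, but the total number of pairwise distances computed is unchanged.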

In principle, the definition of Gower's distance permits a more efficient algorithm, perhaps a recursive divide-and-conquer approach that operates attribute by attribute and calls itself on subsets. Recall that Gower's distance is a (weighted) sum of attribute-wise distances, where the attribute-wise distance is defined

  • for categorical attributes: 0 if equal, 1 otherwise
  • for ordered attributes: 0 if equal, proportional to rank distance otherwise
  • for continuous attributes (not needed here): proportional to ratio of distance and range of the attribute
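For purely categorical data, these rules collapse to a simple mismatch fraction; a minimal base-R sketch (the helper name is mine, equal weights assumed):

```r
# Gower's distance between two categorical vectors: the fraction of
# attributes on which they disagree (equal attribute weights).
gower.cat <- function(x, y) mean(x != y)

gower.cat(c("A", "A"), c("A", "B"))  # differs on one of two attributes: 0.5
```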

The following is a simple demonstration where Gower's distance between (A, A) and all combinations of A and B is computed. Rows that differ on one attribute have a distance of 0.5; the row that differs on both attributes gets the maximal distance of 1.0:

(ex.train <- expand.grid(Var1=LETTERS[1:2], Var2=LETTERS[1:2]))

##   Var1 Var2
## 1    A    A
## 2    B    A
## 3    A    B
## 4    B    B

ex.test <- ex.train[1, ]
gower.dist(ex.train, ex.test)

##      [,1]
## [1,]  0.0
## [2,]  0.5
## [3,]  0.5
## [4,]  1.0

If both data.train and data.test are analyzed column-wise, a possible implementation might look like this:

  1. For each value level v of the first column:
    1. choose the subset of data.test where the first column has value v
    2. choose the subset of data.train where the first column has value v
    3. call the procedure recursively to obtain an upper bound for the minimum
    4. choose the subset of data.train where the first column has a value other than v
    5. call the procedure recursively, using the previously obtained upper bound for early cut-off
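A hypothetical base-R sketch of such a recursion (the names `min.mismatch` and `min.gower.dist` are made up; it counts mismatching attributes and divides by the attribute count at the end, exploring the matching branch first so the bound tightens early):

```r
# Branch-and-bound sketch: 'acc' counts mismatches accumulated so far for
# the current train subset, 'bound' is the best (smallest) count found yet.
min.mismatch <- function(train, x, col = 1, acc = 0, bound = Inf) {
  if (nrow(train) == 0 || acc >= bound) return(bound)  # early cut-off
  if (col > length(x)) return(min(bound, acc))         # leaf: acc mismatches
  match <- train[[col]] == x[[col]]
  # matching branch first: it cannot increase 'acc', so it is the most
  # promising for lowering the bound
  bound <- min.mismatch(train[match, , drop = FALSE], x, col + 1, acc, bound)
  min.mismatch(train[!match, , drop = FALSE], x, col + 1, acc + 1, bound)
}

min.gower.dist <- function(train, test) {
  sapply(seq_len(nrow(test)),
         function(i) min.mismatch(train, test[i, ]) / ncol(test))
}
```

This is only a sketch of the idea, not a tuned implementation; whether it beats the full distance matrix in practice will depend on how sparse the data are.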

Is there really no implementation around, or perhaps a paper that describes such an algorithm?

krlmlr
  • You might find this helpful: http://stackoverflow.com/a/16670220/1201032 – flodel Sep 29 '13 at 22:44
  • This looks to be a request for a package or method recommendation. Where is the evidence that a "recursive approach" would be helpful? Can you show that the `gower.dist` function is intrinsically inefficient? You still need to calculate distances for all N x M pairings. I do think you have used it incorrectly, though. Shouldn't it be just: `ddat <- gower.dist(data.train, data.test); which(ddat==min(ddat), arr.ind=TRUE)` ? (It's already designed to do the "apply" operation itself.) – IRTFM Sep 30 '13 at 01:26
  • In the context of R, I also think this question looks like a tool request; there shouldn't be much improvement possible except by finding another implementation of `gower.dist`, I guess. Maybe the OP can tag it with [tag:algorithm], describe what gower distance actually is and how `gower.dist` works and is inefficient (or "permits a more efficient algorithm"), and get some help from there. – Frank Sep 30 '13 at 02:18
  • @flodel: I'm looking for a vector that contains, for each `test` row, the closest `train` row. The question you link to is about minimum overall distance. – krlmlr Sep 30 '13 at 07:25
  • @DWin: Indeed it's necessary to iterate over the entire `train` data, and also through the entire `test` data. However, I believe that not all pairs `test` X `train` need to be considered if you are looking for a *minimum* distance. If this question doesn't have a satisfactory answer, I might come up with an implementation and a description. – krlmlr Sep 30 '13 at 07:29
  • @Frank: Thank you. Tagged and added some detail. – krlmlr Sep 30 '13 at 07:43
  • @krlmlr, read that link again. It is finding the nearest neighbor for each point. The only difference I suppose is that it uses euclidean distance, but maybe the idea of a K-d tree can be leveraged here. – flodel Sep 30 '13 at 11:36
  • To help with our answers, you should provide your data dimensions and `gower.dist` (or tell us how the distance is defined). – flodel Sep 30 '13 at 11:39
  • @flodel: I have provided a toy example that shows `gower.dist` in action. Besides that, the code runs, as is, on my machine if current R and `StatMatch` are installed. It provides everything necessary: a data generator, a (slow) implementation that produces the result as desired, and even parameters to tweak the process. What else is there to add? – krlmlr Sep 30 '13 at 12:21

2 Answers


I'm not familiar with Gower's distance, but from what you describe, it appears that, for unordered categorical attributes, Gower's distance is equivalent to the Hamming distance divided by the length of the vector. In other words, the Gower distance between vectors x and y is simply mean(x!=y). In this situation, you can save a significant amount of computation time and memory by never allocating the entire distance matrix, and instead using colSums on one test row at a time. Here is an example, with three levels and 10000 training rows:

> set.seed(123)
> train.rows<-10000
> test.rows<-100
> cols<-20
> levels<-c("a","b","c")
> train.set<-sample(levels,train.rows*cols,T)
> dim(train.set)<-c(train.rows,cols)
> test.set<-sample(levels,test.rows*cols,T)
> dim(test.set)<-c(test.rows,cols)
> system.time(gdist<-apply(gower.dist(train.set,test.set),2,min))   
   user  system elapsed 
 13.396   0.324  13.745 
> system.time(hdist<-apply(test.set,1,function(x) min(colSums(x!=t(train.set))/cols)))
   user  system elapsed 
  0.492   0.008   0.504 
> identical(hdist,gdist)
[1] TRUE

If the data are not all discrete and unordered, then the formula for Gower's distance is different, but I suspect that there is a similar way to compute this more efficiently without computing the entire distance matrix via gower.dist.
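For instance, a column-wise sketch for mixed data might look like this (hypothetical names; note that as a simplification I take each numeric attribute's range from the training data only, whereas Gower's definition would use the pooled data):

```r
# Sketch: per test row, accumulate per-attribute Gower distances over all
# training rows at once, so no full N x M distance matrix is allocated.
min.gower.mixed <- function(train, test) {
  ranges <- sapply(train, function(col)
    if (is.numeric(col)) diff(range(col)) else NA)
  sapply(seq_len(nrow(test)), function(i) {
    d <- numeric(nrow(train))
    for (j in seq_along(train)) {
      d <- d + if (is.numeric(train[[j]]))
        abs(train[[j]] - test[i, j]) / ranges[j]  # scaled numeric distance
      else
        as.numeric(train[[j]] != test[i, j])      # categorical mismatch
    }
    min(d) / ncol(train)
  })
}
```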

Update: this can be made more efficient by using @Frank's suggestion, and generating t(train.set) upfront rather than within the function:

require(microbenchmark)
ttrain.set<-t(train.set)
microbenchmark(
  a=apply(test.set,1,function(x) min(colSums(x!=t(train.set))/cols)),
  b=apply(test.set,1,function(x) min(colSums(x!=ttrain.set)/cols)))

## Unit: milliseconds
##  expr      min       lq   median       uq      max neval
##     a 523.3781 533.2950 589.0048 620.4411 725.0183   100
##     b 367.5428 371.6004 396.7590 408.9804 496.4001   100
mrip
  • This might be faster, but still computes distances for all data pairs, right? – krlmlr Oct 01 '13 at 06:04
  • Yes, it computes all N*M pairs, though it avoids allocating the distance matrix, which should help with the memory issue you mentioned in the OP. This (or something similar) is likely the best you are going to be able to do without writing external C code and implementing a complicated algorithm using some kind of tree data structure (for example http://www.vldb.org/conf/2003/papers/S19P02.pdf http://www.cse.msu.edu/~pramanik/research/papers/papers/tois.10.pdf). Finding nearest neighbors efficiently is an open research problem, and it is harder in discrete spaces than continuous ones. – mrip Oct 01 '13 at 11:07
  • Thanks a lot for the links. This looks a *slight* bit more difficult than I expected :-) – krlmlr Oct 01 '13 at 13:01
  • +1 Nice. If you only care about `which.min`, instead of the actual distance, there's no need to divide by the length; the sum should be enough, right? Also, would it be faster to just make a transposed copy of the training data instead of doing that operation within the function? – Frank Oct 03 '13 at 02:01
  • @Frank Yes and yes. I also could have just used `colMeans`. – mrip Oct 05 '13 at 18:56

I had this as part of my comment, but it's really a candidate as an answer, unless I missed the point of the question: Shouldn't it be just:

 ddat <- gower.dist(data.train, data.test)
 which(ddat==min(ddat), arr.ind=TRUE)

#     row col
#[1,]   3  19

? (It's already designed to do the "apply" operation itself.)

If the goal is to get the minimum distance to a particular row in 'data.test', then it would be even faster and take less space. I'm still not figuring out why this is causing memory difficulties. And is the goal to find the minimum distances, or to find which row is the minimum for each data.test row?

IRTFM
  • Should be for each row, since the OP said "for all rows in data.test, the row in data.training that has the minimum Gower distance (or at least the distance to that particular row)". – Frank Sep 30 '13 at 02:15
  • The memory problems arise because several matrices of size `data.train` x `data.test` are created. – krlmlr Sep 30 '13 at 07:44