I am looking for an implementation that determines, for all records in one (say, test) data frame, the minimum value of Gower's distance to any record in a second (say, training) data frame. The result is a vector with one element per row of test.
The data consist of unordered categorical attributes and can be generated, for example, like this:
set.seed(20130926L)
DIMS <- 12
CATS <- 2
create.data <- function(SPARSITY) {
  # draw a 0/1 indicator for each cell of a CATS^DIMS contingency table
  sparse.data <- rbinom(CATS ^ DIMS, 1, SPARSITY)
  sparse.array <- array(sparse.data, dim = rep(CATS, DIMS))
  sparse.table <- as.table(sparse.array)
  # keep one data frame row per non-empty cell; drop the count column
  sparse.df <- as.data.frame(sparse.table)
  sparse.df <- subset(sparse.df, Freq > 0, select = -Freq)
  sparse.df
}
data.train <- create.data(0.001)
data.test <- create.data(0.01)
head(data.train, 3)
## Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11 Var12
## 745 A A A B A B B B A B A A
## 1156 B B A A A A A B A A B A
## 1574 B A B A A B A A A B B A
summary(data.test)
## Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
## A:24 A:31 A:23 A:20 A:30 A:27 A:22 A:20 A:26 A:23
## B:24 B:17 B:25 B:28 B:18 B:21 B:26 B:28 B:22 B:25
## Var11 Var12
## A:24 A:22
## B:24 B:26
How do I find, for all rows in data.test, the row in data.train where Gower's distance is minimal (or at least the distance to that particular row)? The code below works, but already needs too much memory for 20 attributes or for more than 2 categories:
nrow(data.test)
## [1] 48
library(StatMatch, quietly=T, warn.conflicts=F)
apply(gower.dist(data.train, data.test), 2, min)
## [1] 0.3333 0.4167 0.2500 0.5000 0.3333 0.4167 0.2500 0.3333 0.2500 0.4167
## [11] 0.5000 0.3333 0.3333 0.3333 0.4167 0.4167 0.2500 0.4167 0.1667 0.3333
## [21] 0.4167 0.3333 0.4167 0.5000 0.3333 0.5000 0.5000 0.4167 0.3333 0.3333
## [31] 0.2500 0.4167 0.5000 0.4167 0.3333 0.5000 0.3333 0.4167 0.3333 0.3333
## [41] 0.5000 0.5833 0.5000 0.2500 0.3333 0.4167 0.3333 0.5000
The function cluster::daisy() also returns a full matrix of distances, so it has the same memory footprint.
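For completeness, the daisy() route would look roughly like this (a sketch only; daisy() computes all pairwise distances within a single data frame, so it makes the memory problem worse rather than better):
library(cluster)
n.train <- nrow(data.train)
# train and test have to be stacked first, then the relevant block extracted
d.all <- as.matrix(daisy(rbind(data.train, data.test), metric = "gower"))
# rows: training records, columns: test records
apply(d.all[seq_len(n.train), -seq_len(n.train)], 2, min)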
Similar: How to calculate Euclidean distance (and save only summaries) for large data frames. There, it is suggested to call the distance function several times for subsets of data.train. I can do that, but the computation time is still prohibitive.
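For reference, a chunked variant of that suggestion might look like the sketch below (my own sketch; the function name and the chunk size of 100 are arbitrary). It caps memory by keeping only running column-wise minima, but the total amount of computation stays the same:
min.dist.chunked <- function(train, test, chunk.size = 100L) {
  d.min <- rep(Inf, nrow(test))
  for (start in seq(1L, nrow(train), by = chunk.size)) {
    idx <- start:min(start + chunk.size - 1L, nrow(train))
    # distance matrix for one chunk only: length(idx) x nrow(test)
    # NB: safe here because all attributes are categorical; with numeric
    # columns the range normalization (rngs) would have to be fixed globally
    d <- gower.dist(train[idx, , drop = FALSE], test)
    d.min <- pmin(d.min, apply(d, 2, min))
  }
  d.min
}
min.dist.chunked(data.train, data.test)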
After all, the definition of Gower's distance permits a more efficient algorithm, perhaps a recursive divide-and-conquer approach that operates attribute by attribute and calls itself on subsets. Recall that Gower's distance is a (weighted) mean of attribute-wise distances, where the attribute-wise distance is defined
- for categorical attributes: 0 if equal, 1 otherwise
- for ordered attributes: 0 if equal, otherwise proportional to the rank distance
- for continuous attributes (not needed here): proportional to the absolute difference divided by the range of the attribute
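For purely categorical data with equal weights, this means the distance between two rows is simply the fraction of attributes in which they differ; a one-line sketch (the name gower.cat is mine):
gower.cat <- function(x, y) mean(x != y)
gower.cat(c("A", "A"), c("B", "A"))
## [1] 0.5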
The following is a simple demonstration where Gower's distance between (A, A) and all combinations of A and B is computed. Rows that differ in one attribute have a distance of 0.5; the row that differs in both attributes gets the maximal distance of 1.0:
(ex.train <- expand.grid(Var1=LETTERS[1:2], Var2=LETTERS[1:2]))
## Var1 Var2
## 1 A A
## 2 B A
## 3 A B
## 4 B B
ex.test <- ex.train[1, ]
gower.dist(ex.train, ex.test)
## [,1]
## [1,] 0.0
## [2,] 0.5
## [3,] 0.5
## [4,] 1.0
If both data.train and data.test are analyzed column-wise, a possible implementation might look like this (a rough sketch in R follows below):
- For all value levels v of the first column:
  - choose the subset of data.test where the first column has value v
  - choose the subset of data.train where the first column has value v
  - call the procedure recursively to obtain an upper bound for the minimum
  - choose the subset of data.train where the first column has value <> v
  - call the procedure recursively, using the previously obtained upper bound for early cut-off
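For illustration, here is a minimal, untested sketch of that idea for a single test row at a time, assuming purely categorical data and equal weights (so the distance is the number of mismatching columns divided by the number of columns). The grouping of test rows by value, as in the outline above, is omitted for brevity; the function min.mismatch and all names are hypothetical:
# hypothetical sketch of the column-wise branch-and-bound idea;
# returns the minimal number of mismatching columns between test row x
# and any row of the candidate matrix train
min.mismatch <- function(train, x, col = 1, acc = 0, best = Inf) {
  if (acc >= best) return(best)        # prune: cannot beat current bound
  if (nrow(train) == 0) return(best)   # no candidates left in this branch
  if (col > length(x)) return(acc)     # all columns processed; acc < best here
  same <- train[, col] == x[col]
  # explore the matching branch first, to lower the bound quickly
  best <- min.mismatch(train[same, , drop = FALSE], x, col + 1, acc, best)
  best <- min.mismatch(train[!same, , drop = FALSE], x, col + 1, acc + 1, best)
  best
}

train.mat <- as.matrix(data.train)  # factor columns become a character matrix
test.mat <- as.matrix(data.test)
p <- ncol(train.mat)
d.min <- apply(test.mat, 1, function(x) min.mismatch(train.mat, x) / p)
Whether this beats the vectorized gower.dist() in practice is unclear to me; in plain R, the recursion overhead may well dominate unless the pruning cuts off large parts of the search.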
Is there really no implementation around, or perhaps a paper that describes such an algorithm?