implementation of the Gower distance function

Question

I have a matrix of numbers (28 columns and 47 rows). The matrix has an extra row that contains a header for each column ("ordinal" or "nominal").

I want to use the Gower distance function on this matrix. Here it says that:

The final dissimilarity between the ith and jth units is obtained as a weighted sum of dissimilarities for each variable:

    d(i,j) = sum_k(delta_ijk * d_ijk ) / sum_k( delta_ijk )

In particular, d_ijk represents the distance between the ith and jth unit computed considering the kth variable. It depends on the nature of the variable:

factor or character columns are considered as categorical nominal variables and d_ijk = 0 if x_ik = x_jk, 1 otherwise;

ordered columns are considered as categorical ordinal variables and the values are substituted with the corresponding position index, r_ik, in the factor levels. These position indexes (that are different from the output of the R function rank) are transformed in the following manner

    z_ik = (r_ik - 1)/(max(r_ik) - 1)

These new values, z_ik, are treated as observations of an interval scaled variable.

As far as the weight delta_ijk is concerned:

delta_ijk = 0 if x_ik = NA or x_jk = NA;
delta_ijk = 1 in all the other cases.
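Read literally, the definition quoted above translates into a few lines of base R. The following is only a sketch under stated assumptions: `gower_manual` is a hypothetical name, every column is assumed to be either a factor (nominal) or an ordered factor (ordinal), and each ordinal column is assumed to have its lowest and highest levels observed, so that z_ik spans the whole interval from 0 to 1.

```r
# Hypothetical helper (not part of any package): Gower dissimilarity for a
# data frame whose columns are all factors (nominal) or ordered factors (ordinal).
gower_manual <- function(df) {
  n   <- nrow(df)
  num <- matrix(0, n, n)   # running sum over k of delta_ijk * d_ijk
  den <- matrix(0, n, n)   # running sum over k of delta_ijk
  for (k in seq_along(df)) {
    x <- df[[k]]
    if (is.ordered(x)) {
      r <- as.integer(x)                          # position index r_ik
      z <- (r - 1) / (max(r, na.rm = TRUE) - 1)   # z_ik, assumed to span [0, 1]
      d <- abs(outer(z, z, "-"))                  # treated as interval scaled
    } else {
      x <- as.character(x)
      d <- 1 * outer(x, x, "!=")                  # 0 if equal, 1 otherwise
    }
    delta <- !is.na(d)                            # FALSE when either value is NA
    num   <- num + ifelse(delta, d, 0)
    den   <- den + delta
  }
  num / den
}

# Toy data with one ordinal and two nominal columns
DF <- data.frame(A = ordered(1:3),
                 B = factor(c("A", "B", "A")),
                 C = factor(c(1, 2, 1)))
round(gower_manual(DF), 4)
```

For this toy data the off-diagonal entries work out to 5/6 (units 1 and 2, units 2 and 3) and 1/3 (units 1 and 3). A pair of units with no variable observed in common would give 0/0, which this sketch leaves as NaN.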

I know that there is a gower.dist function, but I must implement it myself using the formulas above. So I tried to write functions for "d_ijk", "delta_ijk" and "z_ik", as I didn't find a better way.

I started with "delta_ijk" and I tried this:

Delta=function(i,j){for (i in 1:28){for (j in 1:47){  
+{if (MyHeader[i,j]=="nominal")
+ result=0
+{else if (MyHeader[i,j]=="ordinal") result=1}}}}
+;result}

But I got an error, so I am stuck and can't do the rest.

P.S. Excuse me if I make mistakes, but English is not a language I use very often.

Can you repost your data as a zip or a tar.gz file? My Linux box won't open rar archives without me going to lengths to find out what application will open them. I'd be happy to take a look if you do so. — Gavin Simpson, Nov 29 '10 at 12:11

score 3 · Answer 1 · answered Nov 28 '10 at 22:54

Why do you want to reinvent the wheel, billyt? There are several functions/packages in R that will compute this for you, including daisy() in package cluster, which comes with R.

First things first though, get those "data type" headers out of your data. If this truly is a matrix then character information in this header row will make the whole matrix a character matrix. If it is a data frame, then all columns will likely be factors. What you want to do is code the type of data in each column (component of your data frame) as 'factor' or 'ordered'.

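## stringsAsFactors = TRUE is needed in R >= 4.0.0 to get the factors shown below
df <- data.frame(A = c("ordinal",1:3), B = c("nominal","A","B","A"),
                 C = c("nominal",1,2,1), stringsAsFactors = TRUE)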

Which gives this --- note that all are stored as factors because of the extra info.

> head(df)
        A       B       C
1 ordinal nominal nominal
2       1       A       1
3       2       B       2
4       3       A       1
> str(df)
'data.frame':   4 obs. of  3 variables:
 $ A: Factor w/ 4 levels "1","2","3","ordinal": 4 1 2 3
 $ B: Factor w/ 3 levels "A","B","nominal": 3 1 2 1
 $ C: Factor w/ 3 levels "1","2","nominal": 3 1 2 1

If we get rid of the first row and recode into the correct types, we can compute Gower's coefficient easily.

> headers <- df[1,]
> df <- df[-1,]
> DF <- transform(df, A = ordered(A), B = factor(B), C = factor(C))
> ## We've previously shown you how to do this (above line) for lots of columns!
> str(DF)
'data.frame':   3 obs. of  3 variables:
 $ A: Ord.factor w/ 3 levels "1"<"2"<"3": 1 2 3
 $ B: Factor w/ 2 levels "A","B": 1 2 1
 $ C: Factor w/ 2 levels "1","2": 1 2 1
> require(cluster)
> daisy(DF)
Dissimilarities :
          2         3
3 0.8333333          
4 0.3333333 0.8333333

Metric :  mixed ;  Types = O, N, N 
Number of objects : 3

Which gives the same as gower.dist() for this data, although in a slightly different format (as.matrix(daisy(DF)) would be equivalent):

> require(StatMatch)  ## gower.dist() is in package StatMatch
> gower.dist(DF)
          [,1]      [,2]      [,3]
[1,] 0.0000000 0.8333333 0.3333333
[2,] 0.8333333 0.0000000 0.8333333
[3,] 0.3333333 0.8333333 0.0000000
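As a sanity check, the two distinct values in these matrices can be reproduced by hand from the formula quoted in the question. This is a minimal sketch for this three-row example only; the variable names are purely illustrative.

```r
# Column A is ordinal with levels 1 < 2 < 3, so the position indexes are 1, 2, 3
r <- c(1, 2, 3)
z <- (r - 1) / (max(r) - 1)        # rescaled to 0.0, 0.5, 1.0

# Rows 2 and 3 of the original df: A = 1 vs 2, B = "A" vs "B", C = 1 vs 2
d_A  <- abs(z[1] - z[2])           # ordinal contribution: 0.5
d_B  <- 1                          # nominal values differ
d_C  <- 1                          # nominal values differ
d_23 <- (d_A + d_B + d_C) / 3      # all three weights delta are 1 (no NAs)
d_23                               # 0.8333333

# Rows 2 and 4: A = 1 vs 3, while B and C are identical
d_24 <- (abs(z[1] - z[3]) + 0 + 0) / 3
d_24                               # 0.3333333
```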

You say you can't do it this way? Can you explain why not? You seem to be going to some degree of effort to do something that other people have already coded up for you. This isn't homework, is it?

First of all, thanks Gavin and DWin for answering. I used the "gower.dist" function and I also used the "daisy()" function, although I didn't like the second because it made all the variables numeric, and I got different results with each function. I know that what I am trying to do is redo something that has already been done, but I can't do otherwise. It is for a research project and I must do it using R. — billyt, Nov 29 '10 at 09:46
A colleague of mine used Matlab for verification. Because Matlab has no ready-made Gower distance coefficient, she implemented it using the formulas I posted in the question. She believes her code is correct, because when we tested her code and the "gower.dist" function on a numeric matrix, we got exactly the same matrix. But with this data, where I have to recode the ordinal columns to ordered and the nominal ones to factor, we got different results. I uploaded the data and both my results and hers here ( http://www.mediafire.com/?nx25hzxcmvq998o ) so you can test it yourself. I am sorry for troubling you. — billyt, Nov 29 '10 at 09:47
@billyt: you couldn't be more **wrong** about daisy() converting your data to numerics. `daisy()` has been part of R for years and years and has had more eyes pore over it than your colleague's Matlab code. — Gavin Simpson, Nov 29 '10 at 12:06
@billyt: Gower's coefficient is for more than numerics. You can't test your colleague's Matlab code against `daisy()` in the numeric-only case and then assume that the Matlab code is correct when it gets different answers to `daisy()` with non-numeric data. Seriously, I've coded Gower's function in R for another package and I **know** daisy() is correct because it gave published answers on a small test data set. — Gavin Simpson, Nov 29 '10 at 12:08
@Gavin: Here is the data in .zip ( http://www.mediafire.com/?u8mux3lg47t61h1 ). For daisy(), I meant that although I had already converted the class of the data to "factor" and "ordered factor", it gave me different results from gower.dist, so I assumed that it calculated using the default class of the data. — billyt, Nov 30 '10 at 08:14
@billyt: Thanks for the download. Will take a look later today, and post something back here. — Gavin Simpson, Nov 30 '10 at 08:36
@Gavin: if this can help, what I want to do is find the distance matrix of this data. The variables are either ordinal or nominal, so the best solution seemed to be the Gower distance. I don't know if there is a better coefficient for this type of data. — billyt, Dec 02 '10 at 08:34

score 0 · Answer 2 · answered Nov 28 '10 at 20:38

I'm not sure what your logic is doing, but you are putting too many "{" in there for your own good. I generally use the {} pairs to surround the consequent-clause:

Delta=function(i,j){
  for (i in 1:28) {
    for (j in 1:47) {
      if (MyHeader[i,j]=="nominal") {
        result=0
      # the "{" in the next line before else was sabotaging your efforts
      } else if (MyHeader[i,j]=="ordinal") {
        result=1
      }
    }
  }
  # return the value once both loops have finished
  result
}
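Note that the definition quoted in the question makes delta_ijk depend on missing values rather than on the column type. A minimal sketch of that rule, assuming the data sit in a data frame and using a hypothetical helper name `delta_ijk`:

```r
# Hypothetical helper: weight for units i and j on variable k of data frame x.
# Returns 0 if either value is NA and 1 in all other cases.
delta_ijk <- function(x, i, j, k) {
  if (is.na(x[i, k]) || is.na(x[j, k])) 0 else 1
}

# Toy data with one missing nominal value
x <- data.frame(a = factor(c("u", NA, "v")),
                b = ordered(c(1, 2, 3)))
delta_ijk(x, 1, 2, 1)   # 0, because x[2, "a"] is NA
delta_ijk(x, 1, 3, 1)   # 1, both values are present
```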

score 0 · Answer 3 · answered Dec 02 '10 at 19:18

Thanks Gavin and DWin for your help. I managed to solve the problem and find the right distance matrix. I used daisy() after I recoded the class of the data and it worked.

P.S. The solution that you suggested in my other question for changing the class of the columns:

DF$nominal <- as.factor(DF$nominal)
DF$ordinal <- as.ordered(DF$ordinal)

didn't work. It changed only the first nominal and ordinal column.
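`DF$nominal` addresses a single column by name, so those two lines can only ever recode one column each. A possible way to recode many columns at once is sketched below. It assumes the column types are kept in a separate character vector (here called `types`, a hypothetical name, one entry per column) and that the ordinal columns hold numeric codes.

```r
# Toy data read in as character, plus a hypothetical vector of column types
DF <- data.frame(v1 = c("1", "2", "3"),
                 v2 = c("A", "B", "A"),
                 v3 = c("1", "2", "1"),
                 stringsAsFactors = FALSE)
types <- c("ordinal", "nominal", "nominal")

ord <- types == "ordinal"
# Ordinal columns: convert to numbers first so the levels sort numerically
DF[ord]  <- lapply(DF[ord], function(col) ordered(as.numeric(col)))
# Nominal columns: plain (unordered) factors
DF[!ord] <- lapply(DF[!ord], factor)
str(DF)
```

The recoded data frame can then be passed to daisy() or gower.dist() as in the answer above.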

Thanks again for your help.
