
I am trying to compute the Jaccard similarity between each pair of names in two large vectors of names (see the small example below) and to store the results in a matrix. My function just returns NULL. What am I doing wrong?

library(dplyr)

df = data.frame(matrix(NA, ncol=3, nrow=3))
df = df %>%
    mutate_if(is.logical, as.numeric)

names(df) = c("A.J. Doyle", "A.J. Graham", "A.J. Porter")
draft_names = names(df) 
row.names(df) = c("A.J. Feeley", "A.J. McCarron", "Aaron Brooks")
quarterback_names = row.names(df)

library(stringdist)

jaccard_similarity = function(d){
  for (i in 1:nrow(d)){
    for(j in 1:ncol(d)){
      d[i,j] = stringdist(quarterback_names[i], draft_names[j], method ='jaccard', q=2)
    }
  }
}

df = jaccard_similarity(df)
Altamash Rafiq
  • I would try looking at if quarterback_names and draft_names have the input you gave them. I am not sure, but `names(df) = c("A.J. Doyle", "A.J. Graham", "A.J. Porter")` may have an error. – bala83 Mar 26 '18 at 19:36
  • There is no error that I can detect. Everything above the for-loop is doing exactly what you would expect. – Altamash Rafiq Mar 26 '18 at 19:40
  • You should use the `stringdistmatrix` function: `stringdistmatrix(quarterback_names, draft_names, method = "jaccard", q = 2)`. – Scarabee Mar 26 '18 at 20:42

3 Answers


You are not returning anything after the for loops. Use return(d) at the end of the function.
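To see why the original function yields NULL: an R function returns the value of its last evaluated expression, and a `for` loop itself evaluates to NULL. The following is a minimal sketch in base R; the helper names `double_no_return` and `double_with_return` are hypothetical and only serve as illustration.

```r
# Hypothetical helper: the loop is the last expression, so the function returns NULL.
double_no_return <- function(x) {
  for (i in seq_along(x)) {
    x[i] <- x[i] * 2   # modifies only the local copy of x
  }
}

# Hypothetical helper: same loop, but the modified copy is returned explicitly.
double_with_return <- function(x) {
  for (i in seq_along(x)) {
    x[i] <- x[i] * 2
  }
  return(x)
}

is.null(double_no_return(c(1, 2, 3)))   # TRUE
double_with_return(c(1, 2, 3))          # 2 4 6
```

Because R passes arguments by value, the changes made inside the loop are lost unless the function hands the modified object back.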

This problem is also a classic use case for outer:

outer(quarterback_names,draft_names,FUN=stringdist,method="jaccard",q=2)
          [,1]      [,2]      [,3]
[1,] 0.6428571 0.7500000 0.7500000
[2,] 0.7647059 0.7777778 0.7777778
[3,] 1.0000000 1.0000000 1.0000000
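The matrix above carries no row or column labels. As a sketch of one way to get a labeled result, assuming the `stringdist` package is installed, the `stringdistmatrix` function suggested in the comments can be combined with `dimnames`; the variable name `sim` is an assumption made for this example.

```r
library(stringdist)

quarterback_names <- c("A.J. Feeley", "A.J. McCarron", "Aaron Brooks")
draft_names <- c("A.J. Doyle", "A.J. Graham", "A.J. Porter")

# Pairwise Jaccard distances on character bigrams (q = 2):
# one row per quarterback name, one column per draft name.
sim <- stringdistmatrix(quarterback_names, draft_names,
                        method = "jaccard", q = 2)

# Label the rows and columns with the names that were compared.
dimnames(sim) <- list(quarterback_names, draft_names)
sim
```

Note that `stringdist` reports a Jaccard distance, so 0 means identical bigram sets and 1 means no shared bigrams; `1 - sim` would give the corresponding similarity.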
James

You need to return your changed dataframe:

jaccard_similarity = function(d){
  for (i in 1:nrow(d)){
    for(j in 1:ncol(d)){
      d[i,j] = stringdist(quarterback_names[i], draft_names[j], method ='jaccard', q=2)
    }
  }
  return(d)
  # ^^^ this return was missing
}


Afterwards jaccard_similarity(df) yields
              A.J. Doyle A.J. Graham A.J. Porter
A.J. Feeley    0.6428571   0.7500000   0.7500000
A.J. McCarron  0.7647059   0.7777778   0.7777778
Aaron Brooks   1.0000000   1.0000000   1.0000000
Jan

Reason: there is no explicit return.

Reference

You can add a print call inside the loop to trace each value as it is computed:

jaccard_similarity = function(d){
  for (i in 1:nrow(d)){
    for(j in 1:ncol(d)){
      d[i,j] = stringdist(quarterback_names[i], draft_names[j], method ='jaccard', q=2)
      print(d[i,j])
    }
  }
  return(d)
}

Output:

[1] 0.6428571
[1] 0.75
[1] 0.75
[1] 0.7647059
[1] 0.7777778
[1] 0.7777778
[1] 1
[1] 1
[1] 1

You can simply call jaccard_similarity(df) to get the values.

output <- jaccard_similarity(df)

              A.J. Doyle A.J. Graham A.J. Porter
A.J. Feeley    0.6428571   0.7500000   0.7500000
A.J. McCarron  0.7647059   0.7777778   0.7777778
Aaron Brooks   1.0000000   1.0000000   1.0000000

Assign the output to a new variable rather than overwriting the existing df.

Morse