5

I'm interested in possible ways to model the cosine similarity algorithm using Solr. I have items which are assigned a vector, for example:

items = [
  { id: 1, vector: [0,0,0,2,3,0,0] },
  { id: 2, vector: [0,1,0,1,5,0,0] },
  { id: 3, vector: [2,3,0,0,0,1,0] },
  { id: 4, vector: [1,2,4,6,5,0,0] }
]

And a search vector to which the others need to be ranked.

Currently, I'm modeling this in ruby by running over all the items and assigning them a rank against the input vector. Here's the implementation of cosine similarity I'm using:

module SimilarityCalculator

  def self.get_similarity(vector1, vector2)
    dp = dot_product(vector1, vector2)
    nm = normalize(vector1) * normalize(vector2)
    dp / nm
  end

  private

  def self.dot_product(vector1, vector2)
    sum = 0.0
    vector1.each_with_index { |val, i| sum += val * vector2[i] }
    sum
  end

  def self.normalize(vector)
    Math.sqrt(vector.inject(0.0) { |m,o| m += o**2 })
  end

end

Then, to get a ranked list I would do something like the following:

ranked = []
search_vector = [1,0,0,3,5,0,0]
items.each do |item|
  rank = SimilarityCalculator.get_similarity(search_vector, item.vector)
  { id: item.id, rank: rank }
end

I don't know enough about Solr to know how this would be modeled or even if it can, but I thought I'd throw it out there.

JC Grubbs
  • 39,191
  • 28
  • 66
  • 75

1 Answers1

1

Lucene already uses a cosine similarity model, so the real question is: can you map your vectors into Lucene's vectors? And can you remove the norming etc. which Lucene does that you don't want?

You can always write your own scoring and analysis functions, so if you're willing to do some coding, the answer is a clear "yes". This might require more work than you want though.

For an option which may get you part of the way but doesn't require any coding: translate each dimension into the word "dim_n" and repeat it (or boost it) however many times is the vector's magnitude in that dimension. For example:

[1,2,0,1] ==> "dim_1 dim_2 dim_2 dim_4"

If your vectors are all of approximately the same size and evenly distributed across dimensions, this might be a very good approximation.

If you tell us more about the problem (e.g. do you really need to give Lucene vectors as input or can you give it text?) we might be able to find better solutions.

Xodarap
  • 11,581
  • 11
  • 56
  • 94
  • I definitely need to give it vectors, although I could potentially see something like the text translation approach above working. I don't mind doing some coding but I'm on a ruby platform and really only have Solr at my disposal, so it needs to be something I can express through the Solr API. – JC Grubbs Feb 06 '12 at 21:32
  • Well, take a look at Lucene's [scoring formula](http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/search/Similarity.html) and decide if the text method will come close enough. Solr does let you [write function queries](http://wiki.apache.org/solr/FunctionQuery) without changing the source, which is another option. – Xodarap Feb 06 '12 at 22:50