Similarity matching algorithm

Question

I have products with different details in different attributes and I need to develop an algorithm to find the most similar to the one I'm trying to find.

For example, if a product has:

Weight: 100lb
Color: Black, Brown, White
Height: 10in
Conditions: new

Others can have different colors, weight, etc. Then I need to do a search where the most similar return first. For example, if everything matches but the color is only Black and White but not Brown, it's a better match than another product that is only Black but not White or Brown.

I'm open to suggestions as the project is just starting.

One approach, for example, I could do is restrict each attribute (weight, color, size) a limited set of option, so I can build a binary representation. So I have something like this for each product:

Colors       Weight    Height    Condition
00011011000  10110110  10001100  01

Then if I do an XOR between the product's binary representation and my search, I can calculate the number of set bits to see how similar they are (all zeros would mean exact match).

The problem with this approach is that I cannot index that on a database, so I would need to read all the products to make the comparison.

Any suggestions on how I can approach this? Ideally I would like to have something I can index on a database so it's fast to query.

Further question: also if I could use different weights for each attribute, it would be awesome.

not sure if this site is the best for that question because is very broad. Maybe http://programmers.stackexchange.com/ will be more apropiated — Juan Carlos Oropeza, Jan 08 '16 at 16:48
I'd suggest looking at the topic of "Unsupervised machine learning." This kind of problem can be addressed by *clustering* or *cosine similarity*, for instance. — rajah9, Jan 08 '16 at 16:54
I don't understand your description of the color matching rule. — , Jan 08 '16 at 16:57
Is there a reason not to handle the numerical values as non-numerical ? — , Jan 08 '16 at 16:58
@YvesDaoust No, I just put an example. Numbers can be numbers :) — Diego Jancic, Jan 08 '16 at 17:00
@JuanCarlosOropeza when referring other sites, it is often helpful to point that [cross-posting is frowned upon](http://meta.stackexchange.com/tags/cross-posting/info) — gnat, Jan 08 '16 at 18:35
@gnat he didn't say I cross-posted, he is saying that it would be more appropriate the other site as it's too broad to be here (not sure what site it should be in, but that's what's he's saying) — Diego Jancic, Jan 08 '16 at 20:53
@DiegoJancic Did you figure out an approach to this problem? I would be interested in knowing how you solved this. — Mohamed Sohail, Jun 18 '22 at 08:47
@MohamedSohail I don't remember to be honest. I believe I went with ElasticSearch/Lucene, it's great for that kind of searches — Diego Jancic, Jun 18 '22 at 13:18

score 2 · Answer 1 · answered Jan 08 '16 at 19:10

You basically need to come up with a distance metric to define the distance between two objects. Calculate the distance from the object in question to each other object, then you can either sort by minimum distance or just select the best.

Without some highly specialized algorithm based on the full data set, the best you can do is a linear time distance comparison with every other item.

You can estimate the nearest by keeping sorted lists of certain fields such as Height and Weight and cap the distance at a threshold (like in broad phase collision detection), then limit full distance comparisons to only those items that meet the thresholds.

score 1 · Accepted Answer · answered Jan 08 '16 at 16:56

1

What you want to do is a perfect use case for elasticsearch and other similar search oriented databases. I don't think you need to hack with bitmasks/etc.

You would typically maintain your primary data in your existing database (sql/cassandra/mongo/etc..anything works), and copy things that need searching to elasticsearch.

answered Jan 08 '16 at 16:56

Yash Gupta

578
6
15

Thanks. I though about this, I already use Lucene. I never did that kind of search (I tend to do an AND to the terms), but there's probably a way to do that kind of search. – Diego Jancic Jan 08 '16 at 17:01
elastic search does have functionality to give weights to partial matches, but I don't know if it manages to do it in sub-linear time. With some specialized indexing data structure, it is possible to do nearest neighbour lookups in log time (space partitioning trees come to mind). – Yash Gupta Jan 08 '16 at 17:12
1

Thanks @YashGupta, I found this other answer very interesting based on what you mentioned: http://stackoverflow.com/questions/5751114/nearest-neighbors-in-high-dimensional-data – Diego Jancic Jan 08 '16 at 17:38
1

Yes, that is exactly what I meant. You can also look at "self organizing maps" for a vaguely approximate, but fast algorithm. – Yash Gupta Jan 08 '16 at 22:11

valdem · Answer 3 · 2016-01-10T13:26:42.597

What are you talking about very similar to BK-trees. BK-tree constructs search tree with some metric associated with keys of this tree. Most common use of this tree is string corrections with Levenshtein or Damerau-Levenshtein distances. This is not static data structure, so it supports future insertions of elements. When you search exact element (or insert element), you need to look through nodes of this tree and go to links with weight equal to distance between key of this node and your element. if you want to find similar objects, you need to go to several nodes simultaneously that supports your wishes of constrains of distances. (Maybe it can be even done with A* to fast finding one most similar object).

Simple example of BK-tree (from the second link)

          BOOK
         /    \
        /(1)   \(4)
       /        \
    BOOKS      CAKE
      /       /    \
     /(2)    /(1)   \(2)
    /        |      |
  BOO      CAPE    CART

Your metric should be Hamming distance (count of differences between bit representations of two objects).

BUT! is it good to compare two integers as count of different bits in their representation? With Hamming distance HD(10000, 00000) == HD(10000, 10001). I.e. difference between numbers 16 and 0, and 16 and 17 is equal. Is it really what you need?

BK-tree with details: https://hamberg.no/erlend/posts/2012-01-17-BK-trees.html https://nullwords.wordpress.com/2013/03/13/the-bk-tree-a-data-structure-for-spell-checking/

Similarity matching algorithm

3 Answers3