What mechanism can be used to quantify similarity between non-numeric lists?

Question

I have a database of recipes which is essentially structured as a list of ingredients and their associated quantities. If you are given a recipe how would you identify similar recipes allowing for variations and omissions? For example using milk instead of water, or honey instead of sugar or entirely omitting something for flavour.

The current strategy is to do multiple inner joins for combinations of the main ingredients but this is can be exceedingly slow with a large database. Is there another way to do this? Something to the equivalent of perceptual hashing would be ideal!

score 0 · Answer 1 · answered Jun 16 '17 at 18:23

How about cosine similarity?

This technique is commonly used in Machine Learning for text recognition as a similarity measure. With it, you can calculate the distance between two texts (actually, between any two vectors) which can be interpreted as how much are those texts alike (the closer, the more alike).

Take a look at this great question that explains cosine similarity in a simple way. In general, you could use any similarity measure to obtain a distance to compare your recipe. This article talks about different similarity measures, you can check it out if you wish to know more.

What mechanism can be used to quantify similarity between non-numeric lists?

1 Answers1