I need to automatically match product names (food). The problem is similar to "Fuzzy matching of product names":
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which are the relevant keywords. Consider for example three product names: "Lenovo T400", "Lenovo R400" and "New Lenovo T-400, Core 2 Duo".
The first two are ridiculously similar strings by any standard (OK, Soundex might help to distinguish the T and R in this case, but the names might as well be "400T" and "400R"). The first and the third are quite far from each other as strings, but are the same product.
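To make this concrete, here is a minimal sketch using `difflib.SequenceMatcher` from the Python standard library. The choice of measure and the helper name `similarity` are only for illustration; other generic string-similarity measures would likely rank these names the same way.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


t400 = "Lenovo T400"
r400 = "Lenovo R400"                        # a different product
t400_long = "New Lenovo T-400, Core 2 Duo"  # the same product as t400

# The different product scores higher than the same product.
print(f"T400 vs R400:      {similarity(t400, r400):.2f}")
print(f"T400 vs long T400: {similarity(t400, t400_long):.2f}")
```

So a plain similarity threshold would pair the T400 with the R400 before it pairs it with its own longer listing.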
Obviously, the matching algorithm cannot be 100% precise; my goal is to automatically match around 80% of the names with high confidence.
But there's a complication: my strings contain mistakes, because the files I want to search are the output of image recognition. The product titles in those files also have no spaces.
For example, I want to find the product name cookiesoreovarianta, and I have the strings:
cookiesoreovariantb (a different real product)
cookiesoreovariamtq (a different real product; "a" and "q" look similar in some fonts)
cookiesoreovariamta (the same product, just a recognition mistake)
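A small sketch of why plain edit distance does not separate these cases. The `levenshtein` function below is a textbook dynamic-programming implementation written here for illustration, not something taken from a particular library:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


query = "cookiesoreovarianta"
candidates = [
    "cookiesoreovariantb",  # different product
    "cookiesoreovariamtq",  # different product
    "cookiesoreovariamta",  # same product, recognition mistake
]
distances = {name: levenshtein(query, name) for name in candidates}
for name, dist in distances.items():
    print(name, dist)
```

The misread copy of the right product and a genuinely different product both end up one edit away from the query, so the distance alone cannot tell them apart.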
I do not have a full database of canonical names.
How should I approach this? Any ideas?