We are trying to clean up call center records in which the agent entered free text but no product was assigned to it. We would like to compare the free text against our product list and find the best-matching product in that list.
I have tried the stringdist package in R, and while I can get a matrix of results back, the distance values are not what I would expect.
Example free text:
"I was told by a salesperson that the foundation light contains a small amount of SPF. Is this true?"
PRODUCT NAMES:
FDN LGHT
FOUNDATION LIGHT
LIPSTICK
LIGHT LIPSTICK
I would expect "FOUNDATION LIGHT" to score the highest, with the remaining items ranked below it, and "LIPSTICK" to get no score at all because nothing in the text matches it.
Please note that if you think this could be done in another language I will gladly take any recommendation.
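To make the ranking I am after more concrete, here is a minimal sketch in Python using only the standard library (so not stringdist). It scores each product word by word against the free text instead of comparing the whole sentence to the whole product name. The function names, the 0.8 cutoff, and the averaging rule are all my own assumptions for illustration, not an established method.

```python
import re
from difflib import SequenceMatcher

# Assumed threshold: word similarities below this are treated as "no match".
CUTOFF = 0.8

def words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def word_similarity(a, b):
    """Similarity in [0, 1] between two words (1.0 means identical)."""
    return SequenceMatcher(None, a, b).ratio()

def score_product(product, text):
    """Average, over the product's words, of the best match found in the text."""
    product_words = words(product)
    text_words = words(text)
    if not product_words or not text_words:
        return 0.0
    total = 0.0
    for pw in product_words:
        best = max(word_similarity(pw, tw) for tw in text_words)
        if best >= CUTOFF:  # ignore weak, accidental overlaps
            total += best
    return total / len(product_words)

def rank_products(products, text):
    """Return (product, score) pairs sorted from best to worst match."""
    scored = [(p, score_product(p, text)) for p in products]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

text = ("I was told by a salesperson that the foundation light "
        "contains a small amount of SPF. Is this true?")
products = ["FDN LGHT", "FOUNDATION LIGHT", "LIPSTICK", "LIGHT LIPSTICK"]

for name, score in rank_products(products, text):
    print(f"{name}: {score:.2f}")
```

With this sketch "FOUNDATION LIGHT" ranks first and "LIPSTICK" scores zero, which is the behaviour I am looking for. An abbreviation like "FDN" is too far from "foundation" to pass the cutoff, so abbreviations would probably need a separate lookup table or a different similarity measure.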