What robust algorithm implementation can I use to perform phrase similarity with two inputs?

Question

This is the problem:

I have two columns in my metadata database: "field name" and "field description".
I need to check whether the "field description" is actually a description and not just some transformation of the "field name".
[Edit] I need to avoid preprocessing the text to remove separators, as I would have to consider a long list of cases (e.g. _-;$%/^| etc.)

Examples:

row	field_name	field_description
1	my_first_field	my first field
2	my_second_field	my------second------field
3	my_third_field	this is a description about the field, the description can contain the name of the field itself

Here the 1st and 2nd examples are too similar to the field name (thus wrong), and the 3rd is correct.

I have tried some implementations based on Levenshtein distance, difflib, cosine similarity and spaCy, but none of them was robust on my examples (they gave a similarity score of only around 50% for the 1st example).

Some of the implementations I tried to use:

[Edit]

I have just tried the implementation of HuggingFace semantic-textual-similarity with nice results.

field_name	field_description	Score
my_field_name	my_field_name	1.0000
second_field_name	second field name	0.8483
third_field_name	third-field-name	0.8717
fourth_field_name	this is a correct description field	0.4591
fifth_field_name	fifth_-------field_//////////////name	0.8454

score 2 · Answer 1 · answered Nov 08 '22 at 15:46


For your examples, the Levenshtein edit distance would work very well. It can also be 'customized', or you could use some preprocessing depending on your data.
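As a rough illustration of that suggestion, here is a minimal pure-Python sketch of the Levenshtein distance, scaled into a similarity score between 0 and 1. The function names `levenshtein` and `similarity`, and the choice to normalize by the longer string's length, are assumptions made for this example and are not taken from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Row 1 of the question: two underscores replaced by spaces, so distance 2.
print(similarity("my_first_field", "my first field"))  # 1 - 2/14, about 0.857
# Row 2: long separator runs add insertions, so the score drops to about 0.52.
print(similarity("my_second_field", "my------second------field"))
```

The second call shows why the plain distance may need to be customized: every extra separator character counts as one edit, so repeated separators pull the score down even though the words are unchanged.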

But your text description of the problem makes me think that the real problem is likely much more complex, and maybe not even easy to define formally. It looks like you actually need a more semantic method, and this would probably require training a model with annotated data.

answered Nov 08 '22 at 15:46

Erwan

Thanks for your response. I have tried Levenshtein distance, but the problem is: how do you fix a threshold with so many cases? (Preprocessing is not useful in this case, as the list of cases is too long.) I agree that training a model for such a purpose may be the best option, but I wanted to look for other options first. By the way, I have just tried the implementation of HuggingFace [semantic-textual-similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html#semantic-textual-similarity) with nice results. – emichester Nov 08 '22 at 17:39

@emichester Deciding the threshold can be done in different ways, but the goal is always for the value to separate the cases as well as possible. If you want to do it really properly, take a sample of annotated instances and pick the value which maximizes accuracy on this sample (parameter tuning). – Erwan Nov 08 '22 at 22:53
Interesting approach, but which values should be the inputs of the optimizer? x1, x2 and x3 directly (x1: field_name, x2: field_description, x3: Levenshtein distance, y: annotation)? Or just a simple optimizer with x3 as input? – emichester Nov 10 '22 at 17:50

@emichester In this simplistic approach there's no need to keep the text values; you use only the numerical distance value. The idea is that the distance represents how much the texts differ. – Erwan Nov 10 '22 at 18:22
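The threshold tuning discussed in these comments could be sketched roughly as follows. The helper name `best_threshold` is hypothetical, and the rule "flag the description as a disguised field name when the score is at or above the threshold" is an assumption for this example. The scores are the five values from the sentence-similarity table in the question; the labels are an assumed annotation in which only the fourth row is a genuine description.

```python
import math


def best_threshold(scores, labels):
    """Pick the cutoff that maximizes accuracy on an annotated sample.

    scores: one similarity value per (field_name, field_description) pair.
    labels: True when the description is just a transformed field name.
    A pair is predicted True when its score >= threshold.
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equally long")
    # Only the observed scores (plus "flag nothing") can change the predictions.
    candidates = sorted(set(scores)) + [math.inf]
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum((s >= t) == y for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:  # strict ">" keeps the lowest threshold on ties
            best_t, best_acc = t, acc
    return best_t, best_acc


# Scores from the question's table; labels are an assumed annotation.
scores = [1.0000, 0.8483, 0.8717, 0.4591, 0.8454]
labels = [True, True, True, False, True]

threshold, accuracy = best_threshold(scores, labels)
print(threshold, accuracy)  # 0.8454 1.0
```

Only the numerical score and the annotation are used, as the comment above says. With so few annotated rows the chosen value is not reliable; a larger sample would be needed in practice.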
