
This is the problem:

  • I have two columns in my metadata database: "field name" and "field description"
  • I need to check if the "field description" is actually a description and not some sort of transformation of the "field name"
  • [Edit] I need to avoid preprocessing the text to remove separators, as I would have to consider a long list of cases (e.g. _-;$%/^| etc.)

Examples:

row  field_name       field_description
1    my_first_field   my first field
2    my_second_field  my------second------field
3    my_third_field   this is a description about the field, the description can contain the name of the field itself

Here, examples 1 and 2 are just variations of the field name (and thus wrong), while example 3 is correct.

I have tried some implementations based on Levenshtein distance, difflib, cosine similarity and spaCy, but none of them was robust on my examples (giving a similarity of only around 50% for the 1st example).

Some of the implementations I tried to use:

[Edit]

I have just tried the implementation of HuggingFace semantic-textual-similarity with nice results.

field_name         field_description                      Score
my_field_name      my_field_name                          1.0000
second_field_name  second field name                      0.8483
third_field_name   third-field-name                       0.8717
fourth_field_name  this is a correct description field    0.4591
fifth_field_name   fifth_-------field_//////////////name  0.8454
emichester
1 Answer


For your examples, the Levenshtein edit distance would work very well. It can also be 'customized', or you could use some preprocessing depending on your data.
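As a rough illustration of the edit-distance idea, here is a minimal pure-Python sketch. The function names `levenshtein` and `similarity` and the normalization by the longer string's length are assumptions made for this example, not something prescribed above; a dedicated library would normally be faster.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 means identical, 0.0 means nothing shared."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# The first two rows from the question's example table.
d1 = levenshtein("my_first_field", "my first field")              # 2
d2 = levenshtein("my_second_field", "my------second------field")  # 12
s1 = similarity("my_first_field", "my first field")               # about 0.86
s2 = similarity("my_second_field", "my------second------field")   # 0.52
```

Long runs of separators inflate the raw distance, which is where a customized cost or some preprocessing would come in.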

But your text description of the problem makes me think that the real problem is likely much more complex, and maybe not even easy to define formally. It looks like you actually need a more semantic method, and this would probably require training a model with annotated data.

Erwan
  • Thanks for your response. I have tried Levenshtein distance, but the problem is: how do you fix a threshold with so many cases? (Preprocessing is not useful in this case, as the list of cases is too long.) I agree that training a model for this purpose might be the best option, but I wanted to look at other options first. By the way, I have just tried the implementation of HuggingFace [semantic-textual-similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html#semantic-textual-similarity) with nice results. – emichester Nov 08 '22 at 17:39
  • @emichester Deciding the threshold can be done in different ways, but the goal is always for the value to separate the cases as well as possible. If you want to do it really properly, take a sample of annotated instances and pick the value which maximizes accuracy on this sample (parameter tuning). – Erwan Nov 08 '22 at 22:53
  • Interesting approach, but which values should be the inputs of the optimizer? x1, x2 and x3 directly (x1: field_name, x2: field_description, x3: Levenshtein distance, y: annotation)? Or just a simple optimizer with x3 as the only input? – emichester Nov 10 '22 at 17:50
  • @emichester In this simplistic approach there's no need to keep the text values; you use only the numerical distance value. The idea is that the distance represents how much the texts differ. – Erwan Nov 10 '22 at 18:22
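The threshold tuning described in these comments can be sketched as follows. This is only an illustration under assumptions: `best_threshold` is a hypothetical helper, the scores are the five similarity values from the table in the question (a distance works the same way with the comparison reversed), and the labels are annotations inferred from the question (only the fourth row is a real description).

```python
def best_threshold(scores, labels):
    """Return (threshold, accuracy) for the rule: score >= threshold means 'copy of the name'.

    Candidates are the lowest score (everything flagged), the midpoints
    between consecutive distinct scores, and infinity (nothing flagged).
    The first candidate reaching the best accuracy is kept.
    """
    unique = sorted(set(scores))
    midpoints = [(low + high) / 2 for low, high in zip(unique, unique[1:])]
    candidates = [unique[0]] + midpoints + [float("inf")]

    best_value, best_accuracy = candidates[0], -1.0
    for candidate in candidates:
        predictions = [score >= candidate for score in scores]
        correct = sum(p == label for p, label in zip(predictions, labels))
        accuracy = correct / len(labels)
        if accuracy > best_accuracy:
            best_value, best_accuracy = candidate, accuracy
    return best_value, best_accuracy


# Similarity scores from the question's table; only the numbers are used.
scores = [1.0000, 0.8483, 0.8717, 0.4591, 0.8454]
# Assumed annotations: True = description is just a transformed field name.
is_copy = [True, True, True, False, True]

threshold, accuracy = best_threshold(scores, is_copy)
```

With only five annotated rows the chosen value is not meaningful; a larger annotated sample, ideally with a held-out part to check the threshold, would be needed in practice.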