This is the problem:
- I have two columns in my metadata database: "field name" and "field description".
- I need to check whether the "field description" is actually a description and not just some transformation of the "field name".
- [Edit] I need to avoid preprocessing the text to remove separators, as I would have to consider a long list of cases (e.g. _-;$%/^| etc.)
Examples:
row | field_name | field_description |
---|---|---|
1 | my_first_field | my first field |
2 | my_second_field | my------second------field |
3 | my_third_field | this is a description about the field, the description can contain the name of the field itself |
Here, the descriptions in rows 1 and 2 are just variations of the field name (and therefore wrong), while the description in row 3 is correct.
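To make the expected outcome concrete, here is a minimal sketch of the classification I am after for the three rows above. It is only an illustration under my own assumptions, not a settled solution: the helper names `alnum_key` and `is_renamed_field` are hypothetical, and the check does normalize the text, although it relies on `str.isalnum()` instead of an enumerated list of separators.

```python
def alnum_key(text: str) -> str:
    """Lowercase the text and keep only letters and digits (hypothetical helper)."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def is_renamed_field(field_name: str, field_description: str) -> bool:
    """Return True when the description has exactly the same letters and digits as the name."""
    return alnum_key(field_name) == alnum_key(field_description)


# The three example rows from the table above.
rows = [
    ("my_first_field", "my first field"),
    ("my_second_field", "my------second------field"),
    ("my_third_field", "this is a description about the field, "
                       "the description can contain the name of the field itself"),
]

for name, description in rows:
    # True means "the description is only a transformation of the name".
    print(name, is_renamed_field(name, description))
```

An exact comparison like this only catches pure separator changes. It would miss descriptions that reorder, abbreviate, or slightly extend the name, which is why I am looking at similarity scores instead.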
I have tried implementations based on Levenshtein distance, difflib, cosine similarity, and spaCy, but none of them was robust on my examples (they gave only around 50% similarity for the 1st example).
Some of the implementations I tried to use:
- https://towardsdatascience.com/surprisingly-effective-way-to-name-matching-in-python-1a67328e670e
- https://spacy.io/usage/linguistic-features#vectors-similarity
- https://docs.python.org/3/library/difflib.html
- is there a way to check similarity between two full sentences in python?
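For reference, a character-level check in the spirit of the difflib link above can be sketched as follows. This is a simplified sketch rather than the exact code I ran, and the function name `char_similarity` is hypothetical. `SequenceMatcher.ratio()` is computed as twice the number of matched characters divided by the combined length of both strings, so every separator character in the description counts against the score.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] using difflib (hypothetical helper)."""
    # The first argument is the "junk" predicate; None means no character is ignored.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


pairs = [
    ("my_first_field", "my first field"),
    ("my_second_field", "my------second------field"),
]

for name, description in pairs:
    print(f"{name!r} vs {description!r}: {char_similarity(name, description):.4f}")
```

Because long runs of separators inflate the combined length, the second pair scores lower than the first even though both descriptions are just the field name with different separators.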
[Edit]
I have just tried the HuggingFace semantic-textual-similarity implementation, with nice results:
field_name | field_description | Score |
---|---|---|
my_field_name | my_field_name | 1.0000 |
second_field_name | second field name | 0.8483 |
third_field_name | third-field-name | 0.8717 |
fourth_field_name | this is a correct description field | 0.4591 |
fifth_field_name | fifth_-------field_//////////////name | 0.8454 |