
I've got similar product data in both the products_a array and products_b array:

products_a = [{color: "White", size: "2' 3\""}, {color: "Blue", size: "5' 8\""} ]
products_b = [{color: "Black", size: "2' 3\""}, {color: "Sky blue", size: "5' 8\""} ]

I would like to be able to accurately tell similarity between the colors in the two arrays, with a score between 0 and 1. For example, comparing "Blue" against "Sky blue" should be scored near 1.00 (probably like 0.78 or similar).

Spacy Similarity

I tried using spacy to solve this:

import spacy
nlp = spacy.load('en_core_web_sm')

def similarityscore(text1, text2):
    doc1 = nlp(text1)
    doc2 = nlp(text2)
    return doc1.similarity(doc2)

Yeah, well when passing in "Blue" against "Sky blue" it scores 0.6545742918773636. OK, but what happens when passing in "White" against "Black"? The score is 0.8176945362451089... spaCy is saying "White" and "Black" are ~81% similar! That's a failure when you're trying to make sure product colors are not similar.
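Part of the reason: spaCy's `similarity()` is essentially cosine similarity between vectors, and words that appear in similar contexts (like two colour names) get nearby vectors regardless of opposite meanings. A toy illustration with invented 3-d vectors (the numbers are made up, not real spaCy embeddings):

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Invented "embeddings": both colour words point in a similar direction
# because they occur in similar contexts ("a ___ shirt", "painted ___").
white = [0.9, 0.5, 0.1]
black = [0.8, 0.6, 0.2]
print(cosine_similarity(white, black))  # high, despite opposite meanings
```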

Jaccard Similarity

I tried Jaccard Similarity on "White" against "Black" using this and got a score of 0.0 (maybe overkill on single words, but it leaves room for larger corpora later):

# remove punctuation and lowercase all words function
def simplify_text(text):
    for punctuation in ['.', ',', '!', '?', '"']:
        text = text.replace(punctuation, '')
    return text.lower()

# Jaccard function
def jaccardSimilarity(text_a, text_b):
    word_set_a, word_set_b = [set(simplify_text(text).split())
                              for text in [text_a, text_b]]
    num_shared = len(word_set_a & word_set_b)
    num_total = len(word_set_a | word_set_b)
    jaccard = num_shared / num_total
    return jaccard
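As a quick sanity check, here is how the Jaccard approach scores the pairs from the question (functions restated in full so the snippet runs on its own):

```python
# remove punctuation and lowercase all words
def simplify_text(text):
    for punctuation in ['.', ',', '!', '?', '"']:
        text = text.replace(punctuation, '')
    return text.lower()

def jaccardSimilarity(text_a, text_b):
    word_set_a, word_set_b = [set(simplify_text(text).split())
                              for text in [text_a, text_b]]
    return len(word_set_a & word_set_b) / len(word_set_a | word_set_b)

print(jaccardSimilarity("White", "Black"))    # 0.0 -- no shared words
print(jaccardSimilarity("Blue", "Sky blue"))  # 0.5 -- one of two words shared
```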

Getting scores as different as 0.0 and 0.8176945362451089 on "White" against "Black" is not acceptable to me, and even taking the mean of the two would not be accurate. I'm still looking for a more accurate way of solving this. Please let me know if you have any better ways.

rom
  • The problem with computing similarities using word embeddings (as with spacy) is that words which are contextually similar or relate to the same concept have nearby embeddings. Both "black" and "white" are colours, so they score as highly similar. – Shivam Patel Oct 03 '21 at 04:48
  • The Sequence-based similarity algorithms from [textdistance](https://pypi.org/project/textdistance/) might be worth a try. – Agnij Oct 03 '21 at 04:52
  • What you might be looking for here is some sort of 'word colour embedding', which takes in a word and has an embedding in the colour space. Then computing similarities would actually make much more sense. You might want to have a look at https://opensource.com/article/17/9/color-naming-word-embeddings – Shivam Patel Oct 03 '21 at 04:52
  • @Agnij The textdistance metrics are based on the characters in the string, not on meaning, and are thus completely unrelated to this question. – polm23 Oct 05 '21 at 02:28
  • I'm kind of thinking of running all the text similarity functions provided in textdistance, then doing 100 "matching events" tests and checking to see which function performed the best. Choosing 100 because it's nice and easy to discern a percentage from that... like if `Ratcliff-Obershelp` REALLY showed it got it right x amount of times, then x% it is. What do you think? – rom Oct 05 '21 at 02:49
  • @polm23 Yes, but in this context, since all **colours have distinct names**, this measure is sure to give a **low similarity** when comparing different colours. Agreed that for similar colour names - (bi-gram+unigram) **'sky blue'** vs **'blue'** - it might perform slightly off, but again it comes down to **experimentation** with all the distance metrics and the OP's **threshold/tolerance**. – Agnij Oct 05 '21 at 02:49
  • @rom No matter how many metrics you use, textdistance will never figure out that "salmon" is the same as "pink". It's the wrong tool. – polm23 Oct 05 '21 at 02:51
  • @polm23 ok, I guess the "safest" way to avoid errors is to throw an Array in front of it, which contains some sort of "decoder" mapping... like `{ "white": [ "White", "Off white"] }` before it gets to the moment requiring throwing it into the obstacle course of text similarity algorithms. Yeah, this is turning into a classification problem like you said.. – rom Oct 05 '21 at 02:53
  • Text distance should only be considered if you already know which colours may come in; otherwise there's no point spending time on it (an exotic colour will practically render text distance pointless). – Agnij
  • Can you get the color code instead of the color name? Or you could map the color name to a color code using a dictionary, then use the average of the differences between the RGB values in the color codes. – Balaji Oct 08 '21 at 13:29

5 Answers


I found some methods that might be helpful. I am new to programming, so I don't really know how to apply them to your data set, but I still wanted to share.

from difflib import SequenceMatcher
#https://towardsdatascience.com/sequencematcher-in-python-6b1e6f3915fc

s1 = "blue"
s2 = "sky blue"
sim = SequenceMatcher(None, s1, s2).ratio()
print("Similarity between two strings is: " + str(sim) )

This code says the similarity between the two strings is 0.6666666666666666. I tried the same code for "black" and "white": it reports a similarity of 0.0.

Note: I think sklearn's Affinity Propagation and Levenshtein distance might also be helpful, but I don't know how to apply them to your question.
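Since Levenshtein distance is mentioned without an implementation, here is a minimal pure-Python sketch; the normalization into a 0-1 score by the longer string's length is my own assumption, not part of this answer:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(a, b):
    """Scale the distance into a 0-1 score by the longer string's length."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein_similarity("blue", "sky blue"))  # 0.5
print(levenshtein_similarity("black", "white"))    # 0.0
```

Like SequenceMatcher, though, this is purely character-based: it compares spellings, not meanings, so "salmon" and "pink" would still score as dissimilar.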

tezenko

You want to first convert the color name to hex, and then compare the two hex values. Do not compare strings!

import math
from difflib import SequenceMatcher
from matplotlib import colors
COLOR_NAMES = list(colors.CSS4_COLORS.keys()) #choose any color module you want and get a list of all colors

def hexFromColorName(name):
    name = name.lower() #matplotlib names are lowercase
    closest_match = [0, ""]
    for colorname in COLOR_NAMES:
        sim = SequenceMatcher(None, name, colorname).ratio()
        #print("Similarity between two strings is: " + str(sim) )
        if sim > closest_match[0]:
            closest_match = sim, colorname

    #use matplotlib's color conversion dictionary to get hex values
    return colors.CSS4_COLORS[closest_match[1]]
    

def compareRGB(color1, color2):
    color1, color2 = color1[1:], color2[1:] #trim the # from hex color codes

    #convert from hex string to decimal tuple
    color1 = (int(color1[:2], base=16), int(color1[2:4], base=16), int(color1[4:], base=16))  
    color2 = (int(color2[:2], base=16), int(color2[2:4], base=16), int(color2[4:], base=16))

    #standard euclidean distance between two points in space
    dist =  math.sqrt(
                        math.pow((color1[0]-color2[0]), 2) +
                        math.pow((color1[1]-color2[1]), 2) +
                        math.pow((color1[2]-color2[2]), 2) 
                     )/255/math.sqrt(3)      
    if dist > 1: dist = 1
    return 1 - dist
>>> compareRGB(hexFromColorName('dark green'),hexFromColorName('green'))
0.9366046763242764
>>> compareRGB(hexFromColorName('Light Blue'),hexFromColorName('Black'))
0.18527897735531407
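The same distance idea works without the matplotlib dependency if you supply your own name-to-hex mapping. A self-contained sketch with a tiny hand-rolled dictionary (the colour table below is mine, purely illustrative):

```python
import math

# Illustrative name-to-hex table; swap in any colour dictionary you like.
NAMED_COLORS = {
    'black':    '#000000',
    'white':    '#ffffff',
    'blue':     '#0000ff',
    'sky blue': '#87ceeb',
}

def hex_to_rgb(hex_code):
    """'#87ceeb' -> (135, 206, 235)"""
    hex_code = hex_code.lstrip('#')
    return tuple(int(hex_code[i:i + 2], base=16) for i in (0, 2, 4))

def compare_rgb(name1, name2):
    """1.0 for identical colours, 0.0 for opposite corners of RGB space."""
    rgb1 = hex_to_rgb(NAMED_COLORS[name1.lower()])
    rgb2 = hex_to_rgb(NAMED_COLORS[name2.lower()])
    # Euclidean distance, normalized by the space diagonal (Python 3.8+).
    dist = math.dist(rgb1, rgb2) / (255 * math.sqrt(3))
    return 1 - dist

print(compare_rgb('White', 'Black'))    # 0.0 -- maximally far apart
print(compare_rgb('Blue', 'Sky blue'))  # roughly 0.44
```

A nice side effect of comparing in RGB space is that 'Blue' vs 'Sky blue' is judged on actual colour distance rather than on spelling.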

NLP packages are better suited to longer text fragments and more sophisticated text analysis.

As you've discovered with 'black' and 'white', they make assumptions about similarity that are not right in the context of a simple list of products.

Instead you can see this not as an NLP problem, but as a data transformation problem. This is how I would tackle it.

To get the unique list of colors in both lists, use set operations on the colors found in the two product lists. "Set comprehensions" get a unique set of colors from each product list; then a union() of the two sets gets the unique colors from both product lists, with no duplicates. (Not really needed for 4 products, but very useful for 400, or 4000.)

products_a = [{'color': "White", 'size': "2' 3\""}, {'color': "Blue", 'size': "5' 8\""} ]
products_b = [{'color': "Black", 'size': "2' 3\""}, {'color': "Sky blue", 'size': "5' 8\""} ]

products_a_colors = {product['color'].lower() for product in products_a}
products_b_colors = {product['color'].lower() for product in products_b}
unique_colors = products_a_colors.union(products_b_colors)
print(unique_colors)

The colors are lowercased because in Python 'Blue' != 'blue' and both spellings are found in your product lists.

The above code finds these unique colors:

{'black', 'white', 'sky blue', 'blue'}

The next step is to build an empty color map.

colormap = {color: '' for color in unique_colors}
import pprint
pp = pprint.PrettyPrinter(indent=4, width=10, sort_dicts=True)
pp.pprint(colormap)

Result:

{
    'sky blue': '',
    'white': '',
    'black': '',
    'blue': ''
}

Paste the empty map into your code and fill out mappings for your complex colors like 'Sky blue'. Delete simple colors like 'white', 'black' and 'blue'. You'll see why below.

Here's an example, assuming a slightly bigger range of products with more complex or unusual colors:

colormap = {
    'sky blue': 'blue',
    'dark blue': 'blue',
    'bright red': 'red',
    'dark red': 'red',
    'burgundy': 'red'
}

This function helps you to group together colors that are similar based on your color map. Function color() maps complex colors onto base colors and drops everything into lower case to allow 'Blue' to be considered the same as 'blue'. (NOTE: the colormap dictionary should only use lowercase in its keys.)

def color(product_color):
    return colormap.get(product_color.lower(), product_color).lower()

Examples:

>>> color('Burgundy')
'red'
>>> color('Sky blue')
'blue'
>>> color('Blue')
'blue'

If a color doesn't have a key in the colormap, it passes through unchanged, except that it is converted to lowercase:

>>> color('Red')
'red'
>>> color('Turquoise')
'turquoise'

This is the scoring part. The product function from the standard library is used to pair items from product_a with items from product_b. Each pair is numbered using enumerate() because, as will become clear later, a score for a pair is of the form (pair_id, score). This way each pair can have more than one score.

'cartesian product' is just a mathematical name for what itertools.product() does. I've renamed it to avoid confusion with product_a and product_b. itertools.product() returns all possible pairs between two lists.

from itertools import product as cartesian_product
product_pairs = {
    pair_id: product_pair for pair_id, product_pair
    in enumerate(cartesian_product(products_a, products_b))
}
print(product_pairs)

Result:

{0: ({'color': 'White', 'size': '2\' 3"'}, {'color': 'Black', 'size': '2\' 3"'}),
 1: ({'color': 'White', 'size': '2\' 3"'}, {'color': 'Sky blue', 'size': '5\' 8"'}),
 2: ({'color': 'Blue', 'size': '5\' 8"'}, {'color': 'Black', 'size': '2\' 3"'}),
 3: ({'color': 'Blue', 'size': '5\' 8"'}, {'color': 'Sky blue', 'size': '5\' 8"'})
}

The list will be much longer if you have 100s of products.

Then here's how you might compile color scores:

color_scores = [(pair_id, 0.8) for pair_id, (product_a, product_b)
                in product_pairs.items()
                if color(product_a['color']) == color(product_b['color'])]
print(color_scores)

In the example data, one product pair matches via the color() function: pair number 3, with the 'Blue' product in product_a and the 'Sky blue' item in product_b. As the color() function evaluates both 'Sky blue' and 'blue' to the value 'blue', this pair is awarded a score, 0.8:

[(3, 0.8)]

"deep unpacking" is used to extract product details and the "pair id" of the current product pair, and put them in local variables for processing or display. There's a nice tutorial article about "deep unpacking" here.

The above is a blueprint for other rules. For example, you could write a rule based on size, and give that a different score, say, 0.5:

size_scores = [(pair_id, 0.5) for pair_id, (product_a, product_b)
               in product_pairs.items()
               if product_a['size'] == product_b['size']]
print(size_scores)

and here are the resulting scores based on the 'size' attribute.

[(0, 0.5), (3, 0.5)]

This means pair 0 scores 0.5 and pair 3 scores 0.5 because their sizes match exactly.

To get the total score for a product pair you might average the color and size scores:

import itertools

print()
print("Totals")
score_sources = [color_scores, size_scores]  # add more scores to this list
all_scores = sorted(itertools.chain(*score_sources))
pair_scores = itertools.groupby(all_scores, lambda x: x[0])
for pair_id, pairs in pair_scores:
    scores = [score for _, score in pairs]
    average = sum(scores) / len(scores)
    print(f"Pair {pair_id}: score {average}")
    for n, product in enumerate(product_pairs[pair_id]):
        print(f"  --> Item {n+1}: {product}")

Results:

Totals
Pair 0: score 0.5
  --> Item 1: {'color': 'White', 'size': '2\' 3"'}
  --> Item 2: {'color': 'Black', 'size': '2\' 3"'}
Pair 3: score 0.65
  --> Item 1: {'color': 'Blue', 'size': '5\' 8"'}
  --> Item 2: {'color': 'Sky blue', 'size': '5\' 8"'}

Pair 3, which matches colors and sizes, has the highest score and pair 0, which matches on size only, scores lower. The other two pairs have no score.
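For convenience, the steps above collapse into one self-contained script (same data, same colormap idea, same 0.8 and 0.5 weights):

```python
import itertools

products_a = [{'color': "White", 'size': "2' 3\""}, {'color': "Blue", 'size': "5' 8\""}]
products_b = [{'color': "Black", 'size': "2' 3\""}, {'color': "Sky blue", 'size': "5' 8\""}]

colormap = {'sky blue': 'blue'}  # extend with your own complex-colour mappings

def color(product_color):
    """Map complex colours onto base colours, lowercased."""
    return colormap.get(product_color.lower(), product_color).lower()

# All cross-list pairs, numbered so each pair can collect several scores.
product_pairs = dict(enumerate(itertools.product(products_a, products_b)))

color_scores = [(pair_id, 0.8) for pair_id, (a, b) in product_pairs.items()
                if color(a['color']) == color(b['color'])]
size_scores = [(pair_id, 0.5) for pair_id, (a, b) in product_pairs.items()
               if a['size'] == b['size']]

# Average all scores collected per pair.
all_scores = sorted(itertools.chain(color_scores, size_scores))
totals = {}
for pair_id, group in itertools.groupby(all_scores, lambda s: s[0]):
    scores = [score for _, score in group]
    totals[pair_id] = sum(scores) / len(scores)

print(totals)  # {0: 0.5, 3: 0.65}
```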

Nic

If your actual goal here is to handle colors on product descriptions you should treat this as a classification problem, though note that for short text this is going to be very hard. Luckily most items should use common colors so it shouldn't be hard to get good coverage. I suspect picking 12 or so colors and classifying into them would be easier than making good color name embeddings.

I would not use string distance metrics like Jaccard Distance. They just tell you how many of the letters or word-chunks are the same between two strings, they don't do anything with meaning.

As mentioned in the comments, normal word vectors won't find opposites for you. You can read more about why this is hard here. The advice of working with color name word embeddings is very good, and is the best way to get a similarity score.
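A crude sketch of that classification approach, with an entirely made-up keyword table mapping descriptive names to a handful of base colors (the table and helper names are illustrative, not a library API):

```python
# Map descriptive colour words to a handful of base colours; extend as needed.
BASE_COLOR_KEYWORDS = {
    'blue':  ['blue', 'navy', 'azure', 'teal'],
    'red':   ['red', 'burgundy', 'crimson', 'maroon'],
    'pink':  ['pink', 'salmon', 'rose'],
    'white': ['white', 'ivory', 'cream'],
    'black': ['black', 'ebony', 'onyx'],
}

def classify_color(name):
    """Return the base colour whose keyword appears in the name, else None."""
    words = name.lower().split()
    for base, keywords in BASE_COLOR_KEYWORDS.items():
        if any(word in keywords for word in words):
            return base
    return None

def color_similarity(name1, name2):
    """1.0 if both names map to the same base colour, else 0.0."""
    c1, c2 = classify_color(name1), classify_color(name2)
    return 1.0 if c1 is not None and c1 == c2 else 0.0

print(color_similarity('Blue', 'Sky blue'))  # 1.0
print(color_similarity('White', 'Black'))    # 0.0
print(classify_color('Salmon'))              # 'pink'
```

This yields binary scores rather than graded ones like the 0.78 in the question; for graded output you could combine it with an RGB-distance comparison between the base colors.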

polm23

Gensim has a Python implementation of Word2Vec which provides word similarity:

from gensim.models import Word2Vec
model = Word2Vec.load('path/to/your/model')
model.wv.similarity('Chennai', 'London')
Kum_R