I have a database of thousands of different colours. I want to map them to one of the colours I have in a list.

Previously this database held only a few hundred colours, and I managed with something like the code below. This is becoming unmaintainable: the database of unclassified colours keeps growing, and mapping them takes me a lot of time every week.

How can I improve this or what would be a better approach?

mapped_colours = ['Red', 'Green', 'Yellow', 'Blue', 'White', 'Black', 'Pink', 'Purple'...]

colour_map_dict = {
    'olive': 'Green',
    'khaki': 'Green'
}

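def classify_colour(colour):
    name = colour.lower()

    # First look for one of the target colour names in the input.
    for mapped in mapped_colours:
        if mapped.lower() in name:
            return mapped

    # Then fall back to known synonyms such as 'olive' -> 'Green'.
    # (Separate loop names avoid shadowing the `colour` argument.)
    for keyword, mapped in colour_map_dict.items():
        if keyword in name:
            return mapped

    return None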

Here is an example of the data coming in.

 Resin Dark Wash Indi
 Filtered Canyon
 999 Black
 Winter White/Dove Grey
 Midnight/min
 White & black
 Green/White
 Red/White
 Multicolor
 royal blue
 Black Plum Grey
 Rose/ Gold
 Red And White
 Offwht/Gg
 Black Gunmetal
 Berry/Black
 Caramel
 Blue Stone Bleached
 All Tan
 Pale Blush
 Tee
 White / Multi
 00-black
 Flat Foundation
 Baby Blue
 Beige Melange
Martijn Pieters
lennard
  • Much better wording than the last time :-) – Martijn Pieters Jun 15 '16 at 09:11
  • @MartijnPieters thanks :) – lennard Jun 15 '16 at 09:12
  • Could you store the mapping also in the database? – jonrsharpe Jun 15 '16 at 09:12
  • Create this mapping in your DB and manage it using DB engine? – Pax0r Jun 15 '16 at 09:12
  • @jonrsharpe the main problem I think is manually creating this mapping table is getting very time consuming. There is new data coming in everyday. – lennard Jun 15 '16 at 09:15
  • What is the core *problem* you're trying to solve? Is the issue in figuring out which of the colours in your list an arbitrary colour maps to? If so you could do something to minimise the distance in e.g. RGB space between that colour and those in your list, but doing this from just the name would be a bit tricky. What other information is in the database? – jonrsharpe Jun 15 '16 at 09:17
  • @jonrsharpe: I have a bit of context from a previously closed OT question here: the OP is trying to group colour names, because the manual process of adding new names every day is time consuming. Unfortunately, I don't think there is any better way than building that database manually, or buying a pre-existing database (but I don't think one exists). – Martijn Pieters Jun 15 '16 at 09:19
  • @jonrsharpe I've added an example of the data coming in. MartijnPieters is right. I'm trying to find a better way to do this. Maybe there is an NLP colour library, or sklearn classification, or maybe I'm currently dealing with this the best way. – lennard Jun 15 '16 at 09:21
  • @MartijnPieters sadly not, I suspect; if they only have the *names* of the colours this would be extremely difficult. – jonrsharpe Jun 15 '16 at 09:21
  • It seems like this is a more-or-less impossible task. Just within the few examples you give there are several that cannot be mapped to a single colour in your list (*"Multicolor"*?!) Unless you can get RGB/HSV for the actual colour being described it's hard to see how you could move forward in an automated way. Perhaps you could use something like Mechanical Turk or CrowdFlower to crowd-source the classification? – jonrsharpe Jun 15 '16 at 09:24

2 Answers

I'd start with a decent colour dictionary to map names to colour definitions in a given colour space (like RGB or CMYK or HSV). There are various sets available on the internet; you'll have to do work up-front to obtain them and normalise the data from each to use the same colour space. The more sources you can obtain, the richer your mapping; you appear to have a load of fashion colours (paint? cloth?) in your input set, and (commercial) fashion is forever trying to differentiate by inventing new colour names.

Because a colour space is finite, you can then algorithmically partition that space into a limited set of groups. Each colour name then automatically will map to a given group.
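As a rough sketch of that partitioning, one simple option is to pick one anchor point per basic colour and assign any RGB value to the nearest anchor. The anchor values and the `nearest_group` helper below are my own illustrative assumptions, not a standard; squared Euclidean distance in RGB is only a crude stand-in for perceived colour difference.

```python
# Anchor RGB values are illustrative assumptions, one per basic colour.
ANCHORS = {
    'Red': (255, 0, 0),
    'Green': (0, 128, 0),
    'Yellow': (255, 255, 0),
    'Blue': (0, 0, 255),
    'White': (255, 255, 255),
    'Black': (0, 0, 0),
    'Pink': (255, 192, 203),
    'Purple': (128, 0, 128),
}


def nearest_group(rgb):
    """Return the anchor name with the smallest squared RGB distance."""
    def distance(anchor):
        return sum((a - b) ** 2 for a, b in zip(ANCHORS[anchor], rgb))
    return min(ANCHORS, key=distance)


print(nearest_group((128, 128, 0)))   # CSS olive -> Green
print(nearest_group((65, 105, 225)))  # CSS royalblue -> Blue
```

Results can be surprising with plain RGB distance: with these anchors, CSS khaki (240, 230, 140) lands nearest to Pink rather than Green, so you may want more anchors per group or a different colour space such as HSV.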

Looking around a bit, a good starting point would be the Wikipedia lists of colour names. The compact list should be easily machine parseable, even in the basic HTML form, or you can use the MediaWiki API to get a raw format that's even easier to parse. Then perhaps add other standardised colour name dictionaries; the goal here is to get as many names as possible all mapping to the same colour space.

I'd store these names in a database table, and have a simple mathematical formula ready to divide the colour space into your basic groups. That way any colour in the table can be mapped to (say) RGB, and RGB to simple name.
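A minimal sketch of such a table, using an in-memory SQLite database. The table name `colour_names`, its columns and the `lookup_rgb` helper are hypothetical; the seed rows use CSS named-colour values purely as sample data.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE colour_names ('
    ' name TEXT PRIMARY KEY, r INTEGER, g INTEGER, b INTEGER)'
)

# Sample rows only; real data would come from the colour dictionaries
# you collect and normalise to one colour space.
conn.executemany(
    'INSERT INTO colour_names VALUES (?, ?, ?, ?)',
    [
        ('olive', 128, 128, 0),
        ('khaki', 240, 230, 140),
        ('royal blue', 65, 105, 225),
        ('midnight blue', 25, 25, 112),
    ],
)


def lookup_rgb(name):
    """Return (r, g, b) for a known colour name, or None if unknown."""
    return conn.execute(
        'SELECT r, g, b FROM colour_names WHERE name = ?',
        (name.strip().lower(),),
    ).fetchone()


print(lookup_rgb(' Royal Blue'))      # (65, 105, 225)
print(lookup_rgb('Filtered Canyon'))  # None
```

Whatever formula you choose for dividing the colour space can then be applied to the returned RGB tuple to get the basic group.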

Next, build a simple spell checker trained on your database of names, and run your input through that first. You have some pretty hard-to-work-with data there, but a trained colour-name spell checker can probably clean up Offwht/Gg to something that can be matched. You can also use natural text search to find partial matches.
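As a stand-in for a properly trained spell checker, here is a minimal sketch using fuzzy matching from the standard library's `difflib`. The `KNOWN_NAMES` list, the `0.7` cutoff and both helper functions are assumptions for illustration; in practice the names would come from your colour-name table.

```python
import difflib
import re

# Hypothetical sample of names already present in the colour-name table.
KNOWN_NAMES = ['off white', 'olive', 'khaki', 'midnight', 'caramel', 'royal blue']


def correct_name(raw, cutoff=0.7):
    """Return the closest known colour name, or None if nothing is close."""
    matches = difflib.get_close_matches(
        raw.strip().lower(), KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def correct_parts(raw):
    """Split compound inputs on '/' or '&' and correct each part."""
    return [correct_name(part) for part in re.split(r'[/&]', raw) if part.strip()]


print(correct_parts('Offwht/Gg'))  # ['off white', None]
print(correct_name('Carmel'))      # caramel
```

Parts that come back as `None` (like `Gg` here) are the ones that still need a human decision or a new entry in the table.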

Note that if you also receive image data alongside those colour names, you could find the most prevalent colour in each image; that gives you another name (from your input data) -> colour space mapping to use.
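One rough way to find the most prevalent colour, assuming you have already decoded the image into a list of `(r, g, b)` tuples: quantise each channel into coarse buckets and count. The `dominant_colour` helper and the bucket width of 32 are assumptions for illustration, and the pixel list below is made-up sample data.

```python
from collections import Counter


def dominant_colour(pixels, step=32):
    """Return the centre of the most common RGB bucket of width `step`."""
    def bucket(value):
        return (value // step) * step + step // 2

    counts = Counter(tuple(bucket(c) for c in px) for px in pixels)
    return counts.most_common(1)[0][0]


# Made-up pixels: mostly two shades of red, plus some blue.
pixels = [(250, 10, 10)] * 6 + [(245, 5, 12)] * 3 + [(10, 10, 240)] * 4
print(dominant_colour(pixels))  # (240, 16, 16)
```

This counts every pixel equally, so background or skin areas would have to be masked out first for the result to describe the item itself.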

Martijn Pieters
  • Thanks this seems like the best approach I have right now. I'll give it a go :) – lennard Jun 15 '16 at 09:36
  • @lennard: I've added in some more info; once you have a database of colour names you can also build a spell checker from those names to map those many *many* misspellings and abbreviations you appear to have to something that exists in your database. – Martijn Pieters Jun 15 '16 at 09:51
  • Great idea with the spell checker. I did not think of that. Yes, I do have the image data and have already done the work to find the most dominant colour based on the image. The images are of clothes, so some categories work really well, while for others that contain a lot of skin (bikinis) I had really low accuracy. Thanks for such a well-detailed answer. – lennard Jun 15 '16 at 09:59
  • @lennard: use [image classification](http://stackoverflow.com/questions/18899939/what-are-good-features-for-classifying-photos-of-clothing) in that case. – Martijn Pieters Jun 15 '16 at 10:00

Once you have a large database of names to correct answers (see Martijn's answer), you could use that database to train a classification algorithm, for example one from scikit-learn:

#!/usr/bin/env python3

from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

mapped_colours = ['Red', 'Green', 'Yellow', 'Blue', 'White', 'Black', 'Pink', 'Purple']

colour_map = [
    ('olive', 'Green'),
    ('khaki', 'Green'),
    ('snow white', 'White'),
    ('alice white', 'White'),
    ('pale blush', 'Pink'),
    ('baby blue', 'Blue'),
    ('midnight', 'Blue'),
    # ...and so on and so on - you'll need a lot of these
]

# A classifier classifies inputs into categories (colors in this case)
clf = svm.SVC(gamma=0.001, C=100.)

# A vectorizer turns strings into arrays which can be used as input
vectorizer = CountVectorizer()

# Train both the classifier and the vectorizer. This can take some time.
training = vectorizer.fit_transform([k for (k, v) in colour_map])
clf.fit(training, [mapped_colours.index(v) for (k, v) in colour_map])

# Predict some colors!
while True:
    query = input('Enter a color: ')
    guess = clf.predict(vectorizer.transform([query]))[0]
    print('Maybe', mapped_colours[guess])

Example run:

Enter a color: snow
Maybe White
Enter a color: dark khaki
Maybe Green
Enter a color: baby bedroom
Maybe Blue

You could alternatively have your model try to predict an RGB colour, if your input data is already in RGB form, and work from there.

Because of the very short input, the classifier will likely not get very smart, but if the database is large enough it could perhaps make the job of adding colors a bit easier: if the classifier guesses correctly, just add its guess as a color. If not, you will still need to manually classify it, but the classifier will pick up the correct answer in future runs.


Disclaimer: I'm not sure if SVC is a right fit (heh) for your problem, but it might be Good Enough and worth a try.

Wander Nauta