Finding Main Words is the Exact Cover Problem
Let's assume that every word is either a main word or an attribute word -- that is, there are no words that are sometimes used as main words, and sometimes as attributes. Then the problem is to classify every word as either main or attribute -- or equivalently, to determine the subset of all words that are main words (since the rest must be the attribute words). We don't know too much about attribute words, except that they should be pretty common. But we know something more definite about the subset of words that are main words: every product must contain exactly one word in this subset.
This problem can be modeled as finding an exact cover, where the ground set (the set of items that need to be "covered") is the set of all products, and every word gives us a set that we can potentially use to cover some of these ground-set elements: namely, the set of products that use that word. A solution to this problem will be a subset of words whose corresponding product sets together cover every product exactly once -- that is, it will be a candidate set of main words.
Unfortunately, finding an exact cover is an NP-hard problem, so no algorithm is known that solves every instance efficiently -- and none can exist unless P = NP.
Formulation as Integer Linear Program
Nevertheless, it's often the case that either an exact answer or a good-quality heuristic answer can be found by expressing the problem as an integer linear program (ILP) and handing it to an ILP solver, such as the freely available lp_solve or the expensive (but much faster) CPLEX. There are two approaches:
- Solve the problem exactly using integer LP: This can take a long time, but is guaranteed to find an exact solution if one exists.
- Solve the problem heuristically using ordinary, continuous LP: This takes a fairly short time, even for large inputs, but will not usually produce a feasible solution directly. Instead of being assigned either 1 or 0, most variables will be given a value somewhere in the range [0, 1] by the solver. These values can often be interpreted as confidence scores and simply rounded to 0 or 1. Doing this doesn't guarantee an optimal or even feasible solution, but it's often possible to obtain a good one by performing the rounding and then "fixing up" any feasibility violations (specifically, any product that has no assigned main word, or [EDIT 7/3/2016] any product that contains two or more main words) with simple local hill-climbing search heuristics; a sketch of this repair step follows the next paragraph.
Conveniently, switching between the two is as simple as specifying whether or not the variables are constrained to be integers; the objective function and all the constraints stay the same.
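Here is a minimal Python sketch of that rounding-and-repair step, under stated assumptions: products is a list of word lists, frac maps each word to its fractional value from the LP solver, and all names are hypothetical placeholders rather than any solver's API.

# A minimal sketch of rounding an LP-relaxation solution and then
# repairing feasibility violations by simple local search.
def round_and_repair(products, frac, threshold=0.5, max_iters=100):
    main = {w for w, v in frac.items() if v >= threshold}
    for _ in range(max_iters):
        violated = False
        for words in products:
            chosen = [w for w in words if w in main]
            if len(chosen) == 0:
                # No main word in this product: promote the word with
                # the highest fractional value.
                main.add(max(words, key=lambda w: frac[w]))
                violated = True
            elif len(chosen) > 1:
                # Two or more main words: demote all but the one with
                # the highest fractional value.
                best = max(chosen, key=lambda w: frac[w])
                main -= set(chosen) - {best}
                violated = True
        if not violated:
            return main  # Every product now has exactly one main word.
    return main  # Gave up; some constraints may still be violated.

This naive repair loop can cycle on awkward inputs (fixing one product can break another), hence the iteration cap; a real implementation might restart with a different rounding threshold instead.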
Suppose there are n distinct words, and m products. To formulate this problem as an (I)LP, we need a variable x_j for each distinct word j, to which the solver will assign the value 1 if the j-th word should be considered a main word, or 0 if it should be considered an attribute word (or something in between, if we are only solving the continuous relaxation). Let p_i_j be 1 if product i uses word j, and 0 otherwise (this information is present in the input, and can be represented as a matrix). Then for each product i, we need the constraint
p_i_1*x_1 + p_i_2*x_2 + ... + p_i_n*x_n = 1.
This forces product i to have exactly one main word. (Most words won't be present in a given product, so the corresponding terms have p_i_j = 0 and can be omitted; doing this reduces the size of the problem description considerably.)
Those constraints are all we actually need, but an (I)LP also lets us specify an objective function to minimise or maximise. So we could, for example, find the largest (or smallest) set of main words by maximising (or minimising) the function
x_1 + x_2 + ... + x_n
This formulation is pretty flexible: We can easily weight some words more highly, or prevent some words (e.g. words that appear too frequently) from being considered as main words by constraining their x_j values to 0.
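To make this concrete, here is a short Python sketch that writes the above ILP in CPLEX LP format, given the products as lists of words; the function name and file layout are my own assumptions, not any solver's API, and the resulting file can be fed to lp_solve, CPLEX, SCIP or NEOS as in the example below.

# Sketch: emit the "minimise the number of main words" ILP in CPLEX LP
# format. `products` is a list of products, each a list of distinct words;
# words are numbered 1..n in sorted order, matching the example below.
def write_lp(products, filename="main_words.lp"):
    words = sorted({w for p in products for w in p})
    index = {w: j + 1 for j, w in enumerate(words)}  # word -> 1-based j
    all_vars = ["x_%d" % j for j in range(1, len(words) + 1)]
    with open(filename, "w") as f:
        f.write("Minimize\n " + " + ".join(all_vars) + "\n")
        f.write("Subject To\n")
        for p in products:
            # One term per word actually present in the product; absent
            # words have p_i_j = 0, so their terms are simply omitted.
            f.write(" " + " + ".join("x_%d" % index[w] for w in p) + " = 1\n")
        f.write("Binary\n " + " ".join(all_vars) + "\nEnd\n")

Called with the nine products from the example data fragment, this should reproduce the LP file shown below (up to the order of constraints and terms).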
Example
Let's actually calculate a word set for the example data fragment provided.
The distinct words are:
1 BLACK
2 BUS
3 CAS
4 FEM
5 GRE
6 JACKET
7 L
8 M
9 MA
10 RED
11 SHIRT
12 SHOES
13 SPO
14 XS
[EDIT 7/3/2016]: We can express the ILP problem in CPLEX LP format, adding an objective function that (somewhat arbitrarily) minimises the total number of words chosen as main words:
Minimize
x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_10 + x_11 + x_12 + x_13 + x_14
Subject To
x_13 + x_11 + x_7 = 1
x_3 + x_11 + x_8 + x_10 = 1
x_3 + x_11 + x_7 + x_10 = 1
x_2 + x_11 + x_14 + x_10 = 1
x_3 + x_12 = 1
x_12 + x_1 = 1
x_6 + x_4 + x_8 = 1
x_6 + x_4 + x_5 = 1
x_3 + x_6 + x_9 + x_10 = 1
Binary
x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 x_11 x_12 x_13 x_14
End
Submitting a text file containing the above code block to the online SCIP solver at NEOS, we get back (reportedly in 0.00s time!) an optimal solution which assigns the 3 variables x_6, x_11 and x_12 to 1, and the rest to 0: this corresponds to choosing JACKET, SHIRT and SHOES as main words.
In this case, choosing to minimise the number of words chosen as main words got us the answer we know to be correct, but in general there can be multiple sets of main words that satisfy the criteria, and it's not obvious how to choose which is best. For example, if we had decided to maximise the number of main words instead (as I tried initially, FWIW), we could have gotten the solution {BLACK, BUS, CAS, FEM, SPO} instead. In general though, the more input that is provided, the fewer satisfying sets of words there will be, and so the less the choice of objective function will matter.
Finding Attribute Categories is the Graph Colouring Problem
After determining which words are main words, we want to break the remaining attribute words up into categories. This problem is even harder: Essentially all we can say is that whenever two words are from the same category, like CAS and BUS, they must not both appear in the same product. But if that is the only constraint we apply, then an easy solution is just to put every single attribute word into its own category. Of course that doesn't tell us anything useful, so to avoid this possibility, we can ask that the total number of categories be as small as possible.
What we then have is the (Minimum) Graph Colouring problem: For each attribute word, create a vertex, and create an edge between any two vertices whenever their words both appear in the same product. A graph colouring is an assignment of "colours" (or equivalently, simply categories) to the vertices (words) that always assigns different colours to the two endpoints of an edge. A minimum such colouring uses the fewest possible colours (categories).
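If a provably minimum colouring isn't required, a greedy colouring gives a quick feasible answer. Here is a Python sketch under the same assumptions as before (products is a list of word lists, main_words is the set found earlier, and the names are hypothetical):

from itertools import combinations

# Sketch: build the conflict graph over attribute words, then colour it
# greedily. Greedy colouring always yields a valid category assignment,
# but not necessarily one with the minimum number of categories.
def categorise(products, main_words):
    attrs = {w for p in products for w in p if w not in main_words}
    adj = {w: set() for w in attrs}
    for p in products:
        in_product = [w for w in p if w not in main_words]
        for u, v in combinations(in_product, 2):
            adj[u].add(v)
            adj[v].add(u)
    colour = {}
    # Colour high-degree words first -- a common greedy ordering.
    for w in sorted(adj, key=lambda w: -len(adj[w])):
        used = {colour[u] for u in adj[w] if u in colour}
        colour[w] = next(c for c in range(len(adj) + 1) if c not in used)
    return colour  # word -> category number (0, 1, 2, ...)

The number of distinct colours this produces also serves as an upper bound H on the number of categories, which is useful for the ILP formulation discussed next.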
There are several different ways to formulate a Graph Colouring problem as an ILP; you could use the one given here. Note that there will be many more variables than in the Exact Cover ILP, and with some formulations there will be as many as k! different ways to encode a particular breakdown into k categories (essentially, as many category assignments as there are ways to order k items), so solving this problem to optimality will likely take much longer.
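For reference, one common assignment-style formulation (a sketch; not necessarily the same as the one linked above) looks like this. Pick an upper bound H on the number of categories needed (e.g. from the greedy heuristic above), and introduce binary variables w_j_k, meaning "word j gets colour k", and y_k, meaning "colour k is used at all". Then minimise

y_1 + y_2 + ... + y_H

subject to, for every word j,

w_j_1 + w_j_2 + ... + w_j_H = 1

(each word gets exactly one colour), and, for every edge (u, v) and every colour k,

w_u_k + w_v_k <= y_k

(conflicting words can't share a colour, and any colour that is used must be paid for in the objective). The k! blowup mentioned above arises because permuting the colour indices of any solution yields a different encoding of the same categorisation; adding simple symmetry-breaking constraints such as y_1 >= y_2 >= ... >= y_H mitigates this somewhat.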