Finding Main Words is the Exact Cover Problem
Let's assume that every word is either a main word or an attribute word -- that is, there are no words that are sometimes used as main words, and sometimes as attributes. Then the problem is to classify every word as either main or attribute -- or equivalently, to determine the subset of all words that are main words (since the rest must be the attribute words). We don't know too much about attribute words, except that they should be pretty common. But we know something more definite about the subset of words that are main words: every product must contain exactly one word in this subset.
This problem can be modeled as finding an exact cover, where the ground set (the set of items that need to be "covered") is the set of all products, and every word gives us a set that we can potentially use to cover some of these ground-set elements: namely, the set of products that use that word. A solution to this problem will be a subset of words whose corresponding product sets together cover every product exactly once -- that is, it will be a candidate set of main words.
Unfortunately, finding an exact cover is an NP-hard problem, so no algorithm is known that solves every instance efficiently -- and none can exist unless P = NP.
Formulation as Integer Linear Program
Nevertheless, it's often the case that either an exact answer or a good-quality heuristic answer can be found by expressing the problem as an integer linear program (ILP) and handing it to an ILP solver, such as the freely available lp_solve or the expensive (but much faster) CPLEX. There are two approaches:
- Solve the problem exactly using integer LP: This can take a long time, but is guaranteed to find an exact solution if one exists.
- Solve the problem heuristically using ordinary, continuous LP: This takes a fairly short time, even for large inputs, but will not usually produce a feasible solution directly. Instead of being assigned either 1 or 0, most variables will be given a value somewhere in the range [0, 1] by the solver. These values can often be interpreted as confidence scores and simply rounded to 0 or 1. Doing this doesn't guarantee an optimal or even feasible solution, but it's often possible to obtain a good one by performing the rounding and then "fixing up" any feasibility violations (specifically, any product that has no assigned main word, or [EDIT 7/3/2016] any product that contains two or more main words) with simple local hill-climbing search heuristics; a sketch of this repair step follows the next paragraph.
Conveniently, switching between the two is as simple as specifying whether or not the variables are constrained to be integers; the objective function and all the constraints stay the same.
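Here is a minimal Python sketch of that rounding-and-repair step, under stated assumptions: products is a list of word lists, frac maps each word to its fractional value from the LP solver, and all names are hypothetical placeholders rather than any solver's API.

# A minimal sketch of rounding an LP-relaxation solution and then
# repairing feasibility violations by simple local search.
def round_and_repair(products, frac, threshold=0.5, max_iters=100):
    main = {w for w, v in frac.items() if v >= threshold}
    for _ in range(max_iters):
        violated = False
        for words in products:
            chosen = [w for w in words if w in main]
            if len(chosen) == 0:
                # No main word in this product: promote the word with
                # the highest fractional value.
                main.add(max(words, key=lambda w: frac[w]))
                violated = True
            elif len(chosen) > 1:
                # Two or more main words: demote all but the one with
                # the highest fractional value.
                best = max(chosen, key=lambda w: frac[w])
                main -= set(chosen) - {best}
                violated = True
        if not violated:
            return main  # Every product now has exactly one main word.
    return main  # Gave up; some constraints may still be violated.

This naive repair loop can cycle on awkward inputs (fixing one product can break another), hence the iteration cap; a real implementation might restart with a different rounding threshold instead.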
Suppose there are n distinct words, and m products. To formulate this problem as an (I)LP, we need a variable x_j for each distinct word j, to which the solver will assign the value 1 if the j-th word should be considered a main word, or 0 if it should be considered an attribute word (or something in between, if we are only solving the continuous relaxation). Let p_i_j be 1 if product i uses word j, and 0 otherwise (this information is present in the input, and can be represented as a matrix). Then for each product i, we need the constraint
p_i_1*x_1 + p_i_2*x_2 + ... + p_i_n*x_n = 1.
This forces product i to have exactly one main word. (Most words won't be present in a given product, so the corresponding terms have p_i_j = 0 and can be omitted; doing this reduces the size of the problem description considerably.)
Those constraints are all we actually need, but an (I)LP also lets us specify an objective function to minimise or maximise. So we could, for example, find the largest (or smallest) set of main words by maximising (or minimising) the function
x_1 + x_2 + ... + x_n
This formulation is pretty flexible: We can easily weight some words more highly, or prevent some words (e.g. words that appear too frequently) from being considered as main words by constraining their x_j values to 0.
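To make this concrete, here is a short Python sketch that writes the above ILP in CPLEX LP format, given the products as lists of words; the function name and file layout are my own assumptions, not any solver's API, and the resulting file can be fed to lp_solve, CPLEX, SCIP or NEOS as in the example below.

# Sketch: emit the "minimise the number of main words" ILP in CPLEX LP
# format. `products` is a list of products, each a list of distinct words;
# words are numbered 1..n in sorted order, matching the example below.
def write_lp(products, filename="main_words.lp"):
    words = sorted({w for p in products for w in p})
    index = {w: j + 1 for j, w in enumerate(words)}  # word -> 1-based j
    all_vars = ["x_%d" % j for j in range(1, len(words) + 1)]
    with open(filename, "w") as f:
        f.write("Minimize\n " + " + ".join(all_vars) + "\n")
        f.write("Subject To\n")
        for p in products:
            # One term per word actually present in the product; absent
            # words have p_i_j = 0, so their terms are simply omitted.
            f.write(" " + " + ".join("x_%d" % index[w] for w in p) + " = 1\n")
        f.write("Binary\n " + " ".join(all_vars) + "\nEnd\n")

Called with the nine products from the example data fragment, this should reproduce the LP file shown below (up to the order of constraints and terms).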
Example
Let's actually calculate a word set for the example data fragment provided.
The distinct words are:
1 BLACK
2 BUS
3 CAS
4 FEM
5 GRE
6 JACKET
7 L
8 M
9 MA
10 RED
11 SHIRT
12 SHOES
13 SPO
14 XS
[EDIT 7/3/2016]: We can express the ILP problem in CPLEX LP format, adding an objective function that (somewhat arbitrarily) minimises the total number of words chosen as main words:
Minimize
x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_10 + x_11 + x_12 + x_13 + x_14
Subject To
x_13 + x_11 + x_7 = 1
x_3 + x_11 + x_8 + x_10 = 1
x_3 + x_11 + x_7 + x_10 = 1
x_2 + x_11 + x_14 + x_10 = 1
x_3 + x_12 = 1
x_12 + x_1 = 1
x_6 + x_4 + x_8 = 1
x_6 + x_4 + x_5 = 1
x_3 + x_6 + x_9 + x_10 = 1
Binary
x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 x_11 x_12 x_13 x_14
End
Submitting a text file containing the above code block to the online SCIP solver at NEOS, we get back (reportedly in 0.00s time!) an optimal solution which assigns the 3 variables x_6, x_11 and x_12 to 1, and the rest to 0: this corresponds to choosing JACKET, SHIRT and SHOES as main words.
In this case, choosing to minimise the number of words chosen as main words got us the answer we know to be correct, but in general there can be multiple sets of main words that satisfy the criteria, and it's not obvious how to choose which is best. For example, if we had decided to maximise the number of main words instead (as I tried initially, FWIW), we could have gotten the solution {BLACK, BUS, CAS, FEM, SPO} instead. In general though, the more input that is provided, the fewer satisfying sets of words there will be, and so the less the choice of objective function will matter.
Finding Attribute Categories is the Graph Colouring Problem
After determining which words are main words, we want to break the remaining attribute words up into categories. This problem is even harder: Essentially all we can say is that whenever two words are from the same category, like CAS and BUS, they must not both appear in the same product. But if that is the only constraint we apply, then an easy solution is just to put every single attribute word into its own category. Of course that doesn't tell us anything useful, so to avoid this possibility, we can ask that the total number of categories be as small as possible.
What we then have is the (Minimum) Graph Colouring problem: For each attribute word, create a vertex, and create an edge between any two vertices whenever their words both appear in the same product. A graph colouring is an assignment of "colours" (or equivalently, simply categories) to the vertices (words) that always assigns different colours to the two endpoints of an edge. A minimum such colouring uses the fewest possible colours (categories).
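If a provably minimum colouring isn't required, a greedy colouring gives a quick feasible answer. Here is a Python sketch under the same assumptions as before (products is a list of word lists, main_words is the set found earlier, and the names are hypothetical):

from itertools import combinations

# Sketch: build the conflict graph over attribute words, then colour it
# greedily. Greedy colouring always yields a valid category assignment,
# but not necessarily one with the minimum number of categories.
def categorise(products, main_words):
    attrs = {w for p in products for w in p if w not in main_words}
    adj = {w: set() for w in attrs}
    for p in products:
        in_product = [w for w in p if w not in main_words]
        for u, v in combinations(in_product, 2):
            adj[u].add(v)
            adj[v].add(u)
    colour = {}
    # Colour high-degree words first -- a common greedy ordering.
    for w in sorted(adj, key=lambda w: -len(adj[w])):
        used = {colour[u] for u in adj[w] if u in colour}
        colour[w] = next(c for c in range(len(adj) + 1) if c not in used)
    return colour  # word -> category number (0, 1, 2, ...)

The number of distinct colours this produces also serves as an upper bound H on the number of categories, which is useful for the ILP formulation discussed next.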
There are several different ways to formulate a Graph Colouring problem as an ILP; you could use the one given here. Note that there will be many more variables than in the Exact Cover ILP, and with some formulations there will be as many as k! different ways to encode a particular breakdown into k categories (essentially, as many category assignments as there are ways to order k items), so solving this problem to optimality will likely take much longer.
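For reference, one common assignment-style formulation (a sketch; not necessarily the same as the one linked above) looks like this. Pick an upper bound H on the number of categories needed (e.g. from the greedy heuristic above), and introduce binary variables w_j_k, meaning "word j gets colour k", and y_k, meaning "colour k is used at all". Then minimise

y_1 + y_2 + ... + y_H

subject to, for every word j,

w_j_1 + w_j_2 + ... + w_j_H = 1

(each word gets exactly one colour), and, for every edge (u, v) and every colour k,

w_u_k + w_v_k <= y_k

(conflicting words can't share a colour, and any colour that is used must be paid for in the objective). The k! blowup mentioned above arises because permuting the colour indices of any solution yields a different encoding of the same categorisation; adding simple symmetry-breaking constraints such as y_1 >= y_2 >= ... >= y_H mitigates this somewhat.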