Group strings by longest common starting substring

Question

Here's the problem. Say I have these strings:

apple ipad mini 32gb
apple ipad mini 64gb
apple ipad air 64gb
apple ipad air 32gb
panasonic gh4
samsung s2 galaxy
samsung s2 galaxy red
samsung s3 galaxy

I want these strings to be grouped like this:

apple ipad mini: [apple ipad mini 32gb, apple ipad mini 64gb]
apple ipad air: [apple ipad air 64gb, apple ipad 32gb]
panasonic gh4: [panasonic gh4]
samsung s2 galaxy: [samsung s2 galaxy, samsung s2 galaxy red]
samsung s3 galaxy

The point is to separate name of the item from its attributes(color, memory capacity etc).

I used this algorithm for finding longest common substring: link

Can you guys share your ideas? No code or implementation needed. Thank you.

Edited:

    this.data = _.sortBy(this.data, function(item) {
        return item.title;
    });

    var i = 0;
    var groups = {};
    var len = this.data.length - 1;
    while(i < len) {
        var key = this.lcs(this.data[i][this.attr], this.data[i+1][this.attr]) || this.data[i][this.attr];
        groups[key] = true;
        i++;
        while(this.data[i][this.attr].startsWith(key) && i < len) {
            i++;
        }
    }
    console.log(groups)

This works great(tested only adding keys). But I want to add samsung s3 galaxy to list too. Thanks for help guys!

are you wanting to debate the merits of the longest common substring method over others? What ideas are you looking for? Alternatives? — jusopi, Dec 10 '15 at 19:36
Im still working on solution, changed the code for like 40 times. Im asking for general pattern or an approach for this problem. thank you for response. — Alex Shevchenko, Dec 10 '15 at 19:37
If you can assume that the first set of text is the *make* (e.g. apple, panasonic, etc.) then you can at least reduce your reliance on the *longest common substring* implementation by one order of magnitude. If you can assume that the 2nd set of text is the "model" then you can do so further... the issue arises when you get these strings that don't conform to the "make, model, details" order. — jusopi, Dec 10 '15 at 19:40
I'd also rephrase your question as it's rather open and doesn't clearly direct those trying to answer the question as to what exactly your problem, if any, is. — jusopi, Dec 10 '15 at 19:42
another kink in this is: what if you had "panasonic gh4 red" where the make and model is only given 2 sets of text vs the others who get 3 — jusopi, Dec 10 '15 at 19:44
Actually, they dont go in order like "make, model, details". These are items I parse from online shops so theres no specific order. Which part wasnt unclear? Im up to rephrase it. Thanks. — Alex Shevchenko, Dec 10 '15 at 19:49

Nina Scholz · Answer 1 · 2015-12-10T21:13:08.057

An attempt to solve the problem with comparing two strings with same words and a look up if the length of the words is smaller then the previous path.

function groupObject(i, l) {
    return { item: i, length: l };
}

function group(r, a, i, o) {
    var rr = r.item.split(' '),
        aa = a.split(' '),
        j = 0,
        key, keys = [];

    while (aa[j] === rr[j]) {
        keys.push(aa[j]);
        j++;
    }
    if (keys.length < r.length && i < o.length - 1) {
        return group(groupObject(o[i + 1], 0), a, Number.MAX_VALUE, o);
    }
    key = keys.join(' ');
    if (!key || keys.length < r.length && i === o.length - 1) {
        key = a;
    }
    grouped[key] = grouped[key] || [];
    grouped[key].push(a);
    return groupObject(a, keys.length);
}

var data = ['apple ipad mini 32gb', 'apple ipad mini 64gb', 'apple ipad air 64gb', 'apple ipad air 32gb', 'panasonic gh4', 'samsung s2 galaxy', 'samsung s2 galaxy red', 'samsung s3 galaxy'],
    grouped = {};

data.reduce(group, groupObject(data[1], 0));
document.write('<pre>' + JSON.stringify(grouped, 0, 4) + '</pre>');

Thank you, Nina. Great job, helped a lot. Is it possible to list Samsung s3 galaxy too as separate key? I mean add all items that have no match to list(though it has prefix 'samsung', but its different item). Another question: how can I upvote you guys? — Alex Shevchenko, Dec 10 '15 at 20:50

גלעד ברקן · Accepted Answer · 2015-12-10T21:45:34.887

1

If you just want to simply group by longest common prefix (that means "apple ipad mini" will be chosen even though "apple ipad" would yield a larger group), then maybe something like this?

sort the list
i = 0
while i < end of list:
  key = longest common prefix of list[i] & list[i + 1]
        or list[i] if the common prefix is less than (1?) words or i is the last index
  groups[key] = list[i++]
  while key is prefix of list[i]:
    add list[i++] to groups[key]

edited Dec 10 '15 at 21:45

answered Dec 10 '15 at 19:59

גלעד ברקן

23,602
3
25
61

Brilliant idea. I havent considered sorting. Is it possible to list Samsung s3 galaxy too as separate key? I mean add all items that have no match to list(though it has prefix 'samsung', but its different item). – Alex Shevchenko Dec 10 '15 at 20:48
@AlexShevchenko according to the algorithm I suggested, "Samsung s3 galaxy," would generate it's own key - because when `i` is the last index, "Samsung s3 galaxy" would not have the current key as a prefix (the current key would be "Samsung s2 galaxy"). (By the way, to vote, click the little arrows above or below the vote score.) – גלעד ברקן Dec 10 '15 at 21:40

Group strings by longest common starting substring

2 Answers2

Linked