2

I have an array of 800 sentences. I want to remove all duplicates (sentences that have the same exact words, but in different order) from the array. So for example "this is a sentence" and "is this a sentence" are duplicates. Only one one of them should remain in the array (it doesn't matter which one).

My first idea was to copy them one by one to a new array, each time checking to see if the sentence already exists in the new array. I would accomplish this by looping through all the elements in the new array and using the following code for comparing the sentences:

Using jQuery to compare two arrays of Javascript objects

However, this quickly becomes too intensive in terms of calculations and causes the javascript engine to be nonresponsive.

Any ideas on how to make the algorithm more efficient would be appreciated.

Community
  • 1
  • 1
stepanian
  • 11,373
  • 8
  • 43
  • 63
  • If your main concern is responsiveness, you may consider doing this (pseudo)asynchronously. The ideal tool for that would be HTML5's web-workers, but if you need wider browser support, there is a classic trick: use a setInterval() to trigger a function that does part of the work. With either of these approaches, the process will actually take longer, but the browser will be fully responsive in the meanwhile. – Edurne Pascual Mar 31 '11 at 22:13
  • Does the order of the sentences matter? – Peter Olson Mar 31 '11 at 22:16
  • Thanks. That would help to keep the browser responsive, but it will still take a very long time. Plus, for the app functionality, it is required for this to be synchronous. I was looking for a way to make this more efficient if that is even possible. – stepanian Mar 31 '11 at 22:20
  • @Peter - no the order does not matter. Sorting them is not computationally intensive, so I can sort it later. – stepanian Mar 31 '11 at 22:21

6 Answers6

3

Use an Object as a lookup to get a quick hashtable-backed check. That means using string as your key type, which means normalising the case/ordering/etc of the words first to get a unique key for each combination of words.

// Get key for sentence, removing punctuation and normalising case and word order
// eg 'Hello, a  horse!' -> 'x_a hello horse'
// the 'x_' prefix is to avoid clashes with any object properties with undesirable
// special behaviour (like prototype properties in IE) and get a plain lookup
//
function getSentenceKey(sentence) {
    var trimmed= sentence.replace(/^\s+/, '').replace(/\s+$/, '').toLowerCase();
    var words= trimmed.replace(/[^\w\s]+/g, '').replace(/\s+/, ' ').split(' ');
    words.sort();
    return 'x_'+words.join(' ');
}

var lookup= {};
for (var i= sentences.length; i-->0;) {
    var key= getSentenceKey(sentences[i]);
    if (key in lookup)
        sentences.splice(i, 1);
    else
        lookup[key]= true;
}

Would need some work if you need to support non-ASCII characters (\w doesn't play well with Unicode in JS, and the question of what constitutes a word in some languages is a difficult one). Also, is "foo bar foo" the same sentence as "bar bar foo"?

bobince
  • 528,062
  • 107
  • 651
  • 834
  • Very interesting! Let me try to understand it first and then try it. Thanks! – stepanian Mar 31 '11 at 22:27
  • I like it. Each sentence is sorted only once and duplicate checks are constant time. – RandomEngy Mar 31 '11 at 22:30
  • They are all ASCII. No, "foo bar foo" is not the same sentence as "bar bar foo". They have to be the same count of each word. – stepanian Mar 31 '11 at 22:31
  • I am sure you have this covered, but there is no conflict there between the splicing of the sentences array within the loop and using the sentences.length as the condition, right? My brain is tired... – stepanian Mar 31 '11 at 22:59
  • The `i-->0` is a reverse-iterator idiom. Because a splice only affects the right-hand part of the list, moving the index leftwards avoids jumping over any elements when a splice occurs. – bobince Mar 31 '11 at 23:03
  • Sorry, I never got to try your solution, Climbage's solution was simpler, so I tried it first. I am sure your solution would work too. I didn't need the complexity of deleting the punctuation because the sentences would not have any (I neglected to mention that as part of the question). Thank you very much! – stepanian Mar 31 '11 at 23:56
3

How about this?

va = ["this is a sentence", "sentence this is", "sentence this is a"]
vb = {} // dictionary of combined sorted words in each sentence
vc = [] // output list of sentences without duplicates 

for (i=0;i<va.length;i++){
    // split sentence, sort words, and recombine (this is a sentence => a is sentence this)
    var combined = va[i].split(" ").sort().join(" "); 

    if (!vb[combined]){       // if set of combined sorted words doesn't exist already
        vc.push(va[i]);      // sentence isn't duplicated, push to output list
        vb[combined] = true  // add set to dictionary
    }
}

alert(vc.join("\n"))
Mike Park
  • 10,845
  • 2
  • 34
  • 50
2

Here's something to try. I didn't test its performance on large arrays, but I think it should be ok. No jQuery needed.

function removeDuplicates(array)
{
    var new_array = [];
    for(var i=0; i<array.length; i++){
        // convert current sentence to sorted, lowercase string
        var sen = array[i].split(" ").sort().join(" ");
        if(new_array.indexOf(sen) == -1){
            // no matches, let's add it!
            new_array.push(sen);
        }
    }
    return new_array;
}

Array.prototype.indexOf = function(item, optional_start_index)
{
    for(var i=optional_start_index||0; i<this.length; i++){
        if(this[i] == item) return i;
    }
    return -1;
}

Use it like this:

var a = ["this is a name", "name is this a", "this name is a", "hello there"];
var clean_array = removeDuplicates(a);
alert(clean_array); // outputs: a is name this,hello there
Makram Saleh
  • 8,613
  • 4
  • 27
  • 44
  • -1 Extending the native prototype is trouble. Don't please. Your breaking my `for in` loops. – Raynos Mar 31 '11 at 22:24
  • I will also take into consideration Raynos's comment. – stepanian Mar 31 '11 at 22:30
  • 1
    Extending the `Array` prototype is perfectly legit for `indexOf` because that is a standardised ECMAScript Fifth Edition method: you aren't defining your own, potentially-clashing behaviour, you're just bringing older browsers up to speed. However, if you do this, you should check for the pre-existance of a native `indexOf` first, and really you should implement the whole of the `indexOf` interface including the optional start-index argument and the use of `===` comparison. (Plus there is an obscure distinction about `undefined` values.) @Raynos: never use `for...in` over an `Array`. – bobince Mar 31 '11 at 22:33
  • Thanks @bobince for the clarification! I adjusted the prototype function to support an `optional_start_index`. – Makram Saleh Mar 31 '11 at 22:41
  • @bobince it [breaks](http://jsfiddle.net/kPB4e/1/) my for loops. I like using `for...in` on my Arrays when I treat them as an unordered set. Or I like iterating over an object without having to worry about whether I implemented it as an Array. I would recommend defining it statically on `Array.indexOf` or using the method defined on jQuery or whatever other library you have on the page. – Raynos Mar 31 '11 at 22:44
1
Luca Filosofi
  • 30,905
  • 9
  • 70
  • 77
  • Thank you for your answer. Are you suggesting that the method you mentioned for comparing two strings is more efficient than the jQuery method I am using? Do you believe that's where the inefficiency is? My assumption was that it was the sheer number of comparisons that was causing the slowdown. It seems to me that the method you linked may actually be more intensive that the jQuery method, but I will try it anyway. Thanks. – stepanian Mar 31 '11 at 22:19
1

Sort the array of sentences, and then loop through it and delete an item if it is the same as the previous one:

texts.sort();
for(var i = 1; i < texts.length; i++){
    if(texts[i] === texts[i-1]){
        texts.splice(i,1);
        i--;
     }
}

I tested this in an array with 800 strings, and it seemed reasonably fast.

EDIT: Sorry, didn't read your question very carefully

Peter Olson
  • 139,199
  • 49
  • 202
  • 242
1

This is a very simple implementation that takes advantage of some jQuery.

Check the demo here ->

And the source:

var arr = ["This is a sentence", "Is this a sentence", "potatoes"];
var newArr = [];
var sortedArr = [];
$.each(arr, function(i) {
    var temp = this.toLowerCase().split(" ").sort(function(a,b) {
            return a > b;
    }).join(' ');
    if ($.inArray(temp, sortedArr) == -1) {
        sortedArr.push(temp);
        newArr.push(arr[i]);   
    }
});

//output
$.each(newArr, function() {
    document.write(this + '<br />');
});

It uses three arrays: a source, a collection of sorted sentences to match against, and the output array. Matching is performed by splitting a sentence by spaces, converting to lowercase, and sorting the words alphabetically, then rebuilding the sentence string. If that particular combo has been seen before, it is not added to the results. If it has not, it is added.

The loop at the end simply outputs the resultant array.

Ender
  • 14,995
  • 8
  • 36
  • 51