Intersecting texts to find common words

Question

I'm trying to find out which would be the most optimal way of intersection a set of texts and find the common words in them. Given this scenario:

var t1 = 'My name is Mary-Ann, and I come from Kansas!';
var t2 = 'John, meet Mary, she comes from far away';
var t3 = 'Hi Mary-Ann, come here, nice to meet you!';

intersection result should be:

var result =["Mary"];

It should be able to ignore punctuation marks like .,!?-

Would a solution with regular expressions be optimal?

Denys Séguret · Accepted Answer · 2014-05-28T16:15:46.097

Here's a tested solution :

function intersect() {
   var set = {};
   [].forEach.call(arguments, function(a,i){
     var tokens = a.match(/\w+/g);
     if (!i) {
       tokens.forEach(function(t){ set[t]=1 });
     } else {
       for (var k in set){
         if (tokens.indexOf(k)<0) delete set[k];
       }
     }
   });
   return Object.keys(set);
}

This function is variadic, you can call it with any number of texts :

console.log(intersect(t1, t2, t3)) // -> ["Mary"] 

console.log(intersect(t1, t2)) // -> ["Mary", "from"] 

console.log(intersect()) // -> []

If you need to support non English languages, then this regex won't be enough because of the poor support of Unicode in JavaScript regexes. Either you use a regex library or you define your regex by explicitly excluding characters as in a.match(/[^\s\-.,!?]+/g); (this will probably be enough for you) .

Detailed explanation :

The idea is to fill a set with the tokens of the first text and then remove from the set the tokens missing in the other texts.

The set is a JavaScript object used as a map. Some purists would have used Object.create(null) to avoid a prototype, I like the simplicity of {}.
As I want my function to be variadic, I use arguments instead of defining the passed texts as explicit arguments.
arguments isn't a real array, so to iterate over it you need either a for loop or a trick like [].forEach.call. It works because arguments is "array-like".
To tokenize, I simply use match to match words, nothing special here (see note above regarding better support of other languages, though)
I use !i to check if it's the first text. In that case, I simply copy the tokens as properties in the set. A value must be used, I use 1. In the future, ES6 sets will make the intent more obvious here.
For the following texts, I iterate over the elements of the sets (the keys) and I remove the ones which are not in the array of tokens (tokens.indexOf(k)<0)
Finally, I return the elements of the sets because we want an array. The simplest solution is to use Object.keys.

"What does [].forEach.call() does in JavaScript?" http://stackoverflow.com/questions/16053357/what-does-foreach-call-does-in-javascript — Matthew Lock, May 28 '14 at 08:30
@MatthewLock Do you mean I should have explained that point ? — Denys Séguret, May 28 '14 at 08:31
It could have been one statement and a little less "verbose" with `reduce` instead of `forEach` but I prefer it to be more readable. — Denys Séguret, May 28 '14 at 08:34
@MatthewLock If there's another step that's unclear, please ask. — Denys Séguret, May 28 '14 at 08:34
The whole thing is unclear to me, could you explain it all? I'm sure others would find that helpful too. — Matthew Lock, May 28 '14 at 08:35

Intersecting texts to find common words

1 Answers1

Linked