8

I'm writing something that takes a block of text and breaks it down into possible database queries that could be used to find similar blocks of text. (something similar to the "similar questions" list being generated while I type this) The basic process:

  1. Remove stop words from text
  2. Remove special characters
  3. From remaining text create an array of unique "stems"
  4. Create an array of possible combinations of the array of stems (where I'm stuck... kind of)

Here's what I have so far:

    //baseList starts with an empty array
    //candList starts with the array of unique stems
    //target is where the arrays of unique combinations are stored

    function createUniqueCombos(baseList,candList,target){

    for(var i=0;i<candList.length;i++){         

        //copy the base List
        var newList = baseList.slice(0);

        //add the candidate list item to the base list copy
        newList.push(candList[i]);

        //add the new array to the target array
        target.push(newList);   

        //re-call function using new array as baseList
        //and remaining candidates as candList
        var nextCandList = candList.slice(i + 1);       
        createUniqueCombos(newList,nextCandList,target);
    }

}

This works, but on blocks of text larger than 25 words or so, it crashes my browser. I realize that mathematically there could be a huge number of possible combinations. What I'd like to know is:

  1. Is there a more efficient way to do this?
  2. How could I define a min/max combination array length?
Tomer
  • 17,787
  • 15
  • 78
  • 137
HartyeTech
  • 181
  • 5
  • This is a fantastic first question. Welcome to StackOverflow! Your browser will likely be crashing from the amount of memory used, or recursing too much. – Bojangles Jul 10 '12 at 13:54
  • Do you really need all combinations at once? Can't you process them instantly as you generate them instead of amassing huge array? Also try to rewrite your algorithm to iteration instead of recursion. – Oleg V. Volkov Jul 10 '12 at 13:57
  • Thanks, I've been a spectator for quite some time ;) @OlegV.Volkov No, I don't need all combinations I'd like to be able to define a min/max length for the returned combination arrays. Thanks for the iteration suggestion. – HartyeTech Jul 10 '12 at 14:06

3 Answers3

1

I think your logic is fundamentally flawed because of how many combinations you're creating.

An approach I'd take would be;

  1. Split the text into individual words (we'll call this variable split_words)
  2. Remove special characters
  3. Remove short/ common words (and, or, I, a); either do this by length, or more intelligently by a black list of words
  4. Have a table (e.g. blocks) which has columns block_id and word
  5. Have a SQL query such as

    SELECT block_id FROM blocks 
    WHERE word IN (split_words) GROUP BY block_id 
    ORDER BY COUNT(*) DESC
    

and then you'll have a list of block_ids which are ordered depending on how many words in common the blocks have.

Matt
  • 74,352
  • 26
  • 153
  • 180
  • Thanks for the reply. I'm already doing 1, 2, and 3 before it gets to this step. I'm dealing with a proprietary platform and database technology on the server side, and implementing a solution like you're suggesting is something I've considered. Unfortunately, breaking down the data I'll be searching into individual words won't be possible. – HartyeTech Jul 10 '12 at 14:14
1

Found this previous question: Algorithm to find articles with similar text

One of the answers provided a link to an article that suggests finding how many adjacent character pairs are contained in both strings. [ http://www.catalysoft.com/articles/StrikeAMatch.html ]

The example is in Java but I'm sure can be ported easily to JS:

/** @return an array of adjacent letter pairs contained in the input string */
private static String[] letterPairs(String str) {
   int numPairs = str.length()-1;
   String[] pairs = new String[numPairs];
   for (int i=0; i<numPairs; i++) {
       pairs[i] = str.substring(i,i+2);
   }
   return pairs;
}

/** @return an ArrayList of 2-character Strings. */
private static ArrayList wordLetterPairs(String str) {
   ArrayList allPairs = new ArrayList();
   // Tokenize the string and put the tokens/words into an array
   String[] words = str.split("\\s");
   // For each word
   for (int w=0; w < words.length; w++) {
       // Find the pairs of characters
       String[] pairsInWord = letterPairs(words[w]);
       for (int p=0; p < pairsInWord.length; p++) {
           allPairs.add(pairsInWord[p]);
       }
   }
   return allPairs;
}

/** @return lexical similarity value in the range [0,1] */
public static double compareStrings(String str1, String str2) {
   ArrayList pairs1 = wordLetterPairs(str1.toUpperCase());
   ArrayList pairs2 = wordLetterPairs(str2.toUpperCase());
   int intersection = 0;
   int union = pairs1.size() + pairs2.size();
   for (int i=0; i<pairs1.size(); i++) {
       Object pair1=pairs1.get(i);
       for(int j=0; j<pairs2.size(); j++) {
           Object pair2=pairs2.get(j);
           if (pair1.equals(pair2)) {
               intersection++;
               pairs2.remove(j);
               break;
           }
       }
   }
   return (2.0*intersection)/union;
}
Community
  • 1
  • 1
mcknz
  • 467
  • 9
  • 27
  • This is very cool. What I'm trying to do is "cast a net" to find other "articles" to do this kind of comparison with. Once I have my original question figured out, something like this will likely be the next step. – HartyeTech Jul 10 '12 at 14:30
0

Your problem could be easily solved with my binomial coefficient class. Take a look at the code from my answer to a somewhat related problem. I don't know if porting the C# code over to a SQL stored proc would be a good idea or not. It would probably be easier to port it over to java or js and call your stored procs from that code.

Community
  • 1
  • 1
Bob Bryan
  • 3,687
  • 1
  • 32
  • 45