MATLAB: Find distribution of n-grams in long string, very fast

Question

I have a string myLongText of about 300 MB. I also have a list of strings (stored as a cell array), myManyStrings, which consists of all N-grams for N=1 to 5.

What I want is a variable myOccurances with length(myManyStrings) entries, giving the number of times each string from myManyStrings appears in myLongText.

A straightforward version would be:

myOccurances=zeros(1,length(myManyStrings));
for i=1:length(myManyStrings)
  myOccurances(i)=length(strfind(myLongText,myManyStrings{i}));
end

But obviously, this solution is very slow. In an earlier version, myManyStringsOld consisted of individual words, so I was able to use

allSplit=strread(myLongText,'%s','delimiter',' ');
[allUnique,~,occIndex]=unique(allSplit);
myOccurancesOld = hist(occIndex,1:length(allUnique));
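For reference, the same word count can probably also be written with strsplit and accumarray, since strread and hist are no longer recommended in newer MATLAB releases. This is only a minimal sketch, assuming words are separated by single spaces; sampleText and wordCounts are made-up names for illustration:

```matlab
% Sketch: word counts without strread/hist.
% Assumption: words are separated by single spaces.
sampleText = 'to be or not to be';             % hypothetical sample input
allSplit = strsplit(sampleText, ' ');          % cell array of words
[allUnique, ~, occIndex] = unique(allSplit);   % occIndex(k) = id of k-th word
wordCounts = accumarray(occIndex(:), 1).';     % one count per entry of allUnique
```

Here wordCounts(k) is the number of times allUnique{k} occurs in the text.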

However, myManyStrings now also contains higher-order N-grams, and I don't see how to adapt my old (and surprisingly fast) method.

For example, now only for two-word combinations:

myLongText='Stack Overflow is a privately held website. In 2008, somebody created Stack Overflow.';
myManyStrings={'Stack', 'Overflow', 'is', 'a', 'privately', 'held', 'website', 'In', '2008', 'somebody', 'created', 'Stack Overflow', 'Overflow is', 'is a', 'a privately', 'privately held', 'held website', 'website in', 'in 2008', '2008 somebody', 'somebody created', 'created Stack'};

Therefore,

myOccurances=[2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1];

Do you know any fast method to produce my result?

To conclude - you are asking for a way to (ideally, simultaneously) find substrings consisting of up to 5 words within a very long string? Please clarify: is it possible for the same word(s) in `myLongText` to be part of several different `myManyStrings{k}`? E.g. `myLongText = 'This is a question about strings'` and `myManyStrings = {'This is a question','question about strings','question'}`. — Dev-iL, Mar 11 '16 at 02:47
Dear Dev-iL (cool name btw ;) ): Yes, you are right, the same words can appear several times in the list. I added an example above. — Mario Krenn, Mar 11 '16 at 10:59
@Dev-iL and rayryeng: I have specified my question now, it is not a duplicate of the question you mentioned (How to implement a spectrum kernel function in MATLAB?) — Mario Krenn, Mar 11 '16 at 11:16
Thanks :) If you claim it's not a duplicate, would you be so kind as to specify what makes your problem distinct enough for the solution presented therein not to be applicable? Sure, the title is different and instead of syllables you're looking for complete words. But it seems to me that the solution would be almost identical, and you could get it with minimal modifications to the code of the accepted answer to the linked question. — Dev-iL, Mar 11 '16 at 12:02
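One way the word-based unique trick might carry over to N-grams is sketched below. It is an untested idea rather than a confirmed answer. It rests on assumptions that are not in the question: the text is split on whitespace, punctuation is dropped, and matching is case-insensitive. It also counts whole-word matches only, so it will not always agree with strfind, which matches arbitrary substrings. The names maxN, normalize, vocab, wordIdx and textGrams are made up for illustration, and only a few of the example strings are used.

```matlab
% Sketch: count word N-grams (N = 1..maxN) with one unique() call per N.
% Assumptions (not from the question): whitespace tokenization,
% punctuation removed, case-insensitive matching, whole words only.
myLongText = 'Stack Overflow is a privately held website. In 2008, somebody created Stack Overflow.';
myManyStrings = {'Stack', 'Stack Overflow', 'website in', 'created Stack', 'not there'};
maxN = 5;

% Normalize and tokenize the long text once, then map words to integer ids.
normalize = @(s) strsplit(strtrim(lower(regexprep(s, '[^\w\s]', ''))));
words = normalize(myLongText);
[vocab, ~, wordIdx] = unique(words);
wordIdx = wordIdx(:);
numWords = numel(wordIdx);

% Tokenize the query strings in the same way.
queries = cellfun(normalize, myManyStrings, 'UniformOutput', false);
queryLen = cellfun(@numel, queries);

myOccurances = zeros(1, numel(myManyStrings));
for N = 1:min(maxN, numWords)
    sel = find(queryLen == N);            % queries that contain exactly N words
    if isempty(sel), continue; end

    % Each row of textGrams holds the ids of N consecutive words of the text.
    pos = bsxfun(@plus, (1:numWords-N+1).', 0:N-1);
    textGrams = reshape(wordIdx(pos), [], N);

    % Count every distinct N-gram of the text.
    [uniqueGrams, ~, gramIdx] = unique(textGrams, 'rows');
    gramCounts = accumarray(gramIdx(:), 1);

    % Translate query words to ids; 0 marks a word that is not in the text.
    queryWords = vertcat(queries{sel});   % numel(sel)-by-N cell array
    [~, queryGrams] = ismember(queryWords, vocab);
    queryGrams = reshape(queryGrams, size(queryWords));

    % Look up the count of each query N-gram; unmatched queries stay at 0.
    [tf, loc] = ismember(queryGrams, uniqueGrams, 'rows');
    myOccurances(sel(tf)) = gramCounts(loc(tf));
end
```

The loop runs once per N instead of once per query string, which is where a speed-up would be expected to come from. For a 300 MB text, the index matrix pos and the row-wise unique call need a lot of memory, so the text may have to be processed in chunks.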
