I have a string myLongText that is about 300 MB in size. I also have a list of strings (stored in a cell array), myManyStrings, which consists of all N-grams for N=1 to 5.
What I want now is a variable myOccurances with length(myManyStrings) entries that gives the number of times each string from myManyStrings appears in myLongText.
A straight-forward version would be:
myOccurances=zeros(1,length(myManyStrings));
for i=1:length(myManyStrings)
myOccurances(i)=length(strfind(myLongText,myManyStrings{i}));
end
But obviously, this solution is very slow, since it scans the whole text once per string. In an earlier version, myManyStringsOld consisted of individual words only, so I was able to use
allSplit=strread(myLongText,'%s','delimiter',' ');
[allUnique,~,occIndex]=unique(allSplit);
myOccurancesOld = hist(occIndex,1:length(allUnique));
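To make the word-based method concrete, here is a minimal sketch on a made-up toy string. Note that it swaps in strsplit and accumarray for strread and hist, which is an assumption about the available MATLAB version (strsplit needs R2013a or later); the names sampleText, tokens, uniqueTokens and counts are only illustrative.

```matlab
% Toy illustration of the word-count idea (sampleText is made up).
sampleText = 'a b a c b a';

% Split on spaces; strsplit stands in for strread here.
tokens = strsplit(sampleText, ' ');

% idx(k) is the position of tokens{k} in the sorted list of unique tokens.
[uniqueTokens, ~, idx] = unique(tokens);

% Count how often each index occurs; accumarray stands in for hist.
counts = accumarray(idx(:), 1)';

% uniqueTokens is {'a','b','c'} and counts is [3 2 1].
```

The text is tokenized and sorted once, so the cost does not grow with the number of words being counted, which is presumably why this approach is so much faster than calling strfind in a loop.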
However, myManyStrings now also contains higher-order N-grams, and I don't see how I can adapt my old (and surprisingly fast) method.
For example, with only one- and two-word combinations:
myLongText='Stack Overflow is a privately held website. In 2008, somebody created Stack Overflow.';
myManyStrings={'Stack', 'Overflow', 'is', 'a', 'privately', 'held', 'website', 'In', '2008', 'somebody', 'created', 'Stack Overflow', 'Overflow is', 'is a', 'a privately', 'privately held', 'held website', 'website In', 'In 2008', '2008 somebody', 'somebody created', 'created Stack'};
Therefore,
myOccurances=[2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1];
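Note that these counts treat the text as a sequence of words with punctuation ignored, not as raw substrings. One direction that might extend the word-based method is sketched below for N=1 and N=2 only. It is a sketch under assumptions, not a verified solution: it assumes that stripping '.' and ',' is acceptable, that strsplit is available (R2013a or later), and that the token and bigram lists fit in memory, which is untested at the 300 MB scale. The names queryStrings, cleanText, tokens, bigrams and gramCounts are hypothetical.

```matlab
% Sketch: count unigrams and bigrams as word sequences, not substrings.
myLongText = 'Stack Overflow is a privately held website. In 2008, somebody created Stack Overflow.';
queryStrings = {'Stack', 'is', 'Stack Overflow', 'held website', 'not present'};

% Assumption: punctuation is removed so that 'website.' counts as 'website'.
cleanText = regexprep(myLongText, '[.,]', '');
tokens = strsplit(cleanText, ' ');

% Build every bigram by joining each token with its successor.
bigrams = cellfun(@(a, b) [a ' ' b], tokens(1:end-1), tokens(2:end), ...
                  'UniformOutput', false);

% Count all unigrams and bigrams in one pass via unique + accumarray.
allGrams = [tokens, bigrams];
[uniqueGrams, ~, idx] = unique(allGrams);
gramCounts = accumarray(idx(:), 1);

% Look up the requested strings; strings that never occur keep a count of 0.
[found, pos] = ismember(queryStrings, uniqueGrams);
myOccurances = zeros(1, numel(queryStrings));
myOccurances(found) = gramCounts(pos(found));

% myOccurances is [2 1 2 1 0] for the queryStrings above.
```

Higher N would need the same joining step repeated with tokens(1:end-2), tokens(2:end-1), tokens(3:end) and so on, which multiplies the memory footprint, so this may or may not be practical for N up to 5.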
Do you know of a fast method to produce this result?