1

Does anyone know how to find subsequence patterns that occurs with mismatches?

For mismatch kernel function, it allows m mismatches. For example, 'tool' has three 3-grams ('too', 'ool'), and mismatch kernel function will count 'aoo','boo',...,'zoo','tao'...'tzo','toa'...'toz',.... when the m is 1. For specific explanation, please see http://www.cs.columbia.edu/~cleslie/cs4761/papers/string-kernel-slides.pdf Just see page 12-17

How can I write a MATLAB function that could calculate this metric?

Thanks so much.

Alisa W
  • 19
  • 4
  • 2
    You mention a research paper and link to a 60-page slide set, and ask how to implement something in Matlab. For anyone wanting to answer your question, this is quite a lot of material to work with. A better approach is to attempt to do this yourself, and ask questions once you get stuck somewhere *specific* that is more easily summarized, and post the related code. – mikkola Apr 03 '16 at 09:40
  • Thanks for your suggestions. I have deleted the paper and denoted the specific range of the slide. I thought if someone else had seen this paper, it would be much easier to answer this question. And the main problem is that how to implement the previous spectrum string function within tolerable mismatches. I have read many references which talked about suffix tree, or lowest common ancestor, but I still don't have any thought, so I put forward this question. – Alisa W Apr 03 '16 at 11:37

1 Answers1

0

The solution I'm going to propose is actually based on a bruteforce approach (no fancy mismatch trees).
If you're dealing with nucleotides you have just 4 symbols and that's fine, however for large alphabets and/or large values for k it might be quite memory-inefficient.
The good news is though that it does not contain any loops and every statement is vectorized, so it works pretty fast.

First of all, I suppose you know how to select k-mers; if not, the solution proposed here (function n_gram() by gnovice) works brilliantly.
I don't like "n-gram", I prefer "k-mers" so I'll use that hereinafter to indicate substrings of length k.

Second, let's declare some variables:

m=1; k=3;
alphabet='abcdefghijklmnopqrstuvwxyz';
kmerUT='too';

where m is the distance (as in the slides), alphabet is rather self-explanatory (is the set of all the possible values), kmerUT is the k-mer under test (i.e. the k-mer whose distance we want to calculate) and k is the length of the k-mer.

Third, let's calculate all the possible combinations of k symbols from alphabet:

C = cell(k, 1);                     %// Preallocate a cell array
[C{:}] = ndgrid(alphabet);          %// Create K grids of values
combs = cellfun(@(x){x(:)}, C);     %// Convert grids to column vectors
combs = sortrows([combs{:}]);       %// Obtain all permutations

This snippet, in order, preallocates a cell-array with k cells. After the second expression, in these 3 cells there will be the values from alphabet with "different period" (1st cell: abcdefh... ; 2nd cell has 26 times a, 26 times b and so on ; 3rd cell has 2*26 times a, 2*26 times b and so on...) with "26" because they are the letters of the alphabet. The final line unwraps the cell array C into a matrix and then sort the combinations in alphabetical order. Credits to Eithan T.
Note: if you're running low on memory after this step (which is the most expensive, in terms of memory) you can as well remove the cell array C from main memory thanks to the command clear C. We won't need such variable from now on.

Fourth, compare every row in combs with the target k-mer:

fun=@(a,b) (a~=b);
comparisons=bsxfun(fun,combs,kmerUT);

so we declare a function fun that simply returns a logical vector with 1 in position j if a and b are different in position j and then apply (thanks to bsxfun()) such function to all the rows in combs. In other words we compare every row with the k-mer under test. comparisons will therefore be a logical matrix where ever row is the result of such comparisons.
Note: combs is another variable which can potentially take a lot of memory. Since we won't need it from now on you can as well clear combs.

Fifth, count the number of not-equal symbols for every tested combination:

counter=sum(comparisons,2);

having comparisons as a matrix of 1s and 0s, one can simply sum every row to get the number of 1s per row (i.e. the number of different symbols per row).
Note: counteris a vector whose size is card(alphabet)^k where card(alphabet) is the number of values in alphabet, because it must contain values for all possible combinations. Such vector might as well be huge in some cases and taking into account that every item in such vector takes 8 Bytes, the whole vector might take a good amount of memory. Now if you've cleared C and combs it shouldn't be a problem, but just for the sake of completeness you might want to cast counter using an integer type that takes less then 8 Bytes per item. More infos about numeric types in Matlab can be found here.

Sixth, count the number of combinations within m mismatches from kmerUT:

result=sum(counter<=m);
Community
  • 1
  • 1
AlessioX
  • 3,167
  • 6
  • 24
  • 40
  • Thanks for your answer. It's great. I thought it would use some method like suffix tree. But it turns out this is great, except it cost too much space. – Alisa W Apr 05 '16 at 09:26
  • @AlisaW, I absolutely agree. For large alphabets and/or large values for `k` this method can take a lot of memory. You can perform a preliminary analysis taking into account that the number of combinations `N` is *card(alphabet)^k* where *card(alphabet)* is the number of values in alphabet and *k* is the length of k-mers. Every combination will have length `k` of course therefore `N*k` is the size of the matrix that will contain all of your combinations. – AlessioX Apr 05 '16 at 09:42