3

A spectrum kernel function operates on strings by counting the same n-grams in between two strings. For example, 'tool' has three 2-grams ('to', 'oo', and 'ol'), and the similarity between 'tool' and 'fool' is 2. ('oo' and 'ol' in common).

How can I write a MATLAB function that could calculate this metric?

gnovice
  • 125,304
  • 15
  • 256
  • 359
  • 1
    It sounds to me like you are interested in creating a Kernel function for some other purpose, say, for a support vector machine classifier. For that, the hamming distance alone does not take full advantage of the information available. Look to the source links here: http://en.wikipedia.org/wiki/String_kernel Good luck! I am trying to do the same thing. –  Aug 07 '11 at 02:23

2 Answers2

2

The first step would be to create a function that can generate an n-gram for a given string. One way to do this in a vectorized fashion is with some clever indexing.

function [subStrings, counts] = n_gram(fullString, N)
  if (N == 1)
    [subStrings, ~, index] = unique(cellstr(fullString.'));  %.'# Simple case
  else
    nString = numel(fullString);
    index = hankel(1:(nString-N+1), (nString-N+1):nString);
    [subStrings, ~, index] = unique(cellstr(fullString(index)));
  end
  counts = accumarray(index, 1);
end

This uses the function HANKEL to first create a matrix of indices that will select each set of unique N-length substrings from the given string. Indexing the given string with this index matrix will create a character array with one N-length substring per row. The function CELLSTR then places each row of the character array into a cell of a cell array. The function UNIQUE then removes repeated substrings, and the function ACCUMARRAY is used to count the occurrences of each unique substring (if they are needed for any reason).

With the above function you can then easily count the number of n-grams shared between two strings using the INTERSECT function:

subStrings1 = n_gram('tool',2);
subStrings2 = n_gram('fool',2);
sharedStrings = intersect(subStrings1,subStrings2);
nShared = numel(sharedStrings);
gnovice
  • 125,304
  • 15
  • 256
  • 359
  • You should add subStrings = intersect(subStrings, subStrings); to the function n_gram to avoid repetition. For instance, n_gram('hubbub', 2) would have a repeated 'ub' member in the original formulation. – mtrw Jul 27 '09 at 23:16
  • @mtrw: Good point, but the function UNIQUE would work better. I'll update accordingly. – gnovice Jul 27 '09 at 23:29
  • Oof. Yeah, UNIQUE would be quite a bit more appropriate. – mtrw Jul 27 '09 at 23:54
-1

What you're looking for is called the Hamming distance, you can get a better description of it if you do doc pdist.

A=['Marcin'; 'Martin'; 'Marsha']  %data

squareform(pdist(A, 'hamming'))  returns

         0    0.1667    0.5000

    0.1667         0    0.5000

    0.5000    0.5000         0

This form shows you how many letters are different. Difference between 'Marcin' and 'Martin' is 1 out of 6 letters, so you get 1/6=0.1667 'Marcin' vs 'Marsha' has 3 of 6, so 3/6=0.5
If you want the actual number of letters that are different, just multiply the whole matrix by length(A).

Marcin
  • 3,437
  • 1
  • 22
  • 15