Calculate transition matrix of letters

Question

How would I manage to create a transition matrix of letters?

I have a list of letters like so:

[u'T', u'i', u'r', u's', u'd', u'a', u'g', u' ', u's', u'k', u'a', u'l', u' ', u'd', u'u', u' ', u'i', u'n', u's', u't', u'a', u'l', u'l', u'e', u'r', u'e', u' ', u'e', u'n', u' ', u'P', u'y', u't', u'h', u'o', u'n', u' ', u'f', u'o', u'r', u't', u'o', u'l', u'k', u'e', u'r', u',', u' ', u'o', u'g', u' ', u'l',u'P', u'l', u'a', u'n', u' ', u'f', u'o', u'r', u' ', u'u', u'g', u'e', u'n', u'D', u'e', u'n', u'n', u'e', u' ', u'u', u'g', u'e', u' ', u'd', u'r', u'e', u'j', u'e', u'r', u' ', u's', u'i', u'g', u' ', u'o', u'm', u' ', u'a', u't', u' ', u'k', u'o', u'm', u'm', u'e', u' ', u'i', u'g', u'a', u'n', u'g', u' ', u'm', u'e', u'd', u' ', u'P', u'y', u't', u'h', u'o', u'n', u'.', u' ', u' ', u'T', u'i', u'r', u's', u'd', u'a', u'g', u' ', u's', u'k', u'a', u'l', u' ', u'd', u'u', u' ', u'i', u'n', u's', u't', u'a', u'l', u'l', u'e', u'r', u'e', u' ', u'e', u'n', u' ', u'P', u'y', u't', u'h', u'o', u'n', u' ', u'f', u'o', u'r', u't', u'o', u'l', u'k', u'e', u'r', u',', u' ', u'o', u'g', u' ', u'l', u'b', u'r', u'e', u' ', u'd', u'e', u'n', u'n', u'e', u' ', u'a', u't', u' ', u'k', u'e', u'n', u'd', u'e', u' ', u'v', u'e', u'd', u' ', u'a', u't', u' ', u'k', u'b', u'r', u'e', u' ', u'n', u'o', u'g', u'l', u'e', u' ', u'p', u'r', u'o', u'g', u'r', u'a', u'm', u'm', u'e', u'r', u'.', u' ', u' ', u'I', u'P', u'y', u't', u'h', u'o', u'n', u' ', u'k', u'a', u'n', u' ', u'a', u'n', u'b', u'e', u'f', u'a', u'l', u'e', u's', u' ', u'd', u'a', u' ', u'd', u'e', u'n', u'n', u'e', u' ', u'f', u'i', u'n', u'd', u'e', u's', u' ', u't', u'i', u'l', u' ', u'L', u'i', u'n']

How would I create a transition matrix based on this list of letters? I have the following code from the question "Python transition matrix":

def tmatrix(self, lst):
    b = [[0 for _ in xrange(len(lst))] for _ in xrange(len(lst))]
    for (x,y), c in Counter(zip(lst, lst[1:])).iteritems():
        b[x-1][y-1] = c
    return b

But I get the following error, since I have a list of unicode objects instead of ints: `TypeError: unsupported operand type(s) for -: 'unicode' and 'int'`. How would I convert the code to support unicode objects?

So, `['a', 'b', 'c']` should be `{('a', 'b'): 1, ('b', 'c'): 1}` ? — Jon Clements, Jan 18 '15 at 19:33
"How would I create a transition matrix based on this list of letters" - based on which rules ??? — Nir Alfasi, Jan 18 '15 at 19:34
The code you link to is counting on the sequences using integers. — Martijn Pieters, Jan 18 '15 at 19:37
@Smith: This is the desired output for the input in your question? I'm struggling to make sense of this. Please work on improving the clarity and completeness of your question. — NPE, Jan 18 '15 at 19:39
I am not sure about the output. I just want to accomplish this: how often is an A followed by an A, how often is an A followed by a B, how often is an A followed by a C, and so on. — fae53351b9effc708e764e871bef31, Jan 18 '15 at 19:43

Martijn Pieters · Accepted Answer · 2015-01-20T13:28:27.787

The code you link to is counting on the sequences using integers. The integers can then readily be transformed to indexes into the transition matrix (1 is translated to index 0, etc.).

The algorithm you linked to also sizes the matrix by the number of unique elements, not by the length of the list; the matrix built there is 3 by 3, not 10 by 10.
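To make the integer-to-index translation concrete, here is a minimal sketch of the integer case. It assumes Python 3 and states numbered from 1; the function name `tmatrix_ints` and the sample list are invented for illustration and are not taken from the linked question.

```python
from collections import Counter

def tmatrix_ints(lst):
    """Count transitions between integer states numbered 1..n."""
    n = max(lst)  # assumption: states are numbered 1..n
    matrix = [[0] * n for _ in range(n)]
    # zip(lst, lst[1:]) pairs every element with its successor
    for (x, y), c in Counter(zip(lst, lst[1:])).items():
        matrix[x - 1][y - 1] = c  # state 1 maps to index 0, and so on
    return matrix

print(tmatrix_ints([1, 1, 2, 3, 1, 2]))
# [[1, 2, 0], [0, 0, 1], [1, 0, 0]]
```

The list has six items but only three distinct states, so the matrix is 3 by 3.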

You'd have to do the same for your input list:

from collections import Counter, defaultdict
from itertools import count

def tmatrix(self, lst):
    # defaultdict that'll produce a unique index for each unique character
    # encountered in lst
    indices = defaultdict(count().next)
    unique_count = len(set(lst))
    b = [[0 for _ in xrange(unique_count)] for _ in xrange(unique_count)]
    for (x, y), c in Counter(zip(lst, lst[1:])).iteritems():
        b[indices[x]][indices[y]] = c
    return b

Here the indices dictionary maps each character to its row and column index in the matrix; an itertools.count() instance provides an auto-incrementing integer value for any character not already in the dictionary.
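The index-assignment trick can be tried in isolation. This is a minimal sketch assuming Python 3, where the `count().next` used above is spelled `count().__next__`; the sample word is made up for illustration.

```python
from collections import defaultdict
from itertools import count

# Each missing key triggers the default factory, which returns the next
# integer from the counter: 0, 1, 2, ...
indices = defaultdict(count().__next__)

for ch in "banana":  # illustrative input, not from the question
    indices[ch]      # a plain lookup is enough to assign an index

print(dict(indices))  # {'b': 0, 'a': 1, 'n': 2}
```

Characters get their index in order of first appearance, which is why the rows of the matrix are not in alphabetical order.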

This produces a 29 by 29 matrix for your input sample:

>>> tmatrix(None, sample)
[[0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 2, 0, 0, 0, 2, 0, 0, 1, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 0, 0, 2, 5, 0, 0, 0, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 2, 0, 0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 3, 0, 2, 0, 0, 2, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 2, 1, 0, 5, 0, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 1, 0, 6, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 3, 0, 3, 6, 4, 0, 2, 4, 2, 2, 1, 1, 2, 3, 0, 0, 3, 4, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0, 3, 2, 2, 0, 0, 0, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 2, 2, 0, 1, 7, 0, 0, 0, 3, 0, 3, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 2, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 6, 2, 2, 0, 0, 11, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 3, 0, 0, 0, 4, 0, 0, 2, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

You probably want to return the indices mapping too, so you know what character mapped to what index in that matrix.

@Smith: use `return b, indices`; the caller then receives a tuple of the resulting matrix and the character-to-index mapping. You can use tuple assignment to unpack that again: `matrix, indices = tmatrix(characters)`. — Martijn Pieters, Jan 18 '15 at 20:44
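As a sketch of how the returned mapping might be used, the following Python 3 variant drops the unused `self` parameter and returns both values. The input string is invented for illustration, and this is one possible adaptation rather than the answer's exact code.

```python
from collections import Counter, defaultdict
from itertools import count

def tmatrix(lst):
    """Return (matrix, indices) for the transitions in lst."""
    indices = defaultdict(count().__next__)
    n = len(set(lst))
    matrix = [[0] * n for _ in range(n)]
    for (x, y), c in Counter(zip(lst, lst[1:])).items():
        matrix[indices[x]][indices[y]] = c
    # return a plain dict so later lookups cannot add new keys by accident
    return matrix, dict(indices)

matrix, indices = tmatrix(list("abab"))
# how often is 'a' followed by 'b'?
print(matrix[indices["a"]][indices["b"]])  # 2
print(matrix[indices["b"]][indices["a"]])  # 1
```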

score 1 · Answer 2 · answered Jan 18 '15 at 19:45

You can pairwise the string (which looks like it was originally Danish), then use a Counter as a sparse matrix with a (from, to) tuple as the key:

from collections import Counter
from itertools import tee, izip

data = 'Tirsdag skal du installere en Python fortolker, og lPlan for ugenDenne uge drejer sig om at komme igang med Python.  Tirsdag skal du installere en Python fortolker, og lbre denne at kende ved at kbre nogle programmer.  IPython kan anbefales da denne findes til Lin'
fst, snd = tee(data)
next(snd, '')
matrix = Counter(izip(fst, snd))

Then, to get the number of a->b transitions, use matrix['a', 'b'], and so on. For keys that don't exist, you'll automatically get back 0. If you absolutely want a 2D array of N x N, then use @Martijn's answer.
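For readers on Python 3, where `izip` no longer exists and the built-in `zip` is already lazy, the same idea can be sketched as follows. The short input string is invented for illustration, and slicing is used instead of `tee` because the input is a string.

```python
from collections import Counter

data = "abcab"  # illustrative input, not the question's text

# zip(data, data[1:]) pairs each character with its successor
matrix = Counter(zip(data, data[1:]))

print(matrix["a", "b"])  # 2
print(matrix["b", "c"])  # 1
print(matrix["c", "b"])  # 0, a missing key counts as zero
```

Looking up a missing key in a Counter returns 0 without storing the key, so the structure stays sparse.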

ic_fl2 · Answer 3 · 2016-07-13T13:25:01.233

This is an ordered and normalised version of @Martijn Pieters' answer:

from collections import Counter
import numpy as np


def tmatrix(lst):
    """Sorted and normalised transition matrix
    """
    # assign indices in sorted character order so rows and columns are ordered
    chars = sorted(set(lst))
    indices = dict((ch, i) for i, ch in enumerate(chars))
    b = np.zeros([len(chars), len(chars)])

    Ct = Counter(zip(lst, lst[1:]))  # zip together consecutive elements of the list

    for (x, y), c in Ct.items():
        b[indices[x]][indices[y]] = float(c)

    # Normalise each row to sum to 1; rows without transitions stay all zero
    sums = b.sum(axis=1)
    for i in range(len(b)):
        if sums[i]:
            b[i] = b[i] / sums[i]

    return b, indices
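The normalisation step turns raw counts into row-wise probabilities. Here is a dependency-free sketch, assuming Python 3 and plain lists instead of NumPy arrays; the helper name `normalise_rows` and the counts are made up for illustration. A row of zeros (a character with no successor, such as one that appears only at the very end of the input) needs a guard to avoid dividing by zero.

```python
def normalise_rows(counts):
    """Divide each row by its sum; rows that sum to zero are left as zeros."""
    result = []
    for row in counts:
        total = sum(row)
        if total:
            result.append([c / total for c in row])
        else:
            result.append([0.0 for _ in row])
    return result

counts = [[0, 2, 2],
          [1, 0, 3],
          [0, 0, 0]]  # made-up counts; the last row has no outgoing transitions
probs = normalise_rows(counts)
print(probs[0])  # [0.0, 0.5, 0.5]
print(probs[1])  # [0.25, 0.0, 0.75]
```

Each non-empty row then sums to 1, so entry (i, j) can be read as the estimated probability that character i is followed by character j.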
