If you want a simple approach, consider a sliding window technique. An implementation is available in more_itertools, so you don't have to write one yourself. It is easy to use once you pip install more_itertools.
Simple Example
>>> from collections import Counter
>>> import more_itertools
>>> s = "GATCCAGATCCCCATAC"
>>> Counter(more_itertools.windowed(s, 2))
Counter({('A', 'C'): 1,
         ('A', 'G'): 1,
         ('A', 'T'): 3,
         ('C', 'A'): 2,
         ('C', 'C'): 4,
         ('G', 'A'): 2,
         ('T', 'A'): 1,
         ('T', 'C'): 2})
The above example demonstrates how little is required to get most of the information you want using windowed and Counter.
Description
A "window" or container of length k=2 slides across the sequence one position at a time (step=1, the default). Each new group is added as a key to the Counter dictionary, and the tally for that key is incremented on every occurrence. The final Counter object reports all tallies and offers other helpful features, such as most_common().
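To make the sliding mechanics concrete, here is a minimal standard-library sketch of the same idea. The name sliding_windows is hypothetical and this is not how more_itertools implements windowed; in particular, this sketch yields nothing when the sequence is shorter than k, whereas windowed pads the single window with a fill value.

```python
from collections import Counter, deque
from itertools import islice


def sliding_windows(seq, k=2):
    """Yield successive length-k tuples from seq, advancing one item at a time.

    Hypothetical helper for illustration; yields nothing if seq has fewer
    than k items.
    """
    it = iter(seq)
    window = deque(islice(it, k), maxlen=k)  # fill the first window
    if len(window) == k:
        yield tuple(window)
    for item in it:
        window.append(item)  # maxlen=k drops the oldest item automatically
        yield tuple(window)


pair_counts = Counter(sliding_windows("GATCCAGATCCCCATAC", 2))
print(pair_counts[("C", "C")])  # 4, the same tally as in the example above
```

Each tuple produced by the generator becomes a Counter key, so the result matches the windowed-based tally for this input.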
Final Solution
If you need actual string pairs rather than tuples, that is simple too. We will make a general function that joins each window into a string and works for k-mers of any length:
>>> from collections import Counter
>>> import more_itertools
>>> def count_mers(seq, k=1):
...     """Return a counter of adjacent mers."""
...     return Counter("".join(mers) for mers in more_itertools.windowed(seq, k))
>>> s = "GATCCAGATCCCCATAC"
>>> count_mers(s, k=2)
Counter({'AC': 1,
         'AG': 1,
         'AT': 3,
         'CA': 2,
         'CC': 4,
         'GA': 2,
         'TA': 1,
         'TC': 2})
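If installing a third-party package is not an option, string slicing gives the same tallies for plain strings. The following is a rough sketch under that assumption; count_mers_sliced is a hypothetical name, and unlike the windowed-based version it only works on sliceable sequences such as str, not on arbitrary iterables.

```python
from collections import Counter


def count_mers_sliced(seq, k=1):
    """Return a Counter of adjacent k-mers in a string, using slicing only.

    Hypothetical stdlib-only alternative; returns an empty Counter when
    seq is shorter than k.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # One slice per start position; the last valid start is len(seq) - k.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


s = "GATCCAGATCCCCATAC"
print(count_mers_sliced(s, k=2)["CC"])   # 4
print(count_mers_sliced(s, k=3)["CCC"])  # 2
```

Because the function is parameterized on k, the same call covers pairs, triplets, or longer k-mers.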