Frequent words in Python

Question

How can I write code to find the most frequent 2-mer of "GATCCAGATCCCCATAC"? I have written this code, but it seems to be wrong. Please help me correct it.

def PatternCount(Pattern, Text):
    count = 0
    for i in range(len(Text)-len(Pattern)+1):
        if Text[i:i+len(Pattern)] == Pattern:
            count = count+1
    return count

This code is meant to find the most frequent k-mer in a string, but it doesn't give me the most frequent 2-mer for the given string.

Please [edit] your question and include possible values for `Pattern` and `Text`. Hint: In Python, function and variable names don't start with an uppercase letter. Those are reserved for class names. — , Dec 14 '16 at 15:36
Your question assumes that we know what a 2-mer is. Unfortunately, without knowing what a 2-mer is, it's really hard to tell you how to find the most frequent one. — mgilson, Dec 14 '16 at 15:40
A 2-mer is a word of length 2 in the given string; in general, a word of length k is called a k-mer. For example, "ACTAT" is a most frequent 5-mer for Text = "ACAACTATGCATACTATCGGGAACTATCCT". — shahzad fida, Dec 14 '16 at 15:43
That is the code I wrote for finding the most frequently repeated word in a string. E.g., in "CGATATATCCATAG" the most frequent word is "ATA", so that is the Pattern and the given string is the Text. — shahzad fida, Dec 14 '16 at 15:46
I wrote that code for finding the k-mer, but I don't know whether it will help me with the given question, so please help me correct it. — shahzad fida, Dec 14 '16 at 15:47
@shahzadfida -- Ok... If I have the string `AAAA`, is the 2-mer `AA` repeated 2 times or 3 times? i.e. can they overlap? — mgilson, Dec 14 '16 at 15:50
@shahzadfida Look at my answer. It's quite simple, and it shows how it deals with overlapping k-mers (i.e. it counts them). — MMF, Dec 14 '16 at 16:17
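
The overlap question raised in the comments matters because Python's built-in `str.count` counts non-overlapping matches only, while the loop in the question counts every start position. A minimal sketch of the difference, reusing the question's logic under the hypothetical lowercase name `pattern_count`:

```python
def pattern_count(pattern, text):
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    # Try every start position where the pattern could still fit.
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            count += 1
    return count

print(pattern_count("AA", "AAAA"))  # overlapping matches: 3
print("AAAA".count("AA"))           # str.count skips overlaps: 2
```

So the function in the question already counts overlapping occurrences of one given pattern; what it does not do is decide which pattern is the most frequent.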

MMF · Answer 1 · 2016-12-14T17:36:05.993

7

You can first define a function to get all the k-mers in your string:

def get_all_k_mer(string, k=1):
    length = len(string)
    # range works on Python 2 and 3; on Python 2, xrange avoids building a list
    return [string[i:i + k] for i in range(length - k + 1)]

Then you can use collections.Counter to count the repetition of each k-mer:

>>> from collections import Counter
>>> s = 'GATCCAGATCCCCATAC'
>>> Counter(get_all_k_mer(s, k=2))

Output :

Counter({'AC': 1,
         'AG': 1,
         'AT': 3,
         'CA': 2,
         'CC': 4,
         'GA': 2,
         'TA': 1,
         'TC': 2})

Another example :

>>> s = "AAAAAA"
>>> Counter(get_all_k_mer(s, k=3))

Output :

Counter({'AAA': 4})
# Indeed : AAAAAA
           ^^^     -> 1st time
            ^^^    -> 2nd time
             ^^^   -> 3rd time
              ^^^  -> 4th time
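
To go from these counts to the most frequent k-mer itself, one option is to take the maximum count and keep every k-mer that reaches it, so that ties are not lost. This is a minimal sketch assuming Python 3; the function name `most_frequent_k_mers` is made up for illustration and is not part of the answer above:

```python
from collections import Counter

def most_frequent_k_mers(text, k):
    """Return (k-mers, count) for the k-mers with the highest overlapping count."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        # text is shorter than k, so there is nothing to count
        return [], 0
    best = max(counts.values())
    # Keep every k-mer that reaches the maximum, sorted for a stable result.
    return sorted(kmer for kmer, n in counts.items() if n == best), best

print(most_frequent_k_mers("GATCCAGATCCCCATAC", 2))  # (['CC'], 4)
```

`Counter.most_common(1)` would also work when a single winner is enough, but it returns only one entry even if several k-mers are tied.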

edited Dec 14 '16 at 17:36

answered Dec 14 '16 at 15:52


Sorry about the downvote, but I corrected the code for Python 3: there is no xrange in Python 3, it is simply range. – shahzad fida Dec 14 '16 at 16:22
In Python 2, `range` and `xrange` give the same results but are not the same kind of object. `xrange` is lazy, a bit like a generator, while `range` builds the whole list in memory. So if you have a very long string, you should prefer `xrange` in Python 2. – MMF Dec 14 '16 at 16:25
Pythonic. I would mention something about `xrange` being in Python 2 and resolving it with `range` in 3. – pylang Dec 15 '16 at 02:05
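
For readers on Python 3, where `xrange` no longer exists: `range` there is a lazy sequence object rather than a list, so the memory concern from the comments does not arise. A small check, assuming Python 3:

```python
big = range(10**9)            # no billion-element list is built here

print(isinstance(big, list))  # False: it is a range object, not a list
print(len(big))               # 1000000000
print(big[123456789])         # 123456789, computed on demand
```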

Patrick Haugh · Answer 2 · 2016-12-14T15:52:41.607

3

In general, when I want to count things with Python, I use a Counter:

from itertools import tee
from collections import Counter

dna = "GATCCAGATCCCCATAC"
a, b = tee(iter(dna), 2)
_ = next(b)
c = Counter(''.join(l) for l in zip(a,b))
print(c.most_common(1))

This prints [('CC', 4)]: a list containing the single most common 2-mer as a tuple together with its count in the string.

In fact, we can generalize this to find the most common n-mer for a given n.

from itertools import tee, islice
from collections import Counter

def nmer(dna, n):
    iters = tee(iter(dna), n)
    iters = [islice(it, i, None) for i, it in enumerate(iters)]
    c = Counter(''.join(l) for l in zip(*iters))
    return c.most_common(1)
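
A usage sketch for the function above, repeated here so the snippet runs on its own under Python 3. The generator `bases` is a hypothetical stand-in for input that cannot be sliced, such as letters streamed from a file:

```python
from itertools import tee, islice
from collections import Counter

def nmer(dna, n):
    # n independent iterators over the same input
    iters = tee(iter(dna), n)
    # Skip i items on the i-th iterator so they run 0, 1, ..., n-1 letters ahead.
    iters = [islice(it, i, None) for i, it in enumerate(iters)]
    # zip stops at the shortest iterator, so only full n-mers are produced.
    c = Counter(''.join(l) for l in zip(*iters))
    return c.most_common(1)

print(nmer("GATCCAGATCCCCATAC", 2))   # [('CC', 4)]

bases = (ch for ch in "GATCCAGATCCCCATAC")  # a generator: no indexing or slicing
print(nmer(bases, 2))                 # [('CC', 4)]
```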

edited Dec 14 '16 at 15:52

answered Dec 14 '16 at 15:46


You shouldn't need a `tee` here. That'll double the intermediate storage that is necessary -- However, strings can be iterated over multiple times so you can just do `a, b = iter(dna), iter(dna)`. – mgilson Dec 14 '16 at 15:51
I have trouble understanding the above code. Please elaborate, as I do not know what a, b, c and _ stand for. – shahzad fida Dec 14 '16 at 15:55
But if we edit the code to a, b = iter(dna), iter(dna), it still gives the same result. – shahzad fida Dec 14 '16 at 15:58
@shahzadfida `a` and `b` don't stand for anything, they're just variable names. `c` stands for `Counter`. As @mgilson points out, this is an overly complex solution if you are guaranteed to have input in the form of strings. If your input could come from another iterator, like a generator, this solution doesn't rely on indexing and slicing, which some types don't support – Patrick Haugh Dec 14 '16 at 15:59
@shahzadfida `Counter` is a subtype of `dict`. – Patrick Haugh Dec 14 '16 at 16:00
Please use simple code, as I am learning bioinformatics and have little background in programming. But thanks a lot for your reply and time. – shahzad fida Dec 14 '16 at 16:10
@shahzadfida the other option is `defaultdict(int)`, which isn't more clear for a beginner. `Counter` is the way to go. – pylang Dec 14 '16 at 16:44
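
The suggestion in the comments to drop `tee` when the input is a string can be sketched as follows, assuming Python 3. The variable names follow the answer's first snippet, and the comments spell out what each one holds:

```python
from collections import Counter

dna = "GATCCAGATCCCCATAC"
a, b = iter(dna), iter(dna)   # two independent iterators over the same string
next(b)                       # advance b once so it runs one letter ahead of a
# zip pairs each letter with its right-hand neighbour: ('G','A'), ('A','T'), ...
c = Counter(x + y for x, y in zip(a, b))
print(c.most_common(1))       # [('CC', 4)]
```

This only works because a string can be iterated more than once; for a one-shot iterator such as a generator, `tee` is still needed.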

score 3 · Answer 3 · edited May 23 '17 at 12:16

If you want a simple approach, consider a sliding window technique. An implementation is available in more_itertools, so you don't have to make one yourself. This is easy to use if you pip install more_itertools.

Simple Example

>>> from collections import Counter
>>> import more_itertools

>>> s = "GATCCAGATCCCCATAC"
>>> Counter(more_itertools.windowed(s, 2))
Counter({('A', 'C'): 1,
         ('A', 'G'): 1,
         ('A', 'T'): 3,
         ('C', 'A'): 2,
         ('C', 'C'): 4,
         ('G', 'A'): 2,
         ('T', 'A'): 1,
         ('T', 'C'): 2})

The above example demonstrates what little is required to get most of the information you want using windowed and Counter.

Description

A "window" or container of length k=2 is sliding across the sequence one stride at a time (e.g. step=1). Each new group is added as a key to the Counter dictionary. For each occurrence, the tally is incremented. The final Counter object primarily reports all tallies and includes other helpful features.
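
To make the sliding idea concrete without installing anything, here is a hand-rolled sketch of a window built on `collections.deque`. The name `sliding_windows` is hypothetical, it assumes k is at least 1, and it is not how `more_itertools.windowed` is implemented; it only mimics the behaviour described above for inputs at least k items long:

```python
from collections import Counter, deque

def sliding_windows(seq, k):
    """Yield each length-k window of seq as a tuple, moving one item at a time."""
    window = deque(maxlen=k)   # a full deque drops its oldest item on append
    for item in seq:
        window.append(item)
        if len(window) == k:   # skip the first k-1 steps while the window fills
            yield tuple(window)

print(list(sliding_windows("GATC", 2)))
# [('G', 'A'), ('A', 'T'), ('T', 'C')]
print(Counter(sliding_windows("GATCCAGATCCCCATAC", 2)).most_common(1))
# [(('C', 'C'), 4)]
```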

Final Solution

If actual string pairs are important, that is simple too. We will make a general function that joins the letters of each window into a string and works for k-mers of any length:

>>> from collections import Counter
>>> import more_itertools

>>> def count_mers(seq, k=1):
...     """Return a counter of adjacent mers."""
...     return Counter(("".join(mers) for mers in more_itertools.windowed(seq, k)))

>>> s = "GATCCAGATCCCCATAC"
>>> count_mers(s, k=2)
Counter({'AC': 1,
         'AG': 1,
         'AT': 3,
         'CA': 2,
         'CC': 4,
         'GA': 2,
         'TA': 1,
         'TC': 2})

Are you asking where the numbers are coming from? They are the total occurrences of adjacent letters in the string, from left to right. The order in which keys appear in a dictionary is arbitrary (except in CPython 3.6 and later, which preserve insertion order). — pylang, Dec 15 '16 at 02:02
more_itertools returns an error, and if we simply import itertools, it still doesn't do the job for us. It reports that "module itertools has no attribute as window"? — shahzad fida, Dec 15 '16 at 02:07
This is a separate package from `itertools`. In your command prompt, you will want to `pip install more_itertools` for the more recent version. Once installed, then go to your python session and `import more_itertools`. You may have to restart your session. — pylang, Dec 15 '16 at 02:32
