-2

This may be easy to do, but as a beginner it does not seem trivial to me.

I have text like this, or a file containing this text:

'fdhdhjduvduvfbvhufbvufvhifbusdbjhkbueigvuerafvguavgugvg'

How can I use Python to split the text like this:

'fdh dhj duv duv fbv huf bvu fvh ifb usd bjh kbu eig vue raf vgu avg ugvg'
'f dhd hjd uvd uvf bvh ufb vuf vhi fbu sdb jhk bue igv uer afv gua vgu gvg'
'fd hdh jdu vdu vfb vhu fbv ufv hif bus dbj hkb uei gvu era fvg uav gug vg'

Then I need to calculate the frequency of each three-character sequence (for example, how many times 'fdh' occurs) and rank the sequences from most to least frequent.

I saw the answers here: What is the most "pythonic" way to iterate over a list in chunks?

But I do not know which one is right for my case. I also need to read the text from a file and write the result to another file. An expert opinion would be appreciated.

EDIT:

with open(fasta, 'r') as fin, open(outfile, 'w') as fout:
        for item in Counter(s[i:i+4] for i in range(len(fin))).most_common():
            fout.write(item)

This gives me an error:

TypeError: object of type '_io.TextIOWrapper' has no len()
plasmid
  • This may be useful to you: [How do you split a list into evenly sized chunks in Python?](http://stackoverflow.com/q/312443/953482) For frequency counting, try `collections.Counter`. You can learn how to read from and write to files in pretty much any Python tutorial, e.g. [this](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files) one. – Kevin Feb 19 '15 at 12:52
  • Looks like you want the [n-gram](http://en.wikipedia.org/wiki/N-gram) algorithm (or the trigram to be more specific). Python has a n-gram module, I would start there. – Paulo Scardine Feb 19 '15 at 13:05
  • `fin` is a file object and has no length. Modify to `fin.read()` – FuzzyDuck Feb 19 '15 at 15:07
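The fix suggested in the last comment can be sketched roughly as follows. This is a minimal sketch, not the asker's final code: the helper name `write_trigram_counts`, the tab-separated output format, and the temporary file paths are all hypothetical, and it assumes the input file holds only sequence text (no header lines). It also uses a window of 3 rather than the 4 in the snippet above, since the question asks about three-character sequences.

```python
import os
import tempfile
from collections import Counter

def write_trigram_counts(in_path, out_path, size=3):
    # Read the whole file as one string; a file object itself has no len().
    with open(in_path, 'r') as fin:
        # Assumption: the file holds only sequence text, so joining the
        # whitespace-separated pieces just removes line breaks.
        s = ''.join(fin.read().split())
    # Slide a window of `size` characters; stop so every window is full length.
    counts = Counter(s[i:i + size] for i in range(len(s) - size + 1))
    with open(out_path, 'w') as fout:
        for chunk, n in counts.most_common():
            # write() needs a string, not a (chunk, count) tuple.
            fout.write('{}\t{}\n'.format(chunk, n))
    return counts

# Demo with temporary files (hypothetical paths, for illustration only).
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'input.txt')
out_path = os.path.join(tmp_dir, 'counts.txt')
with open(in_path, 'w') as f:
    f.write('fdhdhjduvduvfbvhufbvufvhifbusdbjhkbueigvuerafvguavgugvg\n')
counts = write_trigram_counts(in_path, out_path)
with open(out_path) as f:
    lines = f.read().splitlines()
```

Note that `fout.write(item)` in the original snippet would also fail once the `len()` problem is fixed, because `most_common()` yields tuples rather than strings.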

2 Answers

0

Use regular expressions to split the string into chunks of 3, then use a dictionary comprehension to generate a dict which counts occurrences of each chunk.

import re

chunked = re.findall('...', your_string)
result = {key: chunked.count(key) for key in set(chunked)}

EDIT: to do the chunking without regex, and to capture the different ways of partitioning the string into chunks of 3, use a list comprehension:

chunked = [your_string[i:i+3] for i in range(len(your_string))]

It's inelegant, but to handle the 'f' and 'fd' cases, you can simply concatenate these to the end of chunked:

chunked = [your_string[i:i+3] for i in range(len(your_string))] + [your_string[:1], your_string[:2]]

Then apply the dictionary comprehension as before:

result = {key: chunked.count(key) for key in set(chunked)}

Result:

{'afv': 1,
'avg': 1,
'bjh': 1,
'bue': 1,
'bus': 1,
'bvh': 1,
'bvu': 1,
'dbj': 1,
'dhd': 1,
'dhj': 1,
'duv': 2,
'eig': 1,
'era': 1,
'f': 1,
'fbu': 1,
'fbv': 2,
'fd': 1,
'fdh': 1,
'fvg': 1,
'fvh': 1,
'g': 1,
'gua': 1,
'gug': 1,
'gvg': 1,
'gvu': 1,
'hdh': 1,
'hif': 1,
'hjd': 1,
'hkb': 1,
'huf': 1,
'ifb': 1,
'igv': 1,
'jdu': 1,
'jhk': 1,
'kbu': 1,
'raf': 1,
'sdb': 1,
'uav': 1,
'uei': 1,
'uer': 1,
'ufb': 1,
'ufv': 1,
'ugv': 1,
'usd': 1,
'uvd': 1,
'uvf': 1,
'vdu': 1,
'vfb': 1,
'vg': 1,
'vgu': 2,
'vhi': 1,
'vhu': 1,
'vue': 1,
'vuf': 1}
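The question also asks for a ranking, and calling `chunked.count` once per distinct key rescans the list every time. As a hedged alternative, the sketch below builds what should be the same mapping in a single pass with `collections.Counter` and then sorts it. The name `ranked` and the alphabetical tie-breaking are illustrative choices, not part of the answer above.

```python
from collections import Counter

your_string = 'fdhdhjduvduvfbvhufbvufvhifbusdbjhkbueigvuerafvguavgugvg'

# Same chunk list as in the answer: every window, plus the 'f' and 'fd' prefixes.
chunked = [your_string[i:i + 3] for i in range(len(your_string))]
chunked += [your_string[:1], your_string[:2]]

# One pass over the list instead of one chunked.count() scan per distinct key.
result = dict(Counter(chunked))

# Rank from most to least frequent; ties are broken alphabetically.
ranked = sorted(result.items(), key=lambda kv: (-kv[1], kv[0]))

for chunk, n in ranked[:5]:
    print(chunk, n)
```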
FuzzyDuck
0
>>> from collections import Counter
>>> s = 'fdhdhjduvduvfbvhufbvufvhifbusdbjhkbueigvuerafvguavgugvg'
>>> for item in Counter(s[i:i+3] for i in range(len(s))).most_common():
...     print(item)
... 
('fbv', 2)
('vgu', 2)
('duv', 2)
('raf', 1)
('fbu', 1)
('dbj', 1)
('uei', 1)
('bvu', 1)
('vg', 1)
('bjh', 1)
('hjd', 1)
('bvh', 1)
('uvd', 1)
('ugv', 1)
('uvf', 1)
('kbu', 1)
('igv', 1)
('usd', 1)
('dhj', 1)
('fvh', 1)
('fvg', 1)
('dhd', 1)
('gvg', 1)
('afv', 1)
('uer', 1)
('gvu', 1)
('huf', 1)
('eig', 1)
('bus', 1)
('ufb', 1)
('avg', 1)
('sdb', 1)
('hif', 1)
('hkb', 1)
('gug', 1)
('uav', 1)
('ufv', 1)
('bue', 1)
('vuf', 1)
('gua', 1)
('vue', 1)
('vdu', 1)
('g', 1)
('vhu', 1)
('fdh', 1)
('jhk', 1)
('vfb', 1)
('vhi', 1)
('era', 1)
('ifb', 1)
('jdu', 1)
('hdh', 1)
John La Rooy
  • But what if I want to print it like this: 'fdh dhj duv duv fbv huf bvu fvh ifb usd bjh kbu eig vue raf vgu avg ugv' 'dhd hjd uvd uvf bvh ufb vuf vhi fbu sdb jhk bue igv uer afv gua vgu gvg' 'hdh jdu vdu vfb vhu fbv ufv hif bus dbj hkb uei gvu era fvg uav gug' What modification do I need to make? – plasmid Feb 19 '15 at 13:25
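One possible way to get the output asked for in this comment is to chunk the string once per starting offset and join each set of chunks with spaces. The sketch below is an assumption about what is wanted, not code from either answer: the helper name `reading_frames` is hypothetical, and trailing pieces shorter than three characters are dropped, matching the three lines quoted in the comment.

```python
def reading_frames(text, size=3):
    """Return one space-separated string per starting offset 0..size-1."""
    frames = []
    # Last index at which a full-length chunk can still start, plus one.
    stop = len(text) - size + 1
    for offset in range(size):
        # Step by `size` so chunks within a frame do not overlap.
        chunks = [text[i:i + size] for i in range(offset, stop, size)]
        frames.append(' '.join(chunks))
    return frames

s = 'fdhdhjduvduvfbvhufbvufvhifbusdbjhkbueigvuerafvguavgugvg'
frames = reading_frames(s)
for frame in frames:
    print(frame)
```

To count across all three frames, the chunks from each frame could be fed to a single `Counter`; for a window of three this gives the same totals as the sliding-window approach in the answer, minus the short trailing pieces.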