Splitting string into groups of "x" length

Question

I was wondering, if I had a string that was read from a text file, what would be the most efficient way of splitting it into groups of 5 characters? For example:

I have a text file called dna.txt, and its contents are:

>human
ACCGTGAAAAACGTGAGTATA
>mouse
ACCAAAAGTGTAT

I then have a Python script that will store the 2nd and 4th lines of the text file.

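import linecache

# getline() opens the file itself, so no separate open()/close() is needed.
# strip() drops the trailing newline that getline() keeps on each line.
sequence_1 = linecache.getline('dna.txt', 2).strip()
sequence_2 = linecache.getline('dna.txt', 4).strip()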

The goal is for the program to print out:

>human
ACCGT
GAAAA
ACGTG
AGTAT
A
>mouse
ACCAA
AAGTG
TAT

Like I said before, I've been trying to come up with an efficient way of breaking the 2 strings, but with no luck. Help would be much appreciated, thanks!

Related: http://stackoverflow.com/questions/434287/what-is-the-most-pythonic-way-to-iterate-over-a-list-in-chunks — Phillip, Aug 07 '14 at 18:58
No. I'm new to Python and don't really know how to approach it. — Évariste Galois, Aug 07 '14 at 18:59

NPE · Answer 1 · 2014-08-07T19:08:36.153

3

This should get you started:

human = "ACCGTGAAAAACGTGAGTATA"
print(' '.join(human[i:i+5] for i in range(0, len(human), 5)))

It's easy to generalize this into a generator that takes human and 5 as arguments and yields the substrings:

def splitn(s, n):
    for i in range(0, len(s), n):
        yield s[i:i+n]

print(' '.join(splitn("ACCGTGAAAAACGTGAGTATA", 5)))
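To get the exact output the question asks for, the generator can be combined with a pass over the file's lines. The following is only a sketch: format_fasta is a hypothetical helper name, and the file contents are inlined as a list of lines (assuming the header-then-sequence layout shown in the question) so the example runs without dna.txt.

```python
def splitn(s, n):
    """Yield successive n-character pieces of s; the last may be shorter."""
    for i in range(0, len(s), n):
        yield s[i:i+n]


def format_fasta(lines, n=5):
    """Hypothetical helper: keep '>' header lines, split other lines into n-character rows."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            out.append(line)
        else:
            out.extend(splitn(line, n))
    return "\n".join(out)


# Stand-in for the contents of dna.txt (assumed layout: header line, then sequence line).
dna_lines = [">human", "ACCGTGAAAAACGTGAGTATA", ">mouse", "ACCAAAAGTGTAT"]
print(format_fasta(dna_lines))
```

With a real file, passing the open file object instead of dna_lines should behave the same way, since a file object iterates over its lines.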

edited Aug 07 '14 at 19:08

answered Aug 07 '14 at 18:58

NPE


I'm using the latest version of Python 3, is that the proper format for it? Or is that for 2? – Évariste Galois Aug 07 '14 at 19:00
@Muffinman: The `In[x]:` and `Out[x]:` bits are IPython prompts: http://ipython.org/. – NPE Aug 07 '14 at 19:01

Chris Martin · Accepted Answer · 2014-08-07T19:55:52.717

1

>>> human = 'ACCGTGAAAAACGTGAGTATA'
>>> mouse = 'ACCAAAAGTGTAT'
>>> import re
>>> def format_dna(s):
...     return re.sub(r'(.{5})(?!$)', r'\g<1>\n', s)
...
>>> print(format_dna(human))
ACCGT
GAAAA
ACGTG
AGTAT
A
>>> print(format_dna(mouse))
ACCAA
AAGTG
TAT

re.sub does regular expression replacements in the string.

(.{5})(?!$) is the pattern to match. \g<1>\n is the replacement.

.{5} matches any five characters (other than newlines, by default). Wrapped in parentheses, (.{5}) is a capture group.

$ matches the end of the string. (?!$) is a negative lookahead assertion. This prevents the pattern from matching the last group if the string's length is a multiple of five (which would result in adding an unwanted newline at the end of the string).

\g<1> is a backreference that refers to the first (and only) capture group.

So this says: When you see five characters in a row (that aren't the last five), replace them with the five characters, plus a newline.
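The effect of the lookahead is easiest to see on an input whose length is an exact multiple of five. A small sketch, where seq is a made-up 10-character string rather than one of the sequences from the question:

```python
import re

seq = "ACCGTGAAAA"  # hypothetical 10-character input, an exact multiple of five

# Without the lookahead, the final group also gets a newline appended.
without_lookahead = re.sub(r'(.{5})', r'\g<1>\n', seq)

# With (?!$), a group that ends at the end of the string is left alone.
with_lookahead = re.sub(r'(.{5})(?!$)', r'\g<1>\n', seq)

print(repr(without_lookahead))
print(repr(with_lookahead))
```

The first result ends in a newline and the second does not, which is the only difference between the two patterns on this input.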

edited Aug 07 '14 at 19:55

answered Aug 07 '14 at 19:01

Chris Martin


All of the In [x]: parts result in the error: End of Statement Expected, and Statement expected, found Py:COLON – Évariste Galois Aug 07 '14 at 19:05
Reformatted it to look like a standard python prompt. – Chris Martin Aug 07 '14 at 19:06
It works, but could you explain the 5th line to me? Is that just purely for formatting? – Évariste Galois Aug 07 '14 at 19:09
Updated the answer to explain. – Chris Martin Aug 07 '14 at 19:56

score 0 · Answer 3 · answered Aug 07 '14 at 19:03

0

Before using this, look at the other answers that were posted while I was writing this up. This is a simple and very sub-optimal algorithm.

It's not very efficient, but you could do it with a simple loop.

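s = "abcdefghijklmnopqrstuvwxyz"  # renamed from str to avoid shadowing the built-in
length = len(s)
i = 0
subs = []
while i < length:
    # append each group; assigning to subs[i/5] fails on an empty list
    subs.append(s[i:i+5])
    i += 5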

This should leave subs holding the groups in order, each five characters long except possibly the last.

answered Aug 07 '14 at 19:03

David


subs[i/5] = str[i:i+5] TypeError: list indices must be integers, not float – Évariste Galois Aug 07 '14 at 19:06
@Muffinman sorry, cast it to int. I/5 should always give you an int (0/5,5/5,10/5,etc) – David Aug 07 '14 at 19:09
@David not in python3 ... you need `3//5` for an int – Joran Beasley Aug 07 '14 at 19:18
@PadraicCunningham I know. It was simply an example. – David Aug 07 '14 at 19:40

score 0 · Answer 4 · answered Aug 07 '14 at 19:11

You can do this fairly easily with a generator expression, or with a list comprehension in older versions of Python. For any string s and index i within the string, the expression s[i:i+5] evaluates to a substring of s, at most 5 characters long, starting at position i.

If i+5 points past the end of the string, slicing suppresses the index error and simply returns the longest substring it can.

So the expression

[sequence_1[i:i+5] for i in range(0, len(sequence_1), 5)]

should give you a list of the substrings you need.
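The clamping behaviour described above can be checked directly. A minimal sketch using the mouse sequence from the question (the variable names tail, beyond and groups are just for illustration):

```python
mouse = "ACCAAAAGTGTAT"  # 13 characters, taken from the question

# A slice that runs past the end is clamped rather than raising IndexError.
tail = mouse[10:15]
beyond = mouse[15:20]  # starts past the end, so the result is empty

groups = [mouse[i:i+5] for i in range(0, len(mouse), 5)]
print(tail, repr(beyond), groups)
```

Plain indexing is stricter: mouse[15] raises IndexError, which is why the slice form is the convenient one here.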

Joran Beasley · Answer 5 · 2014-08-07T20:38:10.000

0

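>>> try:
...     from itertools import zip_longest  # Python 3
... except ImportError:
...     from itertools import izip_longest as zip_longest  # Python 2
...
>>> def groupdna(long_seq, size=5):
...     # one shared iterator repeated `size` times, so each tuple takes `size` consecutive characters
...     groups = zip_longest(*[iter(long_seq)]*size, fillvalue="")
...     return list(map("".join, groups))
...
>>> groupdna(human, 5)
['ACCGT', 'GAAAA', 'ACGTG', 'AGTAT', 'A']
>>> groupdna(mouse, 5)
['ACCAA', 'AAGTG', 'TAT']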

is a fun way to do it :P
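For anyone puzzled by why this works, here is a rough unpacking of the trick for Python 3, with a group size of 3 and a made-up input so the intermediate tuples stay short. The names it, triples and pieces are only for illustration.

```python
from itertools import zip_longest

seq = "ACCGTGA"  # hypothetical short input

it = iter(seq)
# Passing the same iterator three times means zip_longest pulls three
# consecutive characters from it for each output tuple.
triples = list(zip_longest(it, it, it, fillvalue=""))
print(triples)

# Joining each tuple turns the padding ("" fill values) into nothing.
pieces = ["".join(t) for t in triples]
print(pieces)
```

Writing *[iter(seq)]*3 is the same call spelled compactly: the list holds one iterator object three times, and the star unpacks it into three arguments.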

edited Aug 07 '14 at 20:38

answered Aug 07 '14 at 19:20

Joran Beasley


meh I just changed it to use izip_longest instead ... and broke it out into slightly more bytesize chunks – Joran Beasley Aug 07 '14 at 20:39
Yep that will work except it is `zip_longest` in python 3 – Padraic Cunningham Aug 07 '14 at 20:49

Padraic Cunningham · Answer 6 · 2014-08-07T20:49:03.793

In [1]: import re

In [2]: s = "ACCGTGAAAAACGTGAGTATA"

In [3]: print("\n".join(re.findall(r"\w{5}|\w+", s)))
ACCGT
GAAAA
ACGTG
AGTAT
A

re.findall(r"\w{5}|\w+", s) finds runs of five word characters and, where fewer than five remain, one shorter run for whatever is left over.
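The order of the two alternatives matters: the five-character branch is tried first at each position, and the fallback only matches the leftover. A short sketch on the mouse sequence; the bounded-quantifier variant is an assumed alternative that is not part of this answer.

```python
import re

mouse = "ACCAAAAGTGTAT"

# \w{5} is tried first at each position; \w+ only matches when fewer
# than five word characters remain.
alternation = re.findall(r"\w{5}|\w+", mouse)

# Assumed alternative, not from the answer: a greedy bounded quantifier
# takes five characters whenever five are available.
bounded = re.findall(r".{1,5}", mouse)
print(alternation, bounded)
```

Both give the same groups here. They can differ on other input, since \w skips anything that is not a word character while . skips only newlines.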

Some timings:

In [72]: %timeit "\n".join(groupdna(s,5))
100000 loops, best of 3: 3.5 µs per loop

In [73]: timeit ('\n'.join(splitn(s, 5)))
100000 loops, best of 3: 2.22 µs per loop

In [74]: %timeit re.sub(r'(.{5})(?!$)', r'\g<1>\n', s)
100000 loops, best of 3: 5.24 µs per loop

In [75]: %timeit ("\n".join(re.findall(r"\w{5}|\w+", s)))
100000 loops, best of 3: 2.16 µs per loop
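To rerun the comparison outside IPython, something like the sketch below should work on Python 3. It assumes the splitn and groupdna definitions from the other answers (repeated here so the script is self-contained). The candidates dictionary and the choice of 10000 repetitions are arbitrary, and the numbers will differ by machine and Python version.

```python
import re
import timeit
from itertools import zip_longest

s = "ACCGTGAAAAACGTGAGTATA"


def splitn(s, n):
    for i in range(0, len(s), n):
        yield s[i:i+n]


def groupdna(long_seq, size=5):
    groups = zip_longest(*[iter(long_seq)] * size, fillvalue="")
    return list(map("".join, groups))


# One zero-argument callable per approach, keyed by a label for the report.
candidates = {
    "zip_longest": lambda: "\n".join(groupdna(s, 5)),
    "slicing": lambda: "\n".join(splitn(s, 5)),
    "re.sub": lambda: re.sub(r"(.{5})(?!$)", r"\g<1>\n", s),
    "re.findall": lambda: "\n".join(re.findall(r"\w{5}|\w+", s)),
}

# Run each approach once so the outputs can be compared before timing.
results = {name: fn() for name, fn in candidates.items()}

repeats = 10000  # arbitrary; raise it for steadier numbers
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=repeats)
    print(f"{name}: {seconds / repeats * 1e6:.2f} µs per call")
```

Comparing the entries of results first is a cheap way to confirm that all four approaches produce the same string before reading anything into the timings.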
