-1

I was wondering, if I had a string that was read from a text file, what would be the most efficient way of splitting it into groups of 5 characters? For example:

I have a text file called dna.txt, and its contents are:

>human
ACCGTGAAAAACGTGAGTATA
>mouse
ACCAAAAGTGTAT

I then have a Python script that will store the 2nd and 4th lines of the text file.

import linecache

# linecache opens the file itself, so no separate open()/close() is needed
sequence_1 = linecache.getline('dna.txt', 2).strip()  # strip the trailing newline
sequence_2 = linecache.getline('dna.txt', 4).strip()

The goal is for the program to print out:

>human
ACCGT
GAAAA
ACGTG
AGTAT
A
>mouse
ACCAA
AAGTG
TAT

Like I said before, I've been trying to come up with an efficient way of breaking the 2 strings, but with no luck. Help would be much appreciated, thanks!

Évariste Galois

6 Answers

3

This should get you started:

human = "ACCGTGAAAAACGTGAGTATA"
print(' '.join(human[i:i+5] for i in range(0, len(human), 5)))

It's easy to generalize this into a generator that takes the string and the group size as arguments and yields the substrings:

def splitn(s, n):
    for i in range(0, len(s), n):
        yield s[i:i+n]

print(' '.join(splitn("ACCGTGAAAAACGTGAGTATA", 5)))
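
To connect this with the dna.txt layout from the question, here is a minimal sketch. It assumes header lines start with `>` and every other non-blank line is a sequence. `format_fasta` is a name made up for illustration, and `splitn` is repeated so the block runs on its own.

```python
def splitn(s, n):
    """Yield successive n-character slices of s (same generator as above)."""
    for i in range(0, len(s), n):
        yield s[i:i+n]

def format_fasta(lines, n=5):
    """Hypothetical helper: keep '>' header lines, split other lines into groups of n."""
    out = []
    for line in lines:
        line = line.strip()          # drop the trailing newline read from the file
        if not line:
            continue                 # skip blank lines
        if line.startswith(">"):
            out.append(line)         # header line is kept as-is
        else:
            out.extend(splitn(line, n))
    return "\n".join(out)

# Lines as they would come from iterating over open("dna.txt")
sample = [">human\n", "ACCGTGAAAAACGTGAGTATA\n", ">mouse\n", "ACCAAAAGTGTAT\n"]
print(format_fasta(sample))
```

With a real file, passing the open file object (for example inside `with open("dna.txt") as f:`) instead of `sample` should behave the same way, since iterating over a file yields its lines.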
NPE
1
>>> human = 'ACCGTGAAAAACGTGAGTATA'
>>> mouse = 'ACCAAAAGTGTAT'
>>> import re
>>> def format_dna(s):
...     return re.sub(r'(.{5})(?!$)', r'\g<1>\n', s)
...
>>> print(format_dna(human))
ACCGT
GAAAA
ACGTG
AGTAT
A
>>> print(format_dna(mouse))
ACCAA
AAGTG
TAT

re.sub does regular expression replacements in the string.

(.{5})(?!$) is the pattern to match. \g<1>\n is the pattern to substitute.

.{5} matches any five characters. With parens (.{5}) it's a capture group.

$ matches the end of the string. (?!$) is a negative lookahead assertion. This prevents the pattern from matching the last group if the string's length is a multiple of five (which would result in adding an unwanted newline at the end of the string).

\g<1> is a backreference that refers to the first (and only) capture group.

So this says: When you see five characters in a row (that aren't the last five), replace them with the five characters, plus a newline.
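
To see what the lookahead buys you, here is a small sketch comparing the pattern with and without `(?!$)` on a string whose length is an exact multiple of five. The ten-character input is just an illustrative value.

```python
import re

s = "ACCGTGAAAA"  # length 10, an exact multiple of five

# With the negative lookahead, the final group of five is left alone
with_lookahead = re.sub(r'(.{5})(?!$)', r'\g<1>\n', s)

# Without it, every group of five gets a newline, including the last one
without_lookahead = re.sub(r'(.{5})', r'\g<1>\n', s)

print(repr(with_lookahead))     # 'ACCGT\nGAAAA'
print(repr(without_lookahead))  # 'ACCGT\nGAAAA\n'
```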

Chris Martin
0

Before using this, look at the other answers that were posted while I was writing this up. This is a simple and far from optimal algorithm.

It's not very efficient, but you could do it with a simple loop.

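s = "abcdefghijklmnopqrstuvwxyz"  # avoid naming a variable str, which shadows the built-in
length = len(s)
i = 0
subs = []
while i < length:
    subs.append(s[i:i+5])  # append; assigning to subs[i/5] on an empty list raises IndexError
    i += 5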

This should leave subs holding the string in groups of five characters, with a shorter final group if the length is not a multiple of five.

David
0

You can do this fairly easily with a generator expression, or with a list comprehension in older versions of Python. For any string s and index i within the string, the expression s[i:i+5] evaluates to a substring of s of at most 5 characters starting at position i.

If i+5 happens to point past the end of the string, slicing conveniently suppresses any index errors and just returns the longest string it can.

So the expression

[sequence_1[i:i+5] for i in range(0, len(sequence_1), 5)]

should give you a list of the substrings you need.
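
A quick sketch of the slicing behaviour described above, using the mouse sequence from the question as sample input:

```python
sequence_2 = "ACCAAAAGTGTAT"   # 13 characters, the mouse sequence from the question

# The slice end (15) is past the end of the string, so only what is left comes back
print(sequence_2[10:15])         # TAT

# Even a start index past the end raises no IndexError; the result is just empty
print(repr(sequence_2[20:25]))   # ''

groups = [sequence_2[i:i+5] for i in range(0, len(sequence_2), 5)]
print(groups)                    # ['ACCAA', 'AAGTG', 'TAT']
```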

holdenweb
0
>>> import itertools
>>> def groupdna(long_seq,size=5):
...     # zip_longest is the Python 3 name; on Python 2 use itertools.izip_longest
...     groups = itertools.zip_longest(*[iter(long_seq)]*size,fillvalue="")
...     return list(map("".join,groups))
...
>>> groupdna(human,5)
['ACCGT', 'GAAAA', 'ACGTG', 'AGTAT', 'A']
>>> groupdna(mouse,5)
['ACCAA', 'AAGTG', 'TAT']
>>>

is a fun way to do it :P
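
For anyone wondering why this works, here is a rough sketch for Python 3 (where `izip_longest` is named `zip_longest`) with the one-liner unrolled. `[iter(seq)] * size` builds a list holding the same iterator object `size` times, so each tuple that zip produces pulls `size` consecutive characters. `group_chars` is just an illustrative name.

```python
from itertools import zip_longest

def group_chars(seq, size=5):
    it = iter(seq)           # one shared iterator over the characters
    columns = [it] * size    # the same iterator object, repeated size times
    # zip_longest takes one item from each "column" in turn, which means
    # size consecutive characters; fillvalue="" pads the final short group
    return ["".join(chunk) for chunk in zip_longest(*columns, fillvalue="")]

print(group_chars("ACCAAAAGTGTAT"))   # ['ACCAA', 'AAGTG', 'TAT']
```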

Joran Beasley
0
In [1]: import re

In [2]: s = "ACCGTGAAAAACGTGAGTATA"

In [3]: print("\n".join(re.findall(r"\w{5}|\w+",s)))
ACCGT
GAAAA
ACGTG
AGTAT
A

re.findall(r"\w{5}|\w+",s) finds runs of 5 word characters or, failing that, whatever run of one or more word characters remains.
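
As a small aside, the two alternatives can be folded into a single bounded quantifier. The sketch below is a variation, not the code that was timed: `.{1,5}` greedily takes up to five characters at a time, and unlike `\w` it also accepts non-word characters (except newlines).

```python
import re

s = "ACCGTGAAAAACGTGAGTATA"

two_branches = re.findall(r"\w{5}|\w+", s)   # pattern from the answer
one_quantifier = re.findall(r".{1,5}", s)    # variation: 1 to 5 of any character

print(two_branches)                    # ['ACCGT', 'GAAAA', 'ACGTG', 'AGTAT', 'A']
print(one_quantifier == two_branches)  # True
```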

Some timings:

In [72]: %timeit "\n".join(groupdna(s,5))
100000 loops, best of 3: 3.5 µs per loop

In [73]: timeit ('\n'.join(splitn(s, 5)))
100000 loops, best of 3: 2.22 µs per loop

In [74]: %timeit re.sub(r'(.{5})(?!$)', r'\g<1>\n', s)
100000 loops, best of 3: 5.24 µs per loop

In [75]: %timeit ("\n".join(re.findall(r"\w{5}|\w+",s)))
100000 loops, best of 3: 2.16 µs per loop
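
These numbers depend on the machine and Python version. Outside IPython, a similar comparison can be rerun with the standard `timeit` module. The sketch below times three of the approaches; the `candidates` dictionary and the loop count are choices made for illustration, and no output is shown because results will vary.

```python
import re
import timeit

s = "ACCGTGAAAAACGTGAGTATA"

def splitn(s, n):
    """Yield successive n-character slices of s."""
    for i in range(0, len(s), n):
        yield s[i:i+n]

candidates = {
    "splitn": lambda: "\n".join(splitn(s, 5)),
    "re.sub": lambda: re.sub(r'(.{5})(?!$)', r'\g<1>\n', s),
    "re.findall": lambda: "\n".join(re.findall(r"\w{5}|\w+", s)),
}

loops = 10_000
for name, func in candidates.items():
    # timeit.repeat returns total seconds per run of `loops` calls; keep the best of 3
    best = min(timeit.repeat(func, number=loops, repeat=3))
    print(f"{name}: {best / loops * 1e6:.2f} us per loop")
```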
Padraic Cunningham