2

I have a string that contains a sequence of nucleotides. The string is 1191 nucleotides long.

How do I print the sequence so that each line has at most 100 nucleotides? Right now the slices are hard-coded, but I would like it to work for a nucleotide string of any length. Here is the code I have now:

def printinfasta(SeqName, Sequence, SeqDescription):
    print(SeqName + " " + SeqDescription)
    #how do I make sure to only have 100 nucleotides per line?
    print(Sequence[0:100])
    print(Sequence[100:200])
    print(Sequence[200:300])
    print(Sequence[300:400])
    print(Sequence[400:500])
    print(Sequence[500:600])
    print(Sequence[600:700])
    print(Sequence[700:800])
    print(Sequence[800:900])
    print(Sequence[900:1000])
    print(Sequence[1000:1100])
    print(Sequence[1100:1191])
Sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
# SeqName and SeqDescription are assumed to be defined elsewhere.
printinfasta(SeqName, Sequence, SeqDescription)
Trenton McKinney
cam
  • Here is an answer of interest [DNA to RNA using str.translate()](https://stackoverflow.com/questions/32018654) – Trenton McKinney Sep 15 '20 at 00:19
  • Does this answer your question? [Split a string to even sized chunks](https://stackoverflow.com/questions/21351275/split-a-string-to-even-sized-chunks) – Trenton McKinney Sep 15 '20 at 00:24

6 Answers

4

You can use textwrap.wrap to split a long string into a list of strings, each at most a given width:

import textwrap

seq = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"
print('\n'.join(textwrap.wrap(seq, width=100)))
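For comparison, the same wrapping can be done with plain slicing, which generalises the hard-coded slices in the question. This is only a minimal sketch: wrap_sequence and its width parameter are illustrative names, not part of textwrap. Unlike textwrap.wrap, which is designed for prose and treats whitespace as a place to break, slicing cuts at fixed offsets regardless of the characters involved.

```python
def wrap_sequence(sequence, width=100):
    """Return the sequence split into consecutive chunks of at most `width` characters."""
    # range() steps through the start offsets 0, width, 2*width, ...
    # Slicing past the end of a string is safe, so the last chunk is simply shorter.
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


if __name__ == "__main__":
    demo = "ACGT" * 60  # hypothetical 240-nucleotide sequence
    print("\n".join(wrap_sequence(demo)))
```

Because the chunks are computed from len(sequence), nothing has to change when the sequence length does.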
Blownhither Ma
2

You can use itertools.zip_longest and some iter magic to get this in one line:

from itertools import zip_longest

sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT" 

output = [''.join(filter(None, s)) for s in zip_longest(*([iter(sequence)]*100))]

Or:

for s in zip_longest(*([iter(sequence)]*100)):
    print(''.join(filter(None, s)))
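The trick relies on [iter(sequence)]*100 holding 100 references to the same iterator, so each tuple produced by zip_longest consumes 100 consecutive characters. A possible variant, sketched here under the assumption that padding with an empty string is preferable to filtering out None, is shown below; chunks and size are illustrative names.

```python
from itertools import zip_longest


def chunks(sequence, size=100):
    """Yield consecutive pieces of `sequence`, each at most `size` characters long."""
    # One iterator repeated `size` times: every tuple pulls `size` items in order.
    shared = [iter(sequence)] * size
    # fillvalue="" pads the final, shorter tuple with empty strings,
    # which vanish when the tuple is joined.
    for group in zip_longest(*shared, fillvalue=""):
        yield "".join(group)


if __name__ == "__main__":
    for line in chunks("ACGT" * 60):  # hypothetical 240-nucleotide sequence
        print(line)
```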
Jab
2

A possible solution is to use the re module: the pattern .{1,N} matches runs of up to N characters, so findall returns the chunks in order.

import re

def splitstring(strg, leng):
    chunks = re.findall('.{1,%d}' % leng, strg)
    for i in chunks:
        print(i)


splitstring(strg=seq, leng=100)  # seq is the nucleotide string to wrap
Agaz Wani
2

I assume that your sequence is in FASTA format. If this is the case, you can use any of a number of bioinformatics packages that provide FASTA sequence wrapping utilities. For example, you can use FASTX-Toolkit. Its FASTA Formatter command-line utility wraps FASTA sequences, for example to a maximum of 100 nucleotides per line:

fasta_formatter -i INFILE -o OUTFILE -w 100

You can install FASTX-Toolkit package using conda, for example:
conda install fastx_toolkit
or
conda create -n fastx_toolkit fastx_toolkit

Note that if you end up writing the (simple) code to wrap FASTA sequences from scratch, remember that the header lines (the lines starting with >) should not be wrapped. Wrap only the sequence lines.
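If you do write it from scratch, the header rule could look roughly like the sketch below. It is not how FASTX-Toolkit works internally; wrap_fasta and its parameters are illustrative names, and it assumes the input is an iterable of text lines in which every header starts with >.

```python
def wrap_fasta(lines, width=100):
    """Yield FASTA lines with sequence data re-wrapped to `width` columns."""
    pending = []  # sequence fragments collected for the current record

    def flush():
        # Join the fragments of one record and cut them at fixed offsets.
        seq = "".join(pending)
        pending.clear()
        for i in range(0, len(seq), width):
            yield seq[i:i + width]

    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # Emit the previous record's sequence, then the header unchanged.
            yield from flush()
            yield line
        elif line:
            pending.append(line)
    # Emit the sequence of the final record.
    yield from flush()


if __name__ == "__main__":
    record = [">seq1 example record", "ACGT" * 30, "ACGT" * 30]  # hypothetical input
    print("\n".join(wrap_fasta(record)))
```

Collecting the fragments before cutting means the function also works on input that is already wrapped at some other width.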

SEE ALSO:

Convert single line fasta to multi line fasta

Timur Shtatland
2

You can use a helper function based on itertools.zip_longest. The helper function has been designed to (also) handle cases where the sequence isn't an exact multiple of the size of the equal parts (the last group will have fewer elements than those before it).

from itertools import zip_longest


def grouper(n, iterable):
    """ s -> (s0,s1,...sn-1), (sn,sn+1,...s2n-1), (s2n,s2n+1,...s3n-1), ... """
    FILLER = object()  # Value that couldn't be in data.
    for result in zip_longest(*[iter(iterable)]*n, fillvalue=FILLER):
        yield ''.join(v for v in result if v is not FILLER)


def printinfasta(SeqName, Sequence, SeqDescription):
    print(SeqName + " " + SeqDescription)
    for group in grouper(100, Sequence):
        print(group)

Sequence = "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTAAATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCCTAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTTTGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACATTTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT"

printinfasta('Name', Sequence, 'Description')

Sample output:

Name Description
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
CCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAATCCATAAATCCCTAAAACCATAATCCTAAATCCCTTAATTCCTA
AATCCCTAATACTTAGACCCTAATCTTTAGTTCCTAGACCCTAATCTTTAGTTCCTAGACCCTAAATCCATAATCCTTAATTCCTAAATTCCTAAATCCC
TAATACTAAATCTCTAAATCCCTAGCAATTTTCAAGTTTTGCTTGATTGTTGTAGGATGGTCCTTTCTCTTGTTTCTTCTCTGTGTTGTTGAGATTAGTT
TGTTTAGGTTTGATAGCGTTGATTTTGGCCTGCGTTTGGTGACTCATATGGTTTGATTGGAGTTTGTTTCTGGGTTTTATGGTTTTGGTTGAAGCGACAT
TTTTTTGTGGAATATGGTTTTTGCAAAATATTTTGTTCCGGATGAGTAATATCTACGGTGCTGCTGTGAGAATTATGCTATTGTTTT
martineau
0

Package cytoolz (installable using pip install cytoolz) provides a function partition_all that can be used here:

#!/usr/bin/env python3
from cytoolz import partition_all

def printinfasta(name, seq, descr):
    header = f">{name} {descr}"
    print(header)
    print(*map("".join, partition_all(100, seq)), sep="\n")


printinfasta("test", 468 * "ACGTGA", "this is a test")

partition_all(100, seq) generates tuples of 100 letters each taken from seq, plus a final shorter one if the number of letters is not a multiple of 100.

The map("".join, ...) is used to group letters within each such tuple into a single string.

The * in front of map unpacks its results so that they are passed to print as separate arguments, which sep="\n" then places on separate lines.
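If installing cytoolz is not an option, the behaviour described above can be approximated with the standard library. The sketch below is an assumption about what is needed here rather than a copy of the cytoolz implementation; partition_all_sketch and format_fasta are illustrative names. On Python 3.12 and later, itertools.batched covers much the same ground.

```python
from itertools import islice


def partition_all_sketch(n, iterable):
    """Yield tuples of up to `n` items; only the last tuple may be shorter."""
    it = iter(iterable)
    while True:
        # islice takes the next n items, or fewer if the iterator runs out.
        chunk = tuple(islice(it, n))
        if not chunk:
            return
        yield chunk


def format_fasta(name, seq, descr, width=100):
    """Return a FASTA record as one string, with the sequence wrapped at `width`."""
    lines = [f">{name} {descr}"]
    lines.extend(map("".join, partition_all_sketch(width, seq)))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_fasta("test", 468 * "ACGTGA", "this is a test"))
```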

bli