
I have a file that contains millions of sequences, one per line. I want to get all the 5mers from the sequence on each line of my file.

My file looks like this:

CGATGCATAGGAA
GCAGGAGTGATCC

My code is:

with open('test.txt','r') as file:
    for line in file:
        for i in range(len(line)):
            kmer = str(line[i:i+5])
            if len(kmer) == 5:
                print(kmer)
            else:
                pass

With this code I should not get 4mers, but I do, even though I have an if statement that checks for a length of 5. Could anyone help me with this? Thanks.

My output is:

CGATG
GATGC
ATGCA
TGCAT
GCATA
CATAG
ATAGG
TAGGA
AGGAA
GGAA

GCAGG
CAGGA
AGGAG
GGAGT
GAGTG
AGTGA
GTGAT
TGATC
GATCC
ATCC

But the ideal output should contain only the substrings of length 5 (for each line separately):

CGATG
GATGC
ATGCA
TGCAT
GCATA
CATAG
ATAGG
TAGGA
AGGAA

GCAGG
CAGGA
AGGAG
GGAGT
GAGTG
AGTGA
GTGAT
TGATC
GATCC
Apex
  • What is a 5mer? When you looked at (inspected/printed) intermediate values did you notice anything wrong? If you are using an IDE **now** is a good time to learn its debugging features – wwii Feb 08 '21 at 19:18
  • Try: `for line...: line = line.strip(); for i in range(len(line)-4):...` – wwii Feb 08 '21 at 19:31
  • If you are using an IDE **now** is a good time to learn its debugging features Or the built-in [Python debugger](https://docs.python.org/3/library/pdb.html). Printing *stuff* at strategic points in your program can help you trace what is or isn't happening. [What is a debugger and how can it help me diagnose problems?](https://stackoverflow.com/questions/25385173/what-is-a-debugger-and-how-can-it-help-me-diagnose-problems) – wwii Feb 08 '21 at 19:32
  • 5mer means to read each line by five letters - so for the line in the file it should be CGATG then GATGC then ATGCA, ... until it reaches the end. These outputs should be printed separately for each line in the txt file. – Apex Feb 08 '21 at 20:46
  • I updated the question – Apex Feb 08 '21 at 20:48
  • The k-mers in a string are all the substrings of length k. So, given the string "ELEPHANT", and k-mer length k=4, the k-mers are: ELEP LEPH EPHA PHAN HANT – Apex Feb 08 '21 at 20:51

1 Answer

When you iterate over a file, each line you get back still includes its line ending. In particular, the last character of each of those lines is a newline \n, and it ends up in what you print.

with open('test.txt') as f:
    data = list(f)

# data[0] == 'CGATGCATAGGAA\n'
# data[1] == 'GCAGGAGTGATCC\n'
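
A quick way to confirm this without the file is to slice a string literal that ends in a newline, the way a line read from the file does. This is only a sketch; the variable names are illustrative and the length check is the one from the question:

```python
# One line as it would come from iterating over the file, newline included.
line = 'CGATGCATAGGAA\n'

# Collect every slice that passes the question's len(kmer) == 5 check.
kmers = [line[i:i+5] for i in range(len(line)) if len(line[i:i+5]) == 5]

# repr() makes the trailing newline visible in the last slice.
print(repr(kmers[-1]))
print(len(kmers))
```

The last slice prints as 'GGAA\n', and the count is 10 rather than the 9 5mers expected for a 13-letter sequence.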

So the very last substring you're trying to print from the first line is 'GGAA\n'. It has a length of 5, so it passes your check, but its fifth character is a newline: you see four letters followed by a blank line, which looks like a 4mer. One of the comments proposed a satisfactory solution, but once you know the root of the problem you have lots of options:

with open('test.txt', 'r') as file:
    for line_no, line in enumerate(file):
        if line_no: print()  # for the space between chunks which you seem to want in your final output -- omit if not desired
        line = line.strip()  # remove surrounding whitespace, including the pesky newlines
        for i in range(len(line)):
            kmer = str(line[i:i+5])
            if len(kmer) == 5:
                print(kmer)
            else:
                pass
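
Another option, along the lines of the comment's suggestion, is to stop the loop at the last index that still leaves five characters, so no length check is needed. The sketch below is one possible way to write that; `iter_kmers` and `kmers_per_line` are hypothetical helper names, not part of any library, and the sample list stands in for the open file:

```python
def iter_kmers(sequence, k=5):
    """Yield every substring of length k from sequence, left to right."""
    # The range stops at the last start index that still leaves k characters,
    # so every slice has exactly length k. Short sequences yield nothing.
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


def kmers_per_line(lines, k=5):
    """Yield one list of k-mers for each non-empty line."""
    for line in lines:
        sequence = line.strip()  # drop the trailing newline
        if sequence:
            yield list(iter_kmers(sequence, k))


# The two sample lines from the question, newlines included.
sample = ['CGATGCATAGGAA\n', 'GCAGGAGTGATCC\n']
for line_no, kmers in enumerate(kmers_per_line(sample)):
    if line_no:
        print()  # blank line between sequences
    print('\n'.join(kmers))
```

Passing the open file object instead of `sample` should work the same way, since the file is read one line at a time rather than loaded whole, which matters with millions of sequences.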
Hans Musgrave