File format:

>ackg_2341
ACGATACGACGACATCA
>ackg_7865
GCACTACGCAGAAACGAA
>...

I want to skip every line that starts with '>'. I tried the following, but it isn't working.

f = open("data.txt","r")
    lcs = ''
    if f.read(1)=='>':
        str1 = f.readline[1:]
    for line in f:
        if line.read(1)=='>'
        temp = ''.join(f.readline[1:])
        res = len(lcs_matrix(str1,temp))

        if len(lcs)<res:
            lcs = lcs_matrix(str1, temp)

print(lcs)

What am I doing wrong?

Traceback (most recent call last):
  File "shared_substr.py", line 83, in <module>
   print(DNA_multi_lcs())
  File "shared_substr.py", line 68, in DNA_multi_lcs
    str1 = f.readline[1:]
TypeError: 'builtin_function_or_method' object is not subscriptable
Patrick Artner
  • `f.readline` is a _method_, like the error message says. Methods and functions are not subscriptable. You meant to _call_ the method like `f.readline()` – ForceBru Jan 09 '20 at 20:50
  • `with open("file.txt") as f:`; `for line in f:`; `if line.startswith(">"): continue` ... `else:` ... seems a more logical way to do what you want to do – Patrick Artner Jan 09 '20 at 21:54
  • @Patrick Artner I wanted to keep the first string separate and essentially do a comparison with the rest of the strings in the file. Do you know how to set the range of reading the file? For example the for loop should start from 3rd line and iterate through rest of lines in file. – gaurav shinde Jan 09 '20 at 22:03
  • Does this answer your question? [Read file from line 2 or skip header row](https://stackoverflow.com/questions/4796764/read-file-from-line-2-or-skip-header-row) – Patrick Artner Jan 10 '20 at 06:22
  • `f.readlines()[2:]` will read the whole file into memory though; calling `next(f)` twice and then using `for line in f:` keeps memory use low, because only one line is read at a time after the first two are skipped. – Patrick Artner Jan 10 '20 at 06:24
  • @PatrickArtner This is what I wanted to do. I only want the data from the DNA code lines. My Longest Common Substring function takes the 1st line and compares it with the other strings in sequence to find the longest common substring. Pseudo example: str1 & str2 -> "ACG"; str1 & str3 -> still "ACG" (because the LCS of str1 & str3 is "AG", which is shorter); str1 & str4 -> "GACT". Return "GACT". – gaurav shinde Jan 19 '20 at 05:00
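
The approach suggested in the comments above can be sketched as follows: iterate over the file one line at a time, skip every line that starts with `>`, and keep the first sequence separate from the rest. This is only a minimal sketch, assuming each sequence occupies a single line as in the sample file; `split_first_sequence` is a hypothetical helper name, not part of the original code.

```python
def split_first_sequence(lines):
    """Return (first_sequence, other_sequences), ignoring '>' header lines.

    `lines` can be any iterable of strings, e.g. an open file object.
    Assumes one sequence per line, as in the sample file.
    """
    # Lazy generator: nothing is read until a value is requested
    sequences = (
        line.strip()
        for line in lines
        if line.strip() and not line.startswith(">")
    )
    first = next(sequences, None)  # None if the input has no sequence lines
    return first, list(sequences)


# Demo with in-memory lines; with a real file, pass the file object instead:
#     with open("data.txt") as f:
#         first, others = split_first_sequence(f)
sample = [">ackg_2341\n", "ACGATACGACGACATCA\n",
          ">ackg_7865\n", "GCACTACGCAGAAACGAA\n"]
first, others = split_first_sequence(sample)
print(first)   # ACGATACGACGACATCA
print(others)  # ['GCACTACGCAGAAACGAA']
```

Because the header check happens inside the loop, there is no need to know in advance which line numbers hold headers.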

2 Answers

Simply do this:

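n = 1  # number of leading lines you want to skip
with open("data.txt", "r") as f:
    # readlines() loads the whole file into a list, which is then sliced
    for line in f.readlines()[n:]:
        if line.startswith('>'):
            pass  # header line: do whatever you want here
        else:
            print(line.rstrip())  # sequence line, without the trailing newline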
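
Note that `f.readlines()` loads the whole file into memory before slicing. If that matters, one possible alternative (a sketch, not this answer's original code) is `itertools.islice`, which skips the first `n` lines while still reading lazily. The helper name `lines_after` is hypothetical.

```python
from itertools import islice


def lines_after(lines, n):
    """Yield the non-header lines of `lines` after skipping the first n lines."""
    for line in islice(lines, n, None):
        if line.startswith(">"):
            continue  # header line: skip it
        yield line.rstrip("\n")


# Demo with in-memory lines; an open file object works the same way
sample = [">ackg_2341\n", "ACGATACGACGACATCA\n",
          ">ackg_7865\n", "GCACTACGCAGAAACGAA\n"]
print(list(lines_after(sample, 2)))  # ['GCACTACGCAGAAACGAA']
```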
Andy_101

The file format is commonly known as FASTA. It is used in molecular biology to store gene and protein sequences: lines beginning with > are the "headers" for each sequence, and the other lines hold the sequence data.

For such files, you may also need to merge sequence data that spans multiple lines. Below is a function that reads the file, separates headers from sequences, merges sequences that span multiple lines, and returns a list of all headers and a list of all sequences in the file. You can then loop through the two lists as you like.

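def readFasta(fasta_file):
    """Return [headers, sequences] parsed from a FASTA file."""
    headers, sequences = [], []
    with open(fasta_file, 'r') as fast:
        for line in fast:
            if line.startswith('>'):
                # Header line: drop the leading '>' and start a new sequence
                headers.append(line[1:].strip())
                sequences.append('')
            else:
                seq = line.strip()
                # Append to the current sequence; ignore blank lines and
                # any text that appears before the first header
                if seq and sequences:
                    sequences[-1] += seq
    return [headers, sequences]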

Example: Data in fasta.txt

>header1
ACGATACGACGACATCA
>header2
GCACTACGC
AGAAACGAA
>header3
ACGATCGA
ACGATTAC

Usage:

headers, seqdata = readFasta('fasta.txt')
for header, seq in zip(headers, seqdata):
    print(header)
    print(seq)
    print()

Output:

header1
ACGATACGACGACATCA

header2
GCACTACGCAGAAACGAA

header3
ACGATCGAACGATTAC
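
To connect this to the original question (keep the first sequence and compare it with every other one), the list of sequences returned by `readFasta` could be fed into something like the sketch below. It is only an illustration: the asker's `lcs_matrix` is not shown, so `longest_common_substring` and `best_match_with_first` are hypothetical stand-ins that use a standard dynamic-programming approach, and the list is assumed to be non-empty.

```python
def longest_common_substring(a, b):
    """Return one longest common substring of a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def best_match_with_first(sequences):
    """Longest substring that the first sequence shares with any later one."""
    first, rest = sequences[0], sequences[1:]
    candidates = (longest_common_substring(first, other) for other in rest)
    return max(candidates, key=len, default="")


# The three sequences from the example file above
seqs = ["ACGATACGACGACATCA", "GCACTACGCAGAAACGAA", "ACGATCGAACGATTAC"]
print(best_match_with_first(seqs))
```

Only two rows of the table are kept at a time, so memory use grows with the length of one sequence rather than with the product of both lengths.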
ProteinGuy