-1

I have a file with sequences like this:

>info
ATG
>info
GA
>info
TTAG
>info
ATTTT

I'd like to read this into a matrix:

matrix[0][0]=A , matrix[0][1]=T, matrix[0][2]=G
matrix[1][0]=G , matrix[1][1]=A
matrix[2][0]=T , matrix[2][1]=T, matrix[2][2]=A , matrix[2][3]=G
ETC...

Is this even possible in Python (pycharm), and if it is, how could I do that?

NEW code so far:

def read(sek):
listA=[]
regex = re.compile(r"[;>](?P<description>[^\n]*)\n(?P<sequence>[^;>]+)")
with open(sek, "r") as file:
     seq = regex.findall(file.read())
     for i, info in enumerate(seq):
        description, sequence = info
        for j < len(sequence):
            listA[i][j]= sequence
            j=j+1
        i=i+1
file.close()
return(listA)
read('sequence1.FASTA')

new error message: SyntaxError: invalid syntax

(The original file has description lines, but I already have a solution for that, so I didn't write it in this question.)

Martin Evans
  • 45,791
  • 17
  • 81
  • 97

3 Answers

0

You can use a list of lists:

c = []
c.append(list("ATG"))
c.append(list("GA"))
c.append(list("TTAG"))
print(c[2][1])

You can create the matrix from the file like this, skipping empty lines and the ">" header lines:

[list(x) for x in open('datafile').read().split("\n") if x and not x.startswith(">")]

[['A', 'T', 'G'], ['G', 'A'], ['T', 'T', 'A', 'G'], ['A', 'T', 'T', 'T', 'T']]

In your code, the body of the def block needs to be indented, just like the body of while, for, if, etc.

ergonaut
  • 6,929
  • 1
  • 17
  • 47
  • This is not helpful because the problem he's asking about has nothing to do with the actual parsing, he just has an indenterror. He can post a new question if his actual code has problems – en_Knight Oct 16 '15 at 15:14
  • ident thing is not the main problem – AmlesLausiv Oct 16 '15 at 15:33
0

The following would load your data from your text file, assuming every header line is followed by exactly one sequence line:

def read(sek):
    listA = []
    with open(sek, "r") as file:
        for line1 in file:
            listA.append(list(next(file).strip()))
    return listA

print(read('sequence1.FASTA'))

This would display the following output:

[['A', 'T', 'G'], ['G', 'A'], ['T', 'T', 'A', 'G'], ['A', 'T', 'T', 'T', 'T']]

Or if you prefer to use regular expressions, the following should also work:

import re

def read(sek):
    with open(sek, "r") as file:
        return [list(line) for line in re.findall(r'^([ATGC]+)', file.read(), re.M)]

Note: if the file is huge, the first version avoids loading the whole file into memory at once, but it could be slower.
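
FASTA files often wrap a sequence over several lines and may contain blank lines, which the two versions above do not handle. Here is a minimal sketch that streams the file line by line and starts a new row at each header. The function name `read_fasta_matrix` and the sample file written to a temporary directory are assumptions for illustration, not part of the question:

```python
import os
import tempfile


def read_fasta_matrix(path):
    """Return one list of characters per FASTA record (hypothetical helper)."""
    matrix = []
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith((">", ";")):
                matrix.append([])  # a header starts a new, empty row
            elif matrix:
                matrix[-1].extend(line)  # add the bases to the current row
    return matrix


# Usage with a small sample file (assumed content, mirroring the question;
# the third record is deliberately wrapped over two lines).
sample = ">info\nATG\n>info\nGA\n>info\nTT\nAG\n\n>info\nATTTT\n"
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sequence1.FASTA")
    with open(sample_path, "w") as out:
        out.write(sample)
    matrix = read_fasta_matrix(sample_path)

print(matrix[2][3])  # G
```

Like the first version, this never holds more than one line of the file in memory besides the result itself. A header with no sequence after it produces an empty row.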

Martin Evans
  • 45,791
  • 17
  • 81
  • 97
0
for j < len(sequence):

should be

while j < len(sequence):

That fixes your syntax error.
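
With only that change the function would still fail at run time: `j` is never initialised, `listA[i][j]` indexes into an empty list, and `re` is not imported. Here is a sketch of the question's function with those points addressed. It keeps the question's regular expression but replaces the inner loop with `list()`. The sample text and temporary file are assumptions for illustration:

```python
import os
import re
import tempfile


def read(sek):
    listA = []
    # A header line starting with ';' or '>', then everything up to the next header.
    regex = re.compile(r"[;>](?P<description>[^\n]*)\n(?P<sequence>[^;>]+)")
    with open(sek, "r") as file:  # the with block closes the file by itself
        for description, sequence in regex.findall(file.read()):
            # Drop newlines inside the sequence, then split it into characters.
            listA.append(list("".join(sequence.split())))
    return listA


# Usage with a sample file (assumed content, taken from the question's example).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sequence1.FASTA")
    with open(path, "w") as out:
        out.write(">info\nATG\n>info\nGA\n>info\nTTAG\n>info\nATTTT\n")
    matrix = read(path)

print(matrix[0])  # ['A', 'T', 'G']
```

Because `list()` turns a string into a list of its characters, no index variable is needed, and rows of different lengths are fine in a list of lists.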

C.B.
  • 8,096
  • 5
  • 20
  • 34