
General problem: I am trying to transpose a large numpy matrix using matrix.T. It works well with a small test file. However, with the big file only the first 3 and the last 3 lines are shown; the lines in between (~250,000 in total) are not transposed and are printed as '...'. In addition, only the first and last 3 nucleotides per line are displayed. The output looks like this:

[['C' 'T' 'C' ..., 'A' 'C' 'T']

['C' 'T' 'A' ..., 'A' 'T' 'G']

['C' 'T' 'A' ..., 'G' 'C' 'A']

...,

['T' 'A' 'A' ..., 'G' 'A' 'T']

['T' 'A' 'A' ..., 'C' 'G' 'T']

['C' 'G' 'T' ..., 'A' 'A' 'G']]

This is my code:

import numpy as np
with open("temp1.txt", "rt") as infile:
    matrix = np.matrix([list(line.strip()) for line in infile])
x = matrix.T
with open("temp2.txt", "w") as file_temp2:
    file_temp2.write(str(x))  # str() of a large matrix is abbreviated with '...'

Explanation: 1. temp1.txt contains ~250,000 DNA sequences, each 100 nucleotides long (A, C, T and G). Each sequence is on its own line, terminated by "\n". The first lines look like this:

CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCATGAATCCCTAAATACCTAATTC
TTTATGTTTGGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAATTATTTAGTTGTAGGGATGAAG
CAAAGTTCTTCCGCCTGATTAATTATCCATTTTACCTTTGTCGTAGATATTAGGTAATCTGTAAGTCAACTCATATACAACTCATAATTTAAAATAAAAT
AAAAAAGTTGTAATTATTAATGATAGTTCTGTGATTCCTCCATGAATCACATCTGCTTGATTTTTCTTTCATAAATTTATAAGTAATACATTCTTATAAA
TATATGGAAGATGTGAATGAAGTTTTGGTCCTGAATGTGGCCAAGGTTCCGTCATTTGGAGATACGAAATCAAATCTCCTTTAAGATTTTGTTTTTATAA

and so on

2. temp1.txt is converted into a numpy matrix and then transposed, which works fine with a test file (containing only 10 sequences). With the big file, however, the general problem described above occurs when transposing.

?Solution?: Do you have an idea how to get the complete transposed matrix of the big file so that it can be written to my temp2.txt for further analysis?


!!!Solution found: I found that I have to convert the matrix into a list before saving: I have to do y = np.array(x)[0:].tolist() first before writing to the file. Now it works. The code is now:

import numpy as np
with open("temp1.txt", "rt") as infile:
    matrix = np.matrix([list(line.strip()) for line in infile])
x = matrix.T
y = np.array(x)[0:].tolist()  # str() of a plain list is never abbreviated
z = str(y).replace("], [", "\n")
with open("temp2.txt", "w") as file_temp2:
    file_temp2.write(z)
saanasum

2 Answers

2

Your question is valid: consider

import numpy as np

x = np.asmatrix(np.arange(10))   #already np.arange behaves like this
y = np.asmatrix(np.arange(10000))

In [361]: str(x)
Out[361]: '[[0 1 2 3 4 5 6 7 8 9]]'

In [362]: str(y)
Out[362]: '[[   0    1    2 ..., 9997 9998 9999]]'

What's worse, the same behaviour is encountered with the numpy-specific method numpy.array_str():

In [379]: np.array_str(np.asarray(x))
Out[379]: '[[0 1 2 3 4 5 6 7 8 9]]'

In [380]: np.array_str(np.asarray(y))
Out[380]: '[[   0    1    2 ..., 9997 9998 9999]]'
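As far as I can tell, the abbreviation comes from NumPy's print options: arrays with more elements than the print threshold (1000 by default) are summarized. The sketch below is only an illustration, and it reuses the numeric array from above rather than the DNA data; it shows that passing a very large threshold to np.array2string() gives the full text:

```python
import sys
import numpy as np

y = np.arange(10000)

# Default printing summarizes arrays that exceed the print threshold.
short = str(y)

# threshold=sys.maxsize asks array2string never to summarize.
full = np.array2string(y, threshold=sys.maxsize)

print(len(short), len(full))
```

This only changes how the array is turned into text; for writing large data to a file, a dedicated output routine is probably still the better choice.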

I suggest looking at numpy.ndarray.tofile():

In [381]: x.tofile("out.txt",sep=" ")

In [382]: y.tofile("out2.txt",sep=" ")

You can use it to output your strings in your desired format. The resulting files contain the (in my case, numeric) arrays as plain text:

$ wc out*.txt 
    0 10000 48889 out2.txt
    0    10    19 out.txt

The second column of the wc output above shows that out.txt contains 10 words and out2.txt contains 10000, as they should. A visual inspection confirms that the result is OK.
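For the character data in the question, np.savetxt() might be a closer fit, because it writes one line per row. This is a minimal sketch under my own assumptions: the three short sequences and the output file name are made up for the demonstration, and fmt="%s" with delimiter="" is just one way to keep the characters adjacent:

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-in for the sequences read from temp1.txt.
sequences = ["ACGT", "CCGA", "TTGA"]

# One character per cell: shape (number of sequences, sequence length).
chars = np.array([list(s) for s in sequences])

# Hypothetical output path in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "transposed_demo.txt")

# Each row of chars.T becomes one line; delimiter="" joins characters directly.
np.savetxt(out_path, chars.T, fmt="%s", delimiter="")

with open(out_path) as f:
    lines = f.read().splitlines()
print(lines)
```

Each output line then holds the nucleotides found at one position across all sequences.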

  • Thanks so much. I will try that as well. However, while you were writing this answer I found a solution. See "!!!Solution found" at the end of my question. Nevertheless, thank you! – saanasum Jan 02 '16 at 22:47
  • @saanasum yeah, thanks, I've seen it:) Your solution works because only NumPy arrays are abbreviated this way, whereas `str(x)` for a *list* `x` includes every element. I think you could/should avoid the whole `np.matrix` business, and directly read into a list of your liking. Unless, of course, you need a matrix for intermediate operations. – Andras Deak -- Слава Україні Jan 02 '16 at 22:51
  • @Andras Deak: The main goal is to determine the number of each nucleotide (A, C, T or G) at each position across all sequences. So in the end there are 100 positions in 250,000 sequences. The idea was to generate the matrix in order to transpose it. The transposed matrix can then be turned into a list or string to count the number of each nucleotide in each line (which is the same as counting in each column of the input file, temp1.txt). – saanasum Jan 02 '16 at 22:58
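
If the per-position counts are the only goal, the transposed data may not need to be written to a file at all. One possible sketch, using three made-up sequences in place of temp1.txt and collections.Counter for the counting (both are my assumptions, not something from the question):

```python
from collections import Counter

# Hypothetical stand-in for the lines of temp1.txt.
sequences = ["ACGT", "CCGA", "TTGA"]

# zip(*sequences) yields one tuple per position (i.e. per column).
counts = [Counter(column) for column in zip(*sequences)]

for position, c in enumerate(counts):
    # A Counter returns 0 for nucleotides that do not occur at a position.
    print(position, {base: c[base] for base in "ACGT"})
```

Here counts[i] holds the nucleotide counts for position i across all sequences.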
0

If your problem description is complete, you could try something like this:

result = []
with open("c:/temp/temp.txt", "r") as fin:
    for line in fin:
        result.append(tuple(line.strip()))  # break into characters

with open("c:/temp/temp2.txt", "w") as fout:
    for line in zip(*result):  # transpose
        fout.write("".join(line))  # join characters as string
        fout.write("\n")
Alan
  • WOW! That's great. Thank you! The transposition works. So numpy is not necessary. However, the "\n" after each DNA sequence is missing in temp2.txt. But this might be a tiny problem. – saanasum Jan 02 '16 at 23:18
  • Maybe you lost the indent on the last line (above)? – Alan Jan 03 '16 at 14:42
  • @Alan: The indent is present. However, I got it working by doing this: `for line in zip(*result): line = "".join(line); line = line+"\n"; fout.write(str(line))` – saanasum Jan 03 '16 at 17:57