numpy matrix is not completely transposed

Question

General problem: I am trying to transpose a large numpy matrix using matrix.T. It works well with a small test file. However, with the big file only the first 3 and the last 3 lines are transposed; the lines in between (~250,000 in total) are not transposed and are printed as '...'. In addition, only the first and last 3 nucleotides per line are displayed. The result looks like this:

[['C' 'T' 'C' ..., 'A' 'C' 'T']
 ['C' 'T' 'A' ..., 'A' 'T' 'G']
 ['C' 'T' 'A' ..., 'G' 'C' 'A']
 ...,
 ['T' 'A' 'A' ..., 'G' 'A' 'T']
 ['T' 'A' 'A' ..., 'C' 'G' 'T']
 ['C' 'G' 'T' ..., 'A' 'A' 'G']]

This is my code:

import numpy as np
with open("temp1.txt","rt") as infile:
   matrix = np.matrix([list(line.strip()) for line in infile.readlines()])
   x = matrix.T
   file_temp2.write(str(x))

Explanation: 1. temp1.txt contains ~250,000 DNA sequences, each 100 nucleotides long (A, C, T and G). Each line ends with "\n" after the 100 nucleotides. The first lines look like this:

CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCATGAATCCCTAAATACCTAATTC
TTTATGTTTGGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAATTATTTAGTTGTAGGGATGAAG
CAAAGTTCTTCCGCCTGATTAATTATCCATTTTACCTTTGTCGTAGATATTAGGTAATCTGTAAGTCAACTCATATACAACTCATAATTTAAAATAAAAT
AAAAAAGTTGTAATTATTAATGATAGTTCTGTGATTCCTCCATGAATCACATCTGCTTGATTTTTCTTTCATAAATTTATAAGTAATACATTCTTATAAA
TATATGGAAGATGTGAATGAAGTTTTGGTCCTGAATGTGGCCAAGGTTCCGTCATTTGGAGATACGAAATCAAATCTCCTTTAAGATTTTGTTTTTATAA

and so on

2. The temp1.txt is converted into the numpy matrix and finally transposed, which works fine using a test-file (containing only 10 sequences). However, in the big file the above mentioned general problem occurs when transposing.

?Solution?: Do you have an idea how to get the complete transposed matrix of the big file, so that it can be written to my temp2.txt for further analysis?

!!!Solution found: Finally, I found that I have to convert the matrix into a list before saving. I have to do y = np.array(x)[0:].tolist() first before writing into the file. Now it is working. The code now is:

import numpy as np
with open("temp1.txt","rt") as infile:
   matrix = np.matrix([list(line.strip()) for line in infile.readlines()])
   x = matrix.T
   y = np.array(x)[0:].tolist()
   z = str(y).replace("], [", "\n")
   file_temp2.write(str(z))

I think it does transpose, but the 3 dots are just presentation (to prevent printing a huge matrix on screen) — itai, Jan 02 '16 at 21:55
@itai: Thanks, but I forgot to mention that this is not what is displayed on the screen. That's the content of the temp2.txt file. The temp2.txt is only 2KB in size, which also shows that there is nothing more inside. — saanasum, Jan 02 '16 at 22:00
@ Alan: a huge 250,000 X 100 matrix (is that possible?). Using the test file it works (a matrix 10 X 100 is generated). — saanasum, Jan 02 '16 at 22:03
what happens if you do not convert it to a matrix but rather to an array ? — Moritz, Jan 02 '16 at 22:47
You'd have this problem whether you are trying to do the transpose or not. It's not about transpose. It's about how `numpy` displays (`str()`) a large array. At some size threshold it starts to use `...`. `str` is meant for convenient display during an interactive session, not for writing the whole array to a file. — hpaulj, Jan 02 '16 at 22:58
See http://stackoverflow.com/questions/1987694/print-the-full-numpy-array — hpaulj, Jan 02 '16 at 23:01
@hpaulj great link, although I wouldn't have guessed that the automatic printing and `str()/array_str()` are related. — Andras Deak -- Слава Україні, Jan 02 '16 at 23:08
Next time you find a solution yourself, you can answer your own question — Eric, Jan 02 '16 at 23:49
@Eric: Yes. But in this case it needed some help e.g. comment 1 by itai ... — saanasum, Jan 03 '16 at 16:49
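
The truncation described in the comments above can be reproduced, and switched off, without converting to a list. This is a minimal sketch that assumes a small stand-in array rather than the real 250,000 x 100 data; the variable names are illustrative only.

```python
import sys
import numpy as np

# Stand-in data: 2,000 rows of 4 characters, i.e. 8,000 elements, which is
# above NumPy's default summarization threshold of 1,000 elements.
chars = np.array([list("ACGT")] * 2000)

# str() summarizes large arrays and replaces the middle with "...".
summarized = str(chars)

# Raising the threshold makes array2string emit every element.
full = np.array2string(chars, threshold=sys.maxsize)

print("..." in summarized)  # True
print("..." in full)        # False
```

Even with the threshold raised, the result still carries brackets and quotes, so it is better suited to inspection than to producing a clean data file.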

Andras Deak -- Слава Україні · Answer 1 · 2016-01-02T22:44:46.100

2

Your question is valid: consider

import numpy as np

x = np.asmatrix(np.arange(10))   #already np.arange behaves like this
y = np.asmatrix(np.arange(10000))

In [361]: str(x)
Out[361]: '[[0 1 2 3 4 5 6 7 8 9]]'

In [362]: str(y)
Out[362]: '[[   0    1    2 ..., 9997 9998 9999]]'

What's worse, the same behaviour is encountered with the numpy-specific function numpy.array_str():

In [379]: np.array_str(np.asarray(x))
Out[379]: '[[0 1 2 3 4 5 6 7 8 9]]'

In [380]: np.array_str(np.asarray(y))
Out[380]: '[[   0    1    2 ..., 9997 9998 9999]]'

I suggest looking at the array method numpy.ndarray.tofile():

In [381]: x.tofile("out.txt",sep=" ")

In [382]: y.tofile("out2.txt",sep=" ")

You can use it to output your strings in your desired format. The resulting files contain the (in my case, numeric) arrays as plain text:

$ wc out*.txt 
    0 10000 48889 out2.txt
    0    10    19 out.txt

The second column of the above output of the bash command wc shows that out.txt contains 10 words and out2.txt contains 10000, as they should. A visual inspection verifies that the result is OK.
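
For the character data in the question, another way to sidestep `str()` is to write each transposed row yourself. The following is a rough sketch that assumes all sequences have the same length; `write_transposed` is a hypothetical helper (not a NumPy function), and an in-memory buffer stands in for temp2.txt.

```python
import io
import numpy as np

def write_transposed(sequences, out):
    """Write the transpose of equal-length sequences, one position per line."""
    arr = np.array([list(seq) for seq in sequences])  # shape: (n_seqs, seq_len)
    for row in arr.T:                                 # one row per sequence position
        out.write("".join(row) + "\n")

# Toy input standing in for the lines of temp1.txt.
buf = io.StringIO()
write_transposed(["CCCT", "TTTA", "CAAA"], buf)
print(buf.getvalue())
```

Passing a real file object opened with `open("temp2.txt", "w")` instead of the buffer should work the same way, since only `write()` is used.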

edited Jan 02 '16 at 22:44

answered Jan 02 '16 at 22:37


Thanks so much. I will also try that. However, while you were writing this post I found a solution. See "!!!Solution found" at the end of my question. Nevertheless, thank you! – saanasum Jan 02 '16 at 22:47
@saanasum yeah, thanks, I've seen it:) Your solution is mostly based on the fact that only `np.array`s produce this problem, but using `str(x)` with a *list* `x` contains every element. I think you could/should avoid the whole `np.matrix` business, and directly read into a list of your liking. Unless, of course, you need a matrix for intermediate operations. – Andras Deak -- Слава Україні Jan 02 '16 at 22:51
@ Andras Deak: The main goal is to determine the number of each nucleotide (A, C, T or G) at each position across all sequences. So finally, there are 100 positions in 250,000 sequences. The idea was to generate the matrix to be able to transpose it. The transposed matrix can be transformed into a list or string afterwards to count the number of each nucleotide in each line (which is the same as counting in each column of the input file (temp1.txt)). – saanasum Jan 02 '16 at 22:58
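
If the per-position counts are the real goal, they can be computed directly on the array, without writing a transposed file first. This is only a sketch under the assumption that all sequences have equal length and contain just A, C, G and T; the toy `sequences` list stands in for the contents of temp1.txt.

```python
import numpy as np

sequences = ["CCCT", "TTTA", "CAAA"]  # toy stand-in for temp1.txt
arr = np.array([list(seq) for seq in sequences])  # shape: (n_seqs, seq_len)

# Comparing against one letter gives a boolean array; summing over axis 0
# counts the matches in each column, i.e. at each sequence position.
counts = {nuc: (arr == nuc).sum(axis=0) for nuc in "ACGT"}

for nuc, per_position in counts.items():
    print(nuc, per_position.tolist())
# A [0, 1, 1, 2]
# C [2, 1, 1, 0]
# G [0, 0, 0, 0]
# T [1, 1, 1, 1]
```

Summing along `axis=0` plays the role of the transpose here: each entry of a count vector corresponds to one of the sequence positions.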

score 0 · Accepted Answer · answered Jan 02 '16 at 23:08


If your problem description is complete, you could try something like this:

result = []
# Context managers close both files, so all output is flushed to disk.
with open("c:/temp/temp.txt", "r") as fin:
    for line in fin:
        result.append(tuple(line.strip()))  # break into characters

with open("c:/temp/temp2.txt", "w") as fout:
    for line in zip(*result):  # transpose
        fout.write("".join(line))  # join characters as string
        fout.write("\n")

answered Jan 02 '16 at 23:08

Alan


WOW! That's great. Thank you! The transposition works. So numpy is not necessary. However, the "\n" after each DNA sequence is missing in temp2.txt. But this might be a tiny problem. – saanasum Jan 02 '16 at 23:18
Maybe you lost the indent on the last line (above)? – Alan Jan 03 '16 at 14:42
@ Alan: Indent is present. However, I got it when doing this: for line in zip(*result): #transpose line = "".join(line) #join characters as string line = line+"\n" fout.write(str(line)) – saanasum Jan 03 '16 at 17:57
