I have a bunch of large tab-delimited text files, with a format similar to:

a   0.0694892   0   0.0118814   0   -0.0275522  
b   0.0227414   -0.0608639  0.0811518   -0.15216    0.111584    
c   0   0.0146492   -0.103492   0.0827939   0.00631915

To count the number of columns I have always used:

>>> import numpy as np
>>> np.loadtxt('file.txt', dtype='str').shape[1]
6

However, this method is inefficient for bigger files, because the entire file is loaded into an array before the shape is read. Is there a simple method that is more efficient?

dwitvliet

2 Answers

You don't need numpy for this; just read one line, split it on tabs and find the length of the list:

with open('file.txt', 'r') as f:  # text mode, so the line is a str that can be split on '\t'
    line = next(f)  # read only the first line
    n = len(line.split('\t'))

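If you later want to load the entire array, rewind the file while it is still open (that is, inside the with block) and pass it to loadtxt. The dtype='str' argument is needed for the sample data because the first column is not numeric:

    f.seek(0)
    arr = np.loadtxt(f, dtype='str')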
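
As a rough sketch of the same idea wrapped in a reusable function: `count_columns` and its `delimiter` parameter are hypothetical names introduced here, not part of any library. It assumes the first line is representative of the whole file. With `delimiter=None` it splits on any run of whitespace, so a trailing tab does not add an empty column; with an explicit `'\t'` a trailing tab is counted as one extra (empty) field.

```python
import os
import tempfile


def count_columns(path, delimiter=None):
    """Count the fields on the first line of a text file (hypothetical helper)."""
    with open(path, 'r') as f:
        first_line = f.readline()  # only the first line is read from disk
    # Drop the line ending, then split; delimiter=None means "any whitespace".
    return len(first_line.rstrip('\r\n').split(delimiter))


# Small demo file; note the trailing tab at the end of each line.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("a\t0.0694892\t0\t0.0118814\t0\t-0.0275522\t\n")
    tmp.write("b\t0.0227414\t-0.0608639\t0.0811518\t-0.15216\t0.111584\t\n")
    demo_path = tmp.name

n_whitespace = count_columns(demo_path)   # whitespace splitting ignores the trailing tab
n_tabs = count_columns(demo_path, '\t')   # strict tab splitting counts an empty last field
os.remove(demo_path)

print(n_whitespace, n_tabs)
```

Which behaviour is wanted depends on how the file will be parsed later; whitespace splitting is the choice that lines up with loadtxt's default delimiter.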
unutbu

If you want to make sure you're using the exact same format as NumPy, the simplest solution is to feed it a wrapper around the first line.

If you look at the docs for loadtxt, the fname parameter can be:

File, filename, or generator to read.

In fact, it doesn't even really have to be a generator; any iterable works fine. Like, say, a list. So:

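 with open('file.txt', 'r') as f:
     lines = [f.readline()]
 # ndmin=2 keeps the result 2-D; without it a single row comes back 1-D and shape[1] raises IndexError
 np.loadtxt(lines, dtype='str', ndmin=2).shape[1]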

In other words, we just read the first line, stick it in a one-element list, and pass that to loadtxt, which parses it as if it were a one-line file.
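
A small self-contained sketch of this approach, using an in-memory `io.StringIO` as a stand-in for the real file (the sample rows are taken from the question; the variable names are only illustrative). It also shows why `ndmin=2` matters when a single line is parsed: by default loadtxt squeezes a one-row result down to one dimension.

```python
import io

import numpy as np

# Stand-in for open('file.txt'); any text file object behaves the same way here.
sample = io.StringIO(
    "a\t0.0694892\t0\t0.0118814\t0\t-0.0275522\n"
    "b\t0.0227414\t-0.0608639\t0.0811518\t-0.15216\t0.111584\n"
)

first_line = sample.readline()  # only one line is consumed

# Parse just that line; ndmin=2 forces a (1, n_columns) array.
header = np.loadtxt([first_line], dtype=str, ndmin=2)
n_columns = header.shape[1]

# Without ndmin=2 the single row is returned as a 1-D array,
# so the column count would be shape[0] instead of shape[1].
without_ndmin = np.loadtxt([first_line], dtype=str)

print(n_columns, header.shape, without_ndmin.shape)
```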

abarnert
  • Thank you, I did not know `loadtxt` accepted iterables too. This is exactly what I was looking for. – dwitvliet Jul 28 '14 at 21:06
  • @Banana: Well, until the docs bug is fixed, I guess this is technically relying on an implementation artifact rather than documented behavior. If you want to be paranoid, you can always wrap any iterable up in a generator, like `(line for line in lines)`, or a file-like object, like `io.BytesIO(lines)`. But practically, this works and is safe, and it's a lot easier to read. – abarnert Jul 28 '14 at 21:52