Efficiently counting the number of columns in a text file

Question

I have a bunch of large tab-delimited text files, with a format similar to:

a   0.0694892   0   0.0118814   0   -0.0275522  
b   0.0227414   -0.0608639  0.0811518   -0.15216    0.111584    
c   0   0.0146492   -0.103492   0.0827939   0.00631915

To count the number of columns I have always used:

>>> import numpy as np
>>> np.loadtxt('file.txt', dtype='str').shape[1]
6

However, this method is inefficient for larger files, because the entire file is loaded into an array just to read its shape. Is there a simple method that is more efficient?

Traditional `open` will let you load a line at a time; see e.g. http://stackoverflow.com/q/6475328/3001761 — jonrsharpe, Jul 28 '14 at 20:57

score 3 · Answer 1 · answered Jul 28 '14 at 20:57

You don't need numpy for this; just read one line, split it on tabs and find the length of the list:

with open('file.txt', 'r') as f:  # text mode, so each line is a str on Python 3
    line = next(f)  # read only the first line
    n = len(line.split('\t'))  # number of tab-separated fields

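If you later wish to load the entire array, rewind the file while it is still open (that is, inside the `with` block) and pass it to NumPy:

f.seek(0)  # go back to the start of the file
arr = np.loadtxt(f, dtype='str')  # dtype='str' because the first column holds labels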
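
As a minimal sketch of the first-line approach above, the snippet below wraps it in a helper. The name `count_columns`, the temporary sample file, and the choice to ignore a trailing tab at the end of the line are illustrative assumptions, not part of the original answer.

```python
import os
import tempfile


def count_columns(path, delimiter="\t"):
    """Return the number of delimiter-separated fields on the first line."""
    with open(path, "r") as f:
        first_line = f.readline()  # reads one line only, not the whole file
    if not first_line:
        return 0  # empty file: no columns
    # Assumption: a trailing delimiter does not mark an extra (empty) column,
    # so it is stripped together with the line ending before splitting.
    stripped = first_line.rstrip("\r\n").rstrip(delimiter)
    return len(stripped.split(delimiter))


# Hypothetical sample file standing in for 'file.txt'.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\t0.0694892\t0\t0.0118814\t0\t-0.0275522\t\n")
    tmp.write("b\t0.0227414\t-0.0608639\t0.0811518\t-0.15216\t0.111584\t\n")
    sample_path = tmp.name

n_columns = count_columns(sample_path)
os.remove(sample_path)
print(n_columns)  # prints 6
```

Only the first line is read, so the cost does not depend on the file size. If a trailing tab should count as an empty column in your data, drop the second `rstrip` call.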

answered Jul 28 '14 at 20:57

unutbu

Thank you, simple and powerful answer! – dwitvliet Jul 28 '14 at 21:03

score 2 · Accepted Answer · answered Jul 28 '14 at 20:58

If you want to make sure you're using the exact same format as NumPy, the simplest solution is to feed it a wrapper around the first line.

If you look at the docs for loadtxt, the fname parameter can be:

File, filename, or generator to read.

In fact, it doesn't even really have to be a generator; any iterable works fine. Like, say, a list. So:

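with open('file.txt', 'rb') as f:
    lines = [f.readline()]  # one-element list holding only the first line
# ndmin=2 keeps the single row two-dimensional, so the shape is (1, n_columns)
np.loadtxt(lines, dtype='str', ndmin=2).shape[1]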

In other words, we just read the first line, stick it in a one-element list, and pass that to loadtxt, which parses it as if it were a one-line file.
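
To illustrate that `loadtxt` handles a list, a generator, and a file-like object the same way, here is a small sketch. The helper `n_columns` and the inline sample line are hypothetical. `ndmin=2` is passed so that a single row comes back with shape (1, n) rather than being squeezed to one dimension.

```python
import io

import numpy as np

# Hypothetical sample standing in for the first line of 'file.txt'.
first_line = "a\t0.0694892\t0\t0.0118814\t0\t-0.0275522\n"


def n_columns(source):
    """Parse `source` with loadtxt and return the number of columns."""
    # ndmin=2 guarantees a 2-D result even when there is only one row.
    return np.loadtxt(source, dtype=str, ndmin=2).shape[1]


from_list = n_columns([first_line])                        # plain list
from_generator = n_columns(line for line in [first_line])  # generator
from_filelike = n_columns(io.StringIO(first_line))         # file-like object

print(from_list, from_generator, from_filelike)
```

All three calls should report the same column count, since `loadtxt` only needs something it can iterate over line by line.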

answered Jul 28 '14 at 20:58

abarnert

Thank you, I did not know `loadtxt` accepted iterables too. This is exactly what I was looking for. – dwitvliet Jul 28 '14 at 21:06

@Banana: Well, until the docs bug is fixed, I guess this is technically relying on an implementation artifact rather than documented behavior. If you want to be paranoid, you can always wrap any iterable up in a generator, like `(line for line in lines)`, or a file-like object, like `io.BytesIO(b''.join(lines))`. But practically, this works and is safe, and it's a lot easier to read. – abarnert Jul 28 '14 at 21:52
