I have a 1505 MB text file containing float data. The file has about 73000 rows and 1500 columns. I would like to read the contents of the file into a NumPy array and then perform some analysis on it, but my machine has been getting slow when using numpy.loadtxt to read the file. What is the fastest way to read this file into an array using Python?


Dalek
- You say "getting slow". How slow are we talking here? And how much memory are you working with? – user2357112 Apr 05 '16 at 00:13
- Is it a sparse matrix? – ChrisP Apr 05 '16 at 00:13
- @user2357112 I have four CPUs on my machine and all four hit 100% usage, so basically I could not use my machine for anything else. – Dalek Apr 05 '16 at 00:15
- Check http://stackoverflow.com/questions/15096269/the-fastest-way-to-read-input-in-python (using pandas.read_csv with space as the separator). – Lauro Moura Apr 05 '16 at 00:17
- @ChrisP The file contains the probability distributions for around 73000 objects. I don't know how sparse it is. – Dalek Apr 05 '16 at 00:17
- Do you need to process it all at one time? – roadrunner66 Apr 05 '16 at 00:22
- @LauroMoura By using `pandas.read_csv`, I got this error: `pandas.parser.CParserError: Error tokenizing data. C error: out of memory`. – Dalek Apr 05 '16 at 00:25
- @Dalek See http://stackoverflow.com/questions/17557074/memory-error-when-using-pandas-read-csv. In short, there is a 2 GB limit for 32-bit processes on Windows, but I don't know if that applies here. There is also the option of telling pandas the column types in the CSV, which saves memory. – Lauro Moura Apr 05 '16 at 00:30
- 73000 rows by 1500 columns of 8-byte floats comes to about 835 MB, so you shouldn't run out of memory. It is evidently the parsing that causes problems. If everything else fails, you could try the old-fashioned hard way: iterate through each line yourself, split the line, cast the results, and store them in a pre-allocated NumPy array. (Addendum: as per Saullo's answer, which showed up just as I entered this comment.) – Apr 05 '16 at 00:30
2 Answers
You can also use the pandas reader, which is optimized. Timings from an IPython session (pylab-style imports, so `savetxt`, `rand` and `loadtxt` are the NumPy functions, `pd` is pandas, and `read_large_txt` is the function from the other answer):
In [3]: savetxt('data.txt',rand(10000,100))
In [4]: %time u=loadtxt('data.txt')
Wall time: 7.21 s
In [5]: %time u= read_large_txt('data.txt',' ')
Wall time: 3.45 s
In [6]: %time u=pd.read_csv('data.txt',' ',header=None).values
Wall time: 1.41 s
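
For the file in the question, a minimal sketch of that call might look like the following (the path, the whitespace separator, and the `comment='#'` option for a possible leading `#` line are assumptions, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Assumptions: whitespace-separated float columns, possibly a leading '#' line.
data = pd.read_csv(
    'data.txt',          # hypothetical path to the 73000 x 1500 file
    sep=r'\s+',          # whitespace-delimited columns
    header=None,         # the file itself has no header row
    comment='#',         # ignore anything after '#', e.g. a leading comment line
    dtype=np.float64,
).values                 # .values returns a plain NumPy array for the analysis step

print(data.shape)        # expected: (73000, 1500)
```

Passing an explicit `dtype` also skips the per-column type inference, which should help keep peak memory down on a file this size.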

B. M.
The following function allocates the right amount of memory needed to read a text file.
import numpy as np

def read_large_txt(path, delimiter=None, dtype=None):
    with open(path) as f:
        # First pass: count the rows, then peek at the first line
        # to get the number of columns.
        nrows = sum(1 for line in f)
        f.seek(0)
        ncols = len(next(f).split(delimiter))
        # Pre-allocate the full array, then fill it line by line.
        out = np.empty((nrows, ncols), dtype=dtype)
        f.seek(0)
        for i, line in enumerate(f):
            out[i] = line.split(delimiter)
    return out
It allocates the memory by knowing beforehand the number of rows, the number of columns, and the data type. You could easily add some extra arguments found in `np.loadtxt` or `np.genfromtxt`, such as `skiprows`, `usecols`, and so forth.
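
For example, a minimal sketch of a variant with a `skiprows` argument (the name mirrors `np.loadtxt`; this helper is only an illustration, not part of the original answer) that would also get past a leading `#` header line:

```python
import numpy as np

def read_large_txt_skip(path, delimiter=None, dtype=None, skiprows=0):
    """Like read_large_txt above, but ignores the first `skiprows` lines."""
    with open(path) as f:
        nrows = sum(1 for line in f) - skiprows
        f.seek(0)
        for _ in range(skiprows):      # consume the header lines once
            next(f)
        ncols = len(next(f).split(delimiter))
        out = np.empty((nrows, ncols), dtype=dtype)
        f.seek(0)
        for _ in range(skiprows):      # and again before the filling pass
            next(f)
        for i, line in enumerate(f):
            out[i] = line.split(delimiter)
    return out

# e.g. data = read_large_txt_skip('data.txt', delimiter=' ', dtype=np.float64, skiprows=1)
```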
Important: as @Evert observed, `out[i] = line.split(delimiter)` looks wrong, but NumPy converts the strings to `dtype` when assigning them to the array, so no additional type handling is needed here. There are some limits, though.
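
A toy illustration of that implicit conversion (not from the answer itself): assigning a list of numeric strings to a row of a float array casts element-wise, and raises a `ValueError` as soon as a token cannot be parsed as a number.

```python
import numpy as np

row = np.empty(3, dtype=np.float64)
row[:] = "1.5 2 3e-2".split()       # numeric strings are cast to float64 on assignment
print(row)                          # [1.5  2.   0.03]

try:
    row[:] = "1.5 abc 3".split()    # a non-numeric token makes the cast fail
except ValueError as err:
    print("cast failed:", err)
```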

Saullo G. P. Castro
- There's no cast to the datatype. `line.split` returns a list of strings, so you'll want to cast that to a 1D NumPy array of the right dtype first. – Apr 05 '16 at 00:32
- @Evert Believe me, it works. Probably NumPy is doing the conversion while assigning the values to the array. – Saullo G. P. Castro Apr 05 '16 at 00:34
- That, in a sense, scares the hell out of me: can it break, and when (under what conditions)? Is this behaviour documented somewhere? – Apr 05 '16 at 00:38
- @Evert [Here is some reference](http://docs.scipy.org/doc/numpy/user/basics.indexing.html#assigning-values-to-indexed-arrays). – Saullo G. P. Castro Apr 05 '16 at 00:40
- @SaulloCastro If the file has a first line starting with `#`, which my file does, your function will break. – Dalek Apr 05 '16 at 00:46
- @Dalek: it shouldn't be too hard to modify that function to ignore those lines, right? – Apr 05 '16 at 00:55
- @downvoter You should leave a comment to let us know why you don't agree with this approach. – Saullo G. P. Castro Apr 05 '16 at 09:28