numpy loadtxt takes so much time

Question

For various reasons I split my code into two parts: the first part is written in C and the second in Python. The C code writes its output to a file, which the Python code then reads as its input. My problem is that loading the file into a NumPy array takes about 18 minutes, which is far too long, and I need to reduce this time. The file is around 300 MB.

The code that writes the file looks like this:

struct point {
    float fpr;
    float tpr;
    point(float x, float y)
    {
        fpr = x;
        tpr = y;
    }
};
vector<point> current_points;
// filling current_points ......
ofstream files;
files.open("./allpoints.txt");
// write one tab-separated "fpr tpr" pair per line
for (unsigned int i = 0; i < current_points.size(); i++)
    files << current_points[i].fpr << '\t' << current_points[i].tpr << "\n";

And the Python code that reads the file looks like this:

with open("./allpoints.txt") as f:
    just_comb = numpy.loadtxt(f) #The problem is here (took 18 minutes)

allpoints.txt looks like this (as you can see, it holds the coordinates of points in 2D):

0.989703    1
0   0
0.0102975   0
0.0102975   0
1   1
0.989703    1
1   1
0   0
0.0102975   0
0.989703    1
0.979405    1
0   0
0.020595    0
0.020595    0
1   1
0.979405    1
1   1
0   0
0.020595    0
0.979405    1
0.969108    1
...
...
...
0   0
0.0308924   0
0.0308924   0
1   1
0.969108    1
1   1
0   0
0.0308924   0
0.969108    1
0.95881 1
0   0

My question is: is there a better way to store the vector of points in a file (something like a binary format) and read it into a 2D NumPy array in Python faster?

http://stackoverflow.com/questions/15096269/the-fastest-way-to-read-input-in-python/15097561#15097561 — Warren Weckesser, Mar 05 '15 at 14:40

score 3 · Answer 1 · answered Mar 05 '15 at 04:12

If you want a prebaked library solution, use HDF5. If you want something more bare-bones without dependencies, do this:

files.write(reinterpret_cast<char*>(current_points.data()),
    current_points.size() * sizeof(point));

This will give you a simple 2D array of floats written directly into the file. You can then read this file with numpy.fromfile().
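A minimal Python sketch of the read side, under some assumptions that are not stated in the answer: the file holds nothing but interleaved 32-bit floats (fpr, tpr, fpr, tpr, ...), the struct has no padding, the stream was opened in binary mode, and both programs run on machines with the same byte order. The file name `allpoints.bin` and the helper `read_points` are hypothetical. Since the C++ writer is not available here, the demo file is produced with NumPy itself to mimic that layout.

```python
import os
import tempfile

import numpy as np


def read_points(path):
    """Read interleaved float32 (fpr, tpr) pairs into an (N, 2) array."""
    # float32 matches a C/C++ `float`; omitting sep selects binary mode.
    flat = np.fromfile(path, dtype=np.float32)
    # Two values per point, so let NumPy infer the number of rows.
    return flat.reshape(-1, 2)


# Stand-in for the C++ output: raw float32 pairs with no header.
expected = np.array(
    [[0.989703, 1.0], [0.0, 0.0], [0.0102975, 0.0]], dtype=np.float32
)
demo_path = os.path.join(tempfile.mkdtemp(), "allpoints.bin")  # hypothetical name
expected.tofile(demo_path)

points = read_points(demo_path)
print(points.shape)  # (3, 2)
```

Note that the dtype has to be given explicitly: numpy.fromfile() defaults to 64-bit floats, which would misread a file of 32-bit values.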

John Zwinck

I am using HDF5 in my Python code, but the question is how I can write my output into it from C. – Am1rr3zA Mar 05 '15 at 04:13
http://www.hdfgroup.org/HDF5/examples/api18-c.html - the C API for HDF5 is not super easy to use, but it's not impossible either. – John Zwinck Mar 05 '15 at 04:14
I tried numpy.fromfile() but it gave me an error, since I think it needs a file descriptor. Then I changed it to numpy.fromfile(), yet I had a problem with an index out of range. Finally I changed it to np.fromfile(f), but it reads into a 1D NumPy array and the values are not correct. Any idea? – Am1rr3zA Mar 05 '15 at 04:34
Actually, I have changed it to np.fromfile(f, dtype=np.float32, sep="\t"), but my problem is still that it reads the data into a 1D NumPy array, and I need to make it 2D somehow. – Am1rr3zA Mar 05 '15 at 04:46
Just use `numpy.reshape` – Andrew Carter Mar 05 '15 at 05:05
@Am1rr3zA: as Andrew Carter says, use reshape, or else you can set a 2D dtype; either way amounts to the same thing. – John Zwinck Mar 05 '15 at 05:51
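The two options from these comments can be sketched side by side. This is only an illustration, assuming the data is a run of 32-bit float pairs; the field names `fpr` and `tpr` are borrowed from the struct in the question, and `point_dtype` is a hypothetical name. It works on an in-memory buffer with numpy.frombuffer() so that it is self-contained; numpy.fromfile() takes the same dtype argument.

```python
import numpy as np

# Raw bytes standing in for the binary file contents: two (fpr, tpr) pairs.
raw = np.array([0.989703, 1.0, 0.0, 0.0], dtype=np.float32).tobytes()

# Option 1: read a flat array of floats, then reshape to N rows of 2 columns.
as_2d = np.frombuffer(raw, dtype=np.float32).reshape(-1, 2)

# Option 2: a structured dtype that mirrors the struct, one record per point.
point_dtype = np.dtype([("fpr", np.float32), ("tpr", np.float32)])
records = np.frombuffer(raw, dtype=point_dtype)

print(as_2d.shape)    # (2, 2)
print(records.shape)  # (2,)
```

With the structured dtype the columns are addressed by name (`records["fpr"]`), while the reshaped array uses plain indexing (`as_2d[:, 0]`); both views describe the same bytes.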

score 1 · Accepted Answer · answered Mar 05 '15 at 04:48

Have you tried numpy.fromfile?

>>> import numpy
>>> data = numpy.fromfile('./allpoints.txt', dtype=float, count=-1, sep=' ')
>>> data = numpy.reshape(data, (len(data) // 2, 2))  # integer division: the shape must be ints
>>> print(data[0:10])
[[ 0.989703   1.       ]
 [ 0.         0.       ]
 [ 0.0102975  0.       ]
 [ 0.0102975  0.       ]
 [ 1.         1.       ]
 [ 0.989703   1.       ]
 [ 1.         1.       ]
 [ 0.         0.       ]
 [ 0.0102975  0.       ]
 [ 0.989703   1.       ]]

This took 20 seconds for me with a 300 MB input file.

Andrew Carter

Actually, right now I am working with numpy.fromfile, yet it still takes 2 minutes for me. Do you have any other suggestion to speed up reading the file? I need to run this code over thousands of different files, and I want to be able to get the result in a second! – Am1rr3zA Mar 05 '15 at 04:58
Basically, write the data as binary. As a quick test, I wrote data out as .npy file (which was 50% bigger, to my surprise) and I got a less than 1 second read time. You can try to get the C to write out binary that fromfile can read or use this library https://github.com/rogersce/cnpy or follow this format https://github.com/numpy/numpy/blob/master/doc/neps/npy-format.rst or use HDF5 as John suggests. Or write your front end in python and then you can use `numpy.save` – Andrew Carter Mar 05 '15 at 05:22
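The last suggestion in that comment (produce the data in Python and store it with `numpy.save`) might look like the sketch below. It only shows the round trip through the .npy format; the file name `allpoints.npy` and the sample values are made up for illustration, and no timing is claimed here.

```python
import os
import tempfile

import numpy as np

# A few sample points standing in for the real data.
points = np.array(
    [[0.989703, 1.0], [0.0, 0.0], [0.0102975, 0.0]], dtype=np.float32
)

npy_path = os.path.join(tempfile.mkdtemp(), "allpoints.npy")  # hypothetical name

# np.save writes a small header (dtype, shape, byte order) followed by raw data.
np.save(npy_path, points)

# np.load restores the array with its original shape and dtype, no reshape needed.
loaded = np.load(npy_path)
print(loaded.shape, loaded.dtype)  # (3, 2) float32
```

Because the header records shape and dtype, the reader does not have to know the layout in advance, unlike a headerless file read with numpy.fromfile.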
