3

I have a 60 MB file with lots of lines.

Each line has the following format:

(x,y)

Each line will be parsed into a numpy array of shape (1, 2).

At the end it should all be concatenated into a big numpy array of shape (N, 2), where N is the number of lines.

What is the fastest way to do that? Right now it takes too much time (more than 30 minutes).

My Code:

points = None
with open(fname) as f:
    for line in f:
        point = parse_vector_string_to_array(line)
        if points is None:
            points = point
        else:
            points = np.vstack((points, point))

Where the parser is:

def parse_vector_string_to_array(string):
    x, y = eval(string)
    array = np.array([[x, y]])
    return array
member555
  • Have you looked at the actual [`numpy.loadtxt`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html) function? This is exactly what it was written for. – Cory Kramer Aug 20 '15 at 19:31
  • @CoryKramer what text format does it expect there? Should I change it? – member555 Aug 20 '15 at 19:33
  • 3
    Definitely do *not* do this: `points = np.vstack((points, point))`. That results in `points` being copied for every new line. Instead, make `points` a python list, and append to it. Don't convert it to a numpy array until you have finished reading the file. – Warren Weckesser Aug 20 '15 at 19:40
  • 3
    If you can change the format of the file, get rid of the parentheses. Those are unusual to have in a text file, and will require special processing. (Of course, if you have control over the format, and you care about performance, you should consider a binary format instead of text.) – Warren Weckesser Aug 20 '15 at 19:46
  • @WarrenWeckesser can you give me please more about binary format? – member555 Aug 20 '15 at 19:50
  • 1
    @member555: See the [Numpy documentation on input and output](http://docs.scipy.org/doc/numpy/reference/routines.io.html). The first block of routines deals with Numpy's custom binary format (.npy and .npz files), but there are also routines to read raw binary files. – Sven Marnach Aug 20 '15 at 19:54
  • 2
    @member555 [this question is much related](http://stackoverflow.com/a/26570772/832621), from where you can get some insight. The best way I found is to create a temporary array and populate it while you go through the file. – Saullo G. P. Castro Aug 20 '15 at 19:56
  • @SvenMarnach But to create such a file I need to create an np array... – member555 Aug 20 '15 at 20:38
  • 1
    @member555: No, you actually don't. Where does the data come from? It must be written by some other program. If that other program is written in Python, it could write the data in .npy format. If it's in a different programming language, you could write raw binary files, or use a more portable format like [netCDF](http://unidata.github.io/netcdf4-python/) or [HDF5](http://www.h5py.org/). – Sven Marnach Aug 20 '15 at 21:14
  • @SvenMarnach thank you i will check this out! – member555 Aug 20 '15 at 21:17
  • Just one more thing – 60 megabytes is a tiny amount of data for modern computers, and performance shouldn't be an issue. You can probably read the data in a second even in text format, if you don't use a quadratic-time approach as in the code you posted. – Sven Marnach Aug 20 '15 at 21:17
  • @SvenMarnach With the current code it's not seconds; after I switched to loadtxt it takes much less time. As I see it, I can create a .npy file only from an np array, which is not the case here. – member555 Aug 20 '15 at 21:59

1 Answer

2

One thing that would improve speed is to imitate genfromtxt and accumulate each line's values in a list of tuples, then do one np.array conversion at the end.

for example (roughly):

points = []
with open(fname) as f:
    for line in f:
        x, y = eval(line)
        points.append((x, y))
# one array conversion at the end, instead of a copy per line
result = np.array(points)

Since your file lines look like tuples I'll leave your eval parsing. We don't usually recommend eval, but in this limited case it might be the simplest.
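As the comments suggest, a safer drop-in for eval here is ast.literal_eval, which parses Python literals (like the tuple on each line) without executing arbitrary code. A minimal sketch, using an io.StringIO stand-in for the real file:

```python
import ast
import io

import numpy as np

# stand-in for open(fname); the real code would iterate over the file object
data = io.StringIO("(1.0,2.0)\n(3.0,4.0)\n")

# literal_eval parses each "(x,y)" tuple literal safely; trailing newlines are fine
points = np.array([ast.literal_eval(line) for line in data])
print(points.shape)  # (2, 2)
```

This keeps the same one-conversion-at-the-end structure, just with the safer parser.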

You could try to make genfromtxt read this, but the () on each line will give some headaches.

pandas is supposed to have a faster csv reader, but I don't know whether it can be configured to handle this format or not.
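If you do want to try the pandas route, one possible workaround (a sketch, not verified against the original data) is to let read_csv split on the comma, which leaves the parentheses stuck to the first and last columns, and strip them afterwards:

```python
import io

import numpy as np
import pandas as pd

# stand-in for the real file; each line is "(x,y)"
data = io.StringIO("(1.0,2.0)\n(3.0,4.0)\n")

# splitting on "," gives string columns like "(1.0" and "2.0)"
df = pd.read_csv(data, header=None, names=["x", "y"])
points = np.column_stack([
    df["x"].str.lstrip("(").astype(float),
    df["y"].str.rstrip(")").astype(float),
])
print(points.shape)  # (2, 2)
```

The column names "x" and "y" are just labels chosen here for clarity.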

hpaulj
  • If anything, use `ast.literal_eval()` – it's always better not to execute arbitrary code from an input file if we don't have to. – Sven Marnach Aug 21 '15 at 11:30