4

I have a data file of almost 9 million lines (soon to be more than 500 million lines) and I'm looking for the fastest way to read it in. The five aligned columns are padded and separated by spaces, so I know where on each line to look for the two fields that I want. My Python routine takes 45 secs:

import sys,time

start = time.time()
filename = 'test.txt'    # space-delimited, aligned columns
trans=[]
numax=0
for line in open(filename,'r'):
    nu=float(line[-23:-11]); S=float(line[-10:-1])
    if nu>numax: numax=nu
    trans.append((nu,S))
end=time.time()
print len(trans),'transitions read in %.1f secs' % (end-start)
print 'numax =',numax

whereas the routine I've come up with in C is a more pleasing 4 secs:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BPL 47
#define FILENAME "test.txt"
#define NTRANS 8858226

int main(void) {
  size_t num;
  unsigned long i;
  char buf[BPL];
  char* sp;
  double *nu, *S;
  double numax;
  FILE *fp;
  time_t start,end;

  nu = (double *)malloc(NTRANS * sizeof(double));
  S = (double *)malloc(NTRANS * sizeof(double));

  start = time(NULL);
  if ((fp=fopen(FILENAME,"rb"))!=NULL) {
    i=0;
    numax=0.;
    do {
      if (i==NTRANS) {break;}
      num = fread(buf, 1, BPL, fp);
      buf[BPL-1]='\0';
      sp = &buf[BPL-10]; S[i] = atof(sp);
      buf[BPL-11]='\0';
      sp = &buf[BPL-23]; nu[i] = atof(sp);
      if (nu[i]>numax) {numax=nu[i];}
      ++i;
    } while (num == BPL);
    fclose(fp);
    end = time(NULL);
    fprintf(stdout, "%d lines read; numax = %12.6f\n", (int)i, numax);
    fprintf(stdout, "that took %.1f secs\n", difftime(end,start));
  } else {
    fprintf(stderr, "Error opening file %s\n", FILENAME);
    free(nu); free(S);
    return EXIT_FAILURE;
  }

  free(nu); free(S);
  return EXIT_SUCCESS;
}

Solutions in Fortran, C++ and Java take intermediate amounts of time (27 secs, 20 secs, 8 secs). My question is: have I made any outrageous blunders in the above (particularly the C code)? And is there any way to speed up the Python routine? I quickly realised that storing my data in an array of tuples was better than instantiating a class for each entry.

Chris
  • Please explain the magic numbers in your C code (Where do you come up with 47 and 8858226?) – mikerobi Sep 23 '10 at 13:57
  • You should run the profiler on your Python code to see where it's slow. Also you should try to follow the PEP 8 style conventions for Python; they make it much, much easier to read. – Daenyth Sep 23 '10 at 13:58
  • For Python: http://stackoverflow.com/questions/1896674/python-how-to-read-huge-text-file-into-memory – karlphillip Sep 23 '10 at 14:01
  • Sorry - BPL=47 is the number of bytes per line, including the \n EOL character; 8858226 is the total number of lines in the file - so I know how much memory I'll need to store the data in. – Chris Sep 23 '10 at 14:02
  • On the Python side, I guess you could speed it up perceptibly by returning an iterator instead of building an array. – Javier Sep 23 '10 at 14:03
  • In the C code, you could `fread` for more than one line at a time. It may or may not actually help on any given system: depends on buffering and on the cost of a system call. – Steve Jessop Sep 23 '10 at 14:05
  • Speaking of outrageous blunders, though, you should check the return value of `fread` before using the results, rather than before repeating the loop :-) – Steve Jessop Sep 23 '10 at 14:09
  • I would use a generator to yield (nu, S) pairs instead of initialize+append, but this probably won't give you more speed. – tokland Sep 23 '10 at 14:10
  • Thanks for your answers - I've got a factor of 2 back already on the Python code by loading them into a numpy array: trans = np.zeros(NTRANS, dtype=[('nu', np.double), ('S', np.double)]) – Chris Sep 23 '10 at 14:42
  • You are creating 27 million objects here: 9,000,000 × (one tuple + two floats) = 27,000,000. It is pushing the envelope for Python in terms of performance. It will be fruitful to try numpy or John Machin's array suggestion. – Wai Yip Tung Sep 23 '10 at 16:51
  • Python is very slow at input and output. Do profiling, but IME you'll likely find that this is your bottleneck. The only way around this is to use a different language (like C, as it happens). – Alex Reynolds Jul 12 '13 at 22:15

5 Answers

4

Some points:

  1. Your C routine is cheating; it is being tipped off with the filesize, and is pre-allocating ...

  2. Python: consider using array.array('d') ... one each for S and nu. Then try pre-allocation (see the sketch after this list).

  3. Python: write your routine as a function and call it -- accessing function-local variables is rather faster than accessing module-global variables.
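
As an illustration of points 2 and 3 together, here is a minimal sketch (not John Machin's own code) that pre-allocates two array.array('d') buffers and fills them inside a function, assuming the same 47-byte fixed-width lines and field positions as the question's code; the name read_transitions is made up:

from array import array

NTRANS = 8858226          # total number of lines, as in the C version

def read_transitions(filename):
    # Pre-allocate NTRANS doubles per column (point 2); all hot-loop
    # names are function locals (point 3).
    nu = array('d', [0.0]) * NTRANS
    S = array('d', [0.0]) * NTRANS
    numax = 0.0
    i = 0
    for line in open(filename, 'r'):   # assumes exactly NTRANS lines
        x = float(line[-23:-11])
        nu[i] = x
        S[i] = float(line[-10:-1])
        if x > numax:
            numax = x
        i += 1
    return nu, S, numax

nu, S, numax = read_transitions('test.txt')
print('numax = %.6f' % numax)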

John Machin
  • Thanks - I reckon your point 2 is why reading into pre-allocated numpy arrays got me down to 20 secs. Calling it as a function gets us down to 17.3 secs. Can anyone give any pointers on how to implement the iterator method? – Chris Sep 23 '10 at 14:50
3

In the C implementation, you could try swapping the fopen()/fread()/fclose() library functions for the lower-level system calls open()/read()/close(). A speedup may come from the fact that fread() does a lot of buffering, whereas read() does not.

Additionally, calling read() less often with bigger chunks reduces the number of system calls, and therefore the amount of switching between userspace and kernelspace. Whenever a read() system call is issued (whether directly or from inside fread()), the kernel reads the data from the disk and then copies it into userspace. That copying becomes expensive if the system call is issued very often. By reading in larger chunks you end up with fewer context switches and less copying.

Keep in mind though that read() isn't guaranteed to return exactly the number of bytes you asked for, which is why a reliable implementation always has to check its return value.
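
The advice above is about the C program, but since the question also asks about the Python side, the same chunked-read idea can be sketched there with os.open()/os.read(), which map directly onto the open(2)/read(2) system calls being discussed. This is only a rough illustration under that assumption; CHUNK_LINES is an arbitrary tuning knob and read_transitions_chunked is a made-up name:

import os

BPL = 47                # bytes per line, including the trailing newline
CHUNK_LINES = 4096      # arbitrary: how many lines to fetch per read()

def read_transitions_chunked(filename):
    nu, S = [], []
    numax = 0.0
    fd = os.open(filename, os.O_RDONLY)
    try:
        buf = b''
        while True:
            chunk = os.read(fd, BPL * CHUNK_LINES)   # one syscall, many lines
            if not chunk:
                break
            buf += chunk
            nlines = len(buf) // BPL
            for k in range(nlines):
                line = buf[k*BPL:(k+1)*BPL]
                x = float(line[-23:-11])
                nu.append(x)
                S.append(float(line[-10:-1]))
                if x > numax:
                    numax = x
            buf = buf[nlines*BPL:]   # carry over any partial trailing line
    finally:
        os.close(fd)
    return nu, S, numax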

Blagovest Buyukliev
  • I don't understand. First you suggest not using buffering, by changing from the fread()/... functions to the read()/... functions. Then you suggest using buffering, by reading larger chunks of data. IMHO you are suggesting avoiding automatic buffering and doing manual buffering instead, with all the possible bugs of a faulty implementation. – Didier Trosset Sep 23 '10 at 14:36
  • A proper implementation will always check the number of bytes actually read by `read()`. The point was that calling `read(2)` with a 47-byte memory block is far less efficient than doing it with a 1024-byte block for example. – Blagovest Buyukliev Sep 23 '10 at 14:43
  • Sure, but that's exactly why you use `fread()` instead - it'll call `read()` with a larger value and do the buffering for you. – caf Sep 24 '10 at 00:38
3

An approach that could probably be applied to the C, C++ and Python versions would be to memory-map the file. The most significant benefit is that it reduces the double-handling of data as it is copied from one buffer to another. In many cases there is also a benefit from the reduction in the number of system calls for I/O.
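
On the Python side this maps onto the standard mmap module: the file is mapped once and the fixed-width fields are sliced straight out of the mapping, with no per-line read calls or intermediate line buffers. A rough sketch, again assuming the question's 47-byte lines (read_transitions_mmap is a made-up name):

import mmap

BPL = 47    # bytes per line, including the trailing newline

def read_transitions_mmap(filename):
    nu, S = [], []
    numax = 0.0
    with open(filename, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            size = len(mm)
            offset = 0
            while offset + BPL <= size:
                line = mm[offset:offset + BPL]
                x = float(line[-23:-11])
                nu.append(x)
                S.append(float(line[-10:-1]))
                if x > numax:
                    numax = x
                offset += BPL
        finally:
            mm.close()
    return nu, S, numax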

torak
1

You have the 1 and the BPL arguments the wrong way around in fread() (the way you have it, it could read a partial line, which you don't test for). You should also be testing the return value of fread() before you try and use the returned data.

You might be able to speed the C version up a bit by reading more than one line at a time:

#define LINES_PER_READ 1000
char buf[LINES_PER_READ][BPL];

/* ... */

while (i < NTRANS && (num = fread(buf, BPL, LINES_PER_READ, fp)) > 0) {
    int line;

    for (line = 0; i < NTRANS && line < num; line++)
    {
        buf[line][BPL-1] = '\0';
        sp = &buf[line][BPL-10]; S[i] = atof(sp);
        buf[line][BPL-11] = '\0';
        sp = &buf[line][BPL-23]; nu[i] = atof(sp);
        if (nu[i] > numax) { numax = nu[i]; }
        ++i;
    }
}

On systems supporting posix_fadvise(), you should also do this upfront, after opening the file:

posix_fadvise(fileno(fp), 0, 0, POSIX_FADV_SEQUENTIAL);
caf
-1

Another possible speed-up, given the number of times you need to do it, is to use pointers to S and nu instead of indexing into arrays, e.g.,

double *pS = S, *pnu = nu;
...
*pS++ = atof(sp);
*pnu++ = atof(sp);
...

Also, since you are always converting from char to double at the same locations in buf, pre-compute the addresses outside of your loop instead of computing them each time in the loop.

sizzzzlerz
  • Please measure without doing this. It does not help readability, and usually has no effect at all on performance. – Didier Trosset Sep 23 '10 at 14:38
  • Thanks for your answer - I get a segmentation fault when I do this, though. Also, if BPL is #defined, won't the compiler do this pre-computing before execution for me? – Chris Sep 23 '10 at 14:39
  • I think most modern compilers would perform these kinds of optimizations automatically. Hence Didier's Comment. – torak Sep 23 '10 at 14:47