One thing to check is the actual performance of the disk system itself. Especially if you use spinning disks (not SSD), your practical disk read speed may be one of the explaining factors for the performance. So, before doing too much optimization, check if reading the same data into memory (by, e.g., mydata = open('myfile.txt').read()
) takes an equivalent amount of time. (Just make sure you do not get bitten by disk caches; if you load the same data twice, the second time it will be much faster because the data is already in RAM cache.)
See the update below before believing what I write underneath
If your problem is really parsing of the files, then I am not sure if any pure Python solution will help you. As you know the actual structure of the files, you do not need to use a generic CSV parser.
There are three things to try, though:
- Python
csv
package and csv.reader
- NumPy
genfromtext
- Numpy
loadtxt
The third one is probably fastest if you can use it with your data. At the same time it has the most limited set of features. (Which actually may make it fast.)
Also, the suggestions given you in the comments by crclayton
, BKay
, and EdChum
are good ones.
Try the different alternatives! If they do not work, then you will have to do write something in a compiled language (either compiled Python or, e.g. C).
Update: I do believe what chrisb
says below, i.e. the pandas
parser is fast.
Then the only way to make the parsing faster is to write an application-specific parser in C (or other compiled language). Generic parsing of CSV files is not straightforward, but if the exact structure of the file is known there may be shortcuts. In any case parsing text files is slow, so if you ever can translate it into something more palatable (HDF5, NumPy array), loading will be only limited by the I/O performance.