
It is well known [1][2] that numpy.loadtxt is not particularly fast at loading simple text files containing numbers.

I have been googling around for alternatives, and of course I stumbled across pandas.read_csv and astropy's io.ascii. However, these readers don't appear easy to decouple from their parent libraries, and I'd like to avoid adding a 200 MB, 5-second-import-time gorilla just for reading some ascii files.

The files I usually read are simple, no missing data, no malformed rows, no NaNs, floating point only, space or comma separated. But I need numpy arrays as output.

Does anyone know whether any of the parsers above can be used standalone, or of any other quick parser I could use?

Thank you in advance.

[1] Numpy loading csv TOO slow compared to Matlab

[2] http://wesmckinney.com/blog/a-new-high-performance-memory-efficient-file-parser-engine-for-pandas/

[Edit 1]

For the sake of clarity and to reduce background noise: as I stated at the beginning, my ascii files contain simple floats. No scientific notation, no Fortran-specific data, no funny stuff, nothing but simple floats.

Sample:

import numpy as np

arr = np.random.rand(1000, 100)
np.savetxt('float.csv', arr)
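For a rough sense of the gap, here is a minimal timing sketch on the sample file above. Timings will vary by machine, and `np.fromfile(sep=' ')` is only valid here because the data are flat, whitespace-separated floats with a known column count:

```python
# Sketch: compare np.loadtxt against np.fromfile on the sample file.
# np.fromfile with a whitespace separator reads a flat array of floats,
# so the known column count (100) is needed to restore the shape.
import time
import numpy as np

arr = np.random.rand(1000, 100)
np.savetxt('float.csv', arr)

def bench(label, fn):
    t0 = time.perf_counter()
    out = fn()
    print(f'{label}: {time.perf_counter() - t0:.3f}s, shape {out.shape}')
    return out

a = bench('np.loadtxt ', lambda: np.loadtxt('float.csv'))
b = bench('np.fromfile', lambda: np.fromfile('float.csv', sep=' ').reshape(-1, 100))
assert np.allclose(a, b)
```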

Infinity77
  • Similar current question, https://stackoverflow.com/questions/52232559/numpy-loadtxt-is-way-slower-than-open-readlines. Not a duplicate since it doesn't have an answer either. – hpaulj Sep 08 '18 at 19:32
  • Typically what's the shape of the loaded array? – hpaulj Sep 08 '18 at 19:49
  • Please provide some sample lines. – Mark Setchell Sep 08 '18 at 20:12
  • If import times are an issue, I'm wondering if you save some by just pulling in the relevant parts of `pandas.io` to avoid grabbing the full API. – fuglede Sep 08 '18 at 22:07
  • @hpaulj, it varies wildly; I have a few files containing data around 30x3, many others up to 10,000x9. – Infinity77 Sep 09 '18 at 05:03
  • @Mark Setchell: why? The question is clear as it stands, it doesn’t need code or samples. – Infinity77 Sep 09 '18 at 06:36
  • The reason I asked for a sample is that it is easy to spend 40 minutes answering a question and then find that the actual data/inputs or image are nothing like the description, e.g. they are in Fortran-style scientific notation. Another reason is that it is required by StackOverflow rules that a **"Minimal, Complete and Verifiable"** piece of code is provided which necessarily includes data if it is verifiable and can be run... https://stackoverflow.com/help/mcve – Mark Setchell Sep 09 '18 at 07:29
  • If your Input text-file is simple, why not try a simple c++ solution and wrap it with cython? But if you can avoid ascii-files at all, avoid it. Even the best solutions for reading/writing ascii files are very slow compared to binary files... – max9111 Sep 12 '18 at 07:56

1 Answer


Personally I just use pandas and astropy for this. Yes, they are big and slow to import, but they are very widely available and, on my machine, import in under a second, so they aren't so bad. I haven't tried it, but I would assume that extracting the CSV reader from pandas or astropy and getting it to build and run standalone isn't easy, so it's probably not a good way to go.

Is writing your own CSV-to-Numpy-array reader an option? If the CSV is simple, it should be doable in ~100 lines of e.g. C / Cython, and if you know your CSV format you can get performance and package size that a generic solution can't beat.
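Before reaching for C / Cython, a pure-Python / NumPy baseline of the same idea is worth trying. This sketch is my own (not taken from pandas or astropy) and assumes well-formed, whitespace-separated floats with a constant column count, as in the question:

```python
import numpy as np

def fast_load(path):
    # Read the whole file once, split on whitespace, and let NumPy
    # parse all tokens in a single call. Assumes no header, no missing
    # values, and a fixed number of columns per row.
    with open(path) as f:
        ncols = len(f.readline().split())
        f.seek(0)
        flat = np.array(f.read().split(), dtype=float)
    return flat.reshape(-1, ncols)
```

On float-only files like these this can already be noticeably faster than np.loadtxt while staying dependency-free; a C / Cython version of the same loop would shrink the remaining gap further.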

Another option you could look at is https://odo.readthedocs.io/. I don't have experience with it, and from a quick look I didn't see a direct CSV -> Numpy path. But it does make fast CSV -> database loading simple, and I'm sure there are fast database -> Numpy array options. So it might be possible to get a fast e.g. CSV -> in-memory SQLite -> Numpy array pipeline via odo and possibly a second package.
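For illustration only, the SQLite leg of that pipeline can be sketched with the standard library alone (no odo here, and `fetchall()` is not a tuned bulk path, so treat this as the shape of the idea rather than a fast solution):

```python
import csv
import sqlite3
import numpy as np

def csv_via_sqlite(path, ncols):
    # CSV -> in-memory SQLite -> NumPy. SQLite's REAL column affinity
    # coerces the text fields from csv.reader into floats on insert.
    cols = ', '.join(f'c{i} REAL' for i in range(ncols))
    placeholders = ', '.join('?' * ncols)
    con = sqlite3.connect(':memory:')
    con.execute(f'CREATE TABLE t ({cols})')
    with open(path, newline='') as f:
        con.executemany(f'INSERT INTO t VALUES ({placeholders})', csv.reader(f))
    rows = con.execute('SELECT * FROM t').fetchall()
    con.close()
    return np.asarray(rows, dtype=float)
```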

Christoph