I posted this question because I was wondering whether I did something terribly wrong to get this result.
I have a medium-sized CSV file and I tried to load it with NumPy. For illustration, I generated the file in Python:
import timeit
import numpy as np
my_data = np.random.rand(1500000, 3)*10
np.savetxt('./test.csv', my_data, delimiter=',', fmt='%.2f')
Then I timed two functions, numpy.genfromtxt and numpy.loadtxt:
setup_stmt = 'import numpy as np'
stmt1 = """\
my_data = np.genfromtxt('./test.csv', delimiter=',')
"""
stmt2 = """\
my_data = np.loadtxt('./test.csv', delimiter=',')
"""
t1 = timeit.timeit(stmt=stmt1, setup=setup_stmt, number=3)
t2 = timeit.timeit(stmt=stmt2, setup=setup_stmt, number=3)
The result shows t1 = 32.159652940464184 and t2 = 52.00093725634724.
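For comparison, here is a hedged sketch that times pandas.read_csv on a file of the same format (assuming pandas is installed; `test_small.csv` and the row count `n` are made up for the example so the snippet runs on its own — scale `n` up to 1500000 to match the original test). pandas uses a C parser, which is typically much faster than the pure-Python parsing in genfromtxt/loadtxt:

```python
import timeit
import numpy as np

# Generate a smaller file of the same format so the sketch is self-contained.
n = 10000
np.savetxt('./test_small.csv', np.random.rand(n, 3) * 10,
           delimiter=',', fmt='%.2f')

# Time pandas.read_csv the same way as the numpy readers above.
setup_stmt = 'import pandas as pd'
stmt3 = "my_data = pd.read_csv('./test_small.csv', header=None).values"
t3 = timeit.timeit(stmt=stmt3, setup=setup_stmt, number=3)
print(t3)
```

`.values` converts the resulting DataFrame to a plain NumPy array, so the end result has the same type as with loadtxt/genfromtxt.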
However, when I tried MATLAB:
tic
for i = 1:3
my_data = dlmread('./test.csv');
end
toc
The result shows: Elapsed time is 3.196465 seconds.
I understand that there may be some difference in loading speed, but:
- This is much slower than I expected;
- Shouldn't np.loadtxt be faster than np.genfromtxt?
- I haven't tried the Python csv module yet, because loading CSV files is something I do very frequently, and with the csv module the code gets a bit verbose... But I'd be happy to try it if that's the only way. Right now I'm mainly concerned about whether I'm doing something wrong.
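For what it's worth, the csv-module version need not be very verbose. A minimal, self-contained sketch (it writes a tiny sample file first so it runs on its own; the filename `sample.csv` and the 5x3 size are just for illustration — in practice you would point it at `./test.csv`):

```python
import csv
import numpy as np

# Write a small file of the same format as test.csv.
sample = np.random.rand(5, 3) * 10
np.savetxt('./sample.csv', sample, delimiter=',', fmt='%.2f')

# Read it back with the stdlib csv module and convert to a NumPy array.
with open('./sample.csv') as f:
    my_data = np.array([[float(x) for x in row] for row in csv.reader(f)])

print(my_data.shape)  # (5, 3)
```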
Any input would be appreciated. Thanks a lot in advance!