If you know the size of the array ahead of time, you can save time and space by loading each line into a preallocated target array as it is parsed.
For example:
In [173]: txt="""1,2,3,4,5,6,7,8,9,10
...: 2,3,4,5,6,7,8,9,10,11
...: 3,4,5,6,7,8,9,10,11,12
...: """
In [174]: np.genfromtxt(txt.splitlines(),dtype=int,delimiter=',',encoding=None)
Out[174]:
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])
Compare that with a simpler parsing function that fills such a preallocated array:
In [177]: def foo(txt,size):
     ...:     # fill a preallocated array row by row
     ...:     out = np.empty(size, int)
     ...:     for i,line in enumerate(txt):
     ...:         out[i,:] = line.split(',')   # strings are converted to int on assignment
     ...:     return out
     ...:
In [178]: foo(txt.splitlines(),(3,10))
Out[178]:
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])
The key line is

out[i,:] = line.split(',')

Assigning a list of strings into a numeric dtype array forces a conversion, the same as np.array(line..., dtype=int).
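To see that conversion in isolation (a minimal sketch; the variable names are my own):

```python
import numpy as np

# assigning digit strings into an int array parses them,
# just as np.array(..., dtype=int) would
out = np.empty(3, int)
out[:] = "7,8,9".split(',')
print(out)                                        # [7 8 9]
print(np.array("7,8,9".split(','), dtype=int))    # [7 8 9]
```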
In [179]: timeit np.genfromtxt(txt.splitlines(),dtype=int,delimiter=',',encoding=None)
266 µs ± 427 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [180]: timeit foo(txt.splitlines(),(3,10))
19.2 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The simpler, direct parser is much faster. However, suppose I try a simplified version of what loadtxt and genfromtxt use internally, appending each split line to a list and converting it all at once at the end:
In [184]: def bar(txt):
     ...:     # collect the split rows in a list, convert once at the end
     ...:     alist=[]
     ...:     for line in txt:
     ...:         alist.append(line.split(','))
     ...:     return np.array(alist, dtype=int)
     ...:
In [185]: bar(txt.splitlines())
Out[185]:
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])
In [186]: timeit bar(txt.splitlines())
13 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For this small case it's even faster; genfromtxt must have a lot of parsing overhead. The sample is small, so the memory consumed by the intermediate list doesn't matter here. For completeness, loadtxt:
In [187]: np.loadtxt(txt.splitlines(),dtype=int,delimiter=',')
Out[187]:
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])
In [188]: timeit np.loadtxt(txt.splitlines(),dtype=int,delimiter=',')
103 µs ± 50.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
With fromiter:
In [206]: def g(txt):
     ...:     # yield the fields one at a time; fromiter consumes a flat stream
     ...:     for row in txt:
     ...:         for item in row.split(','):
     ...:             yield item
In [209]: np.fromiter(g(txt.splitlines()),dtype=int).reshape(3,10)
Out[209]:
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])
In [210]: timeit np.fromiter(g(txt.splitlines()),dtype=int).reshape(3,10)
12.3 µs ± 21.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
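One refinement worth noting: fromiter accepts a count argument, so when the total number of fields is known it can preallocate the output buffer instead of growing it, much like foo does. A sketch (the helper name is my own):

```python
import numpy as np

def parse_csv(lines, nrows, ncols):
    # flatten all fields into one stream; count lets fromiter
    # allocate the full buffer up front, then reshape to the known size
    flat = (item for line in lines for item in line.split(','))
    return np.fromiter(flat, dtype=int, count=nrows * ncols).reshape(nrows, ncols)

print(parse_csv("1,2,3\n4,5,6".splitlines(), 2, 3))
```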