
When comparing these two ways of doing the same thing:

import numpy as np
import time

start_time = time.time()
for j in range(1000):
    bv = np.loadtxt('file%d.dat' % (j + 1))
    if j % 100 == 0:
        print(bv[300, 0])
T1 = time.time() - start_time
print("--- %s seconds ---" % T1)

and

import numpy as np
import time

start_time = time.time()
for j in range(1000):
    a = open('file%d.dat' % (j + 1), 'r')
    b = a.readlines()
    a.close()
    for i in range(len(b)):
        b[i] = b[i].strip('\n')
        b[i] = b[i].split('\t')
        b[i] = list(map(float, b[i]))  # list() needed on Python 3
    bv = np.asarray(b)
    if j % 100 == 0:
        print(bv[300, 0])
T1 = time.time() - start_time
print("--- %s seconds ---" % T1)

I have noticed that the second one is way faster. Is there any way to have something as concise as the first method and as fast as the second one? Why is loadtxt so slow compared to performing the same task manually?

3sm1r
  • BTW, you can make the 2nd version even faster and use less RAM. There's no need to load the whole file into a list with `.readlines`; you can loop directly over the lines, e.g. `for row in a:`. And even when you do want to iterate over a list, it's better to loop directly over the list items rather than messing with list indices. That gives you cleaner and faster code, since you can do things like `row = row.rstrip('\n').split('\t')`. (A sketch follows after these comments.) – PM 2Ring Sep 08 '18 at 06:28
  • ① [loadtxt is slow](https://stackoverflow.com/questions/26347297/using-numpy-loadtxt-for-loading-multiple-files-is-slow) ② [fastest way to read input](https://stackoverflow.com/questions/15096269/the-fastest-way-to-read-input-in-python) ③ [loadtxt slow wrt MATLAB](https://stackoverflow.com/questions/18259393/numpy-loading-csv-too-slow-compared-to-matlab) – gboffi Sep 08 '18 at 06:56
  • `loadtxt` is Python code; nothing specially compiled. It reads the file line by line, splits and partially parses the lines, and collects the data in a list of lists, then builds the array once at the end. `genfromtxt` is similar, but with somewhat more sophisticated line and dtype handling. – hpaulj Sep 08 '18 at 07:02
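A minimal sketch of the streaming approach PM 2Ring suggests, assuming the same tab-separated file1.dat … file1000.dat layout as in the question (untimed here):

import numpy as np
import time

start_time = time.time()
for j in range(1000):
    # Iterate over the file object directly instead of calling
    # .readlines(), so each line is parsed as it is read.
    with open('file%d.dat' % (j + 1)) as f:
        rows = [[float(x) for x in line.rstrip('\n').split('\t')]
                for line in f]
    bv = np.asarray(rows)
    if j % 100 == 0:
        print(bv[300, 0])
print("--- %s seconds ---" % (time.time() - start_time))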

1 Answer


With a simple, not-too-large CSV created with:

In [898]: arr = np.ones((1000,100))
In [899]: np.savetxt('float.csv',arr)

the loadtxt version:

In [900]: timeit data = np.loadtxt('float.csv')
112 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

fromfile can load text, though it doesn't preserve any shape info (and shows no apparent speed advantage):

In [901]: timeit data = np.fromfile('float.csv', dtype=float, sep=' ').reshape(-1,100)
129 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

the most concise version of the 'manual' approach that I can come up with:

In [902]: %%timeit
     ...: with open('float.csv') as f:
     ...:     data = np.array([line.strip().split() for line in f],float)
52.9 ms ± 589 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This roughly 2x improvement over loadtxt seems typical of variations on this approach.
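One further variation along the same lines, a sketch only and untimed here: slurp the whole file in one read and let NumPy parse the flat token list, assuming whitespace-separated floats and the 100-column shape used above:

import numpy as np

# Read everything in one go; str.split() with no arguments splits
# on any whitespace, so spaces and newlines are handled alike.
with open('float.csv') as f:
    data = np.array(f.read().split(), dtype=float).reshape(-1, 100)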

pd.read_csv takes about the same time.
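For reference, a minimal sketch of that pandas route, assuming the same space-delimited float.csv; note the extra step to get a plain ndarray, which the question's author also mentions in the comment below:

import pandas as pd

# header=None because np.savetxt wrote no header row;
# delim_whitespace=True matches the space-separated output.
df = pd.read_csv('float.csv', delim_whitespace=True, header=None)
data = df.values  # DataFrame -> plain NumPy array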

genfromtxt is a bit faster than loadtxt:

In [907]: timeit data = np.genfromtxt('float.csv')
98.2 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
hpaulj
  • I did it with pd.read_csv. It works quite well, but I have to use an additional command to switch to the array type, that is `file = file.values`. – 3sm1r Sep 08 '18 at 20:21