I am trying to read a large number of small binary format files (~200,000) as numpy arrays into a dictionary in python:
import os
import numpy as np

def readfiles(limit):
    filelist = {}
    i = 1
    for filename in os.listdir('folder'):
        filelist[filename] = np.fromfile('folder/' + filename, 'float32')
        i += 1
        if i > limit:
            break
    return filelist
The limit argument is just for testing with a smaller number of files; normally I would read all the files in the folder.
The first time I run the script with a fairly large limit (90,000), it takes ~68 s. If I immediately re-run the script it runs in ~1.2 s. The cProfiles give:
>>> cProfile.run('readfiles(90000)')
90005 function calls in 68.768 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.284 0.284 68.690 68.690 <ipython-input-57-939c6a92cd68>:1(readfiles)
1 0.079 0.079 68.768 68.768 <string>:1(<module>)
1 0.000 0.000 68.768 68.768 {built-in method builtins.exec}
90000 68.313 0.001 68.313 0.001 {built-in method numpy.core.multiarray.fromfile}
1 0.093 0.093 0.093 0.093 {built-in method posix.listdir}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
>>> cProfile.run('readfiles(90000)')
90005 function calls in 1.970 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.137 0.137 1.900 1.900 <ipython-input-57-939c6a92cd68>:1(readfiles)
1 0.070 0.070 1.970 1.970 <string>:1(<module>)
1 0.000 0.000 1.970 1.970 {built-in method builtins.exec}
90000 1.673 0.000 1.673 0.000 {built-in method numpy.core.multiarray.fromfile}
1 0.090 0.090 0.090 0.090 {built-in method posix.listdir}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Subsequently, when I rerun the script in a completely different session, I still get ~1.2 s. This seems rather strange to me. It looks as though np.fromfile
is not truly re-reading the files once it has read them, but is instead reading some cached copy the second time. But I have not heard of cached data being reused across sessions in a situation like this. Is that right? If so, how do I change this so that the code actually re-reads the files? If not, why does the first run take so long?
I am using Python 3.5.1 with NumPy 1.11.2.
Edit: After restarting the system, the longer runtime comes back, so this must be OS-level caching of the file data, as pointed out in the comments. Is there any way around that without rebooting my system?
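One way to get cold-cache timings back without a reboot (a sketch, Linux-specific; `drop_file_cache` is a hypothetical helper, not part of the code above) is to ask the kernel to evict a single file's pages from the page cache with os.posix_fadvise. As a heavier-handed alternative, running echo 3 > /proc/sys/vm/drop_caches as root drops the entire page cache.

```python
import os

def drop_file_cache(path):
    # Advise the kernel that this file's cached pages are no longer
    # needed; on Linux they are evicted, so the next read hits the disk.
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, length=0 means "from the start to the end of the file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Calling this on every file between timing runs should restore the slow first-run behaviour for the next measurement, without disturbing the rest of the system's cache.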