
I am trying to read a large number (~200,000) of small binary files as NumPy arrays into a dictionary in Python:

import os
import numpy as np

def readfiles(limit):
    filelist = {}
    i = 1
    for filename in os.listdir('folder'):
        # Read each raw binary file as a flat array of 32-bit floats
        filelist[filename] = np.fromfile(os.path.join('folder', filename), 'float32')
        i += 1
        if i > limit:
            break

    return filelist

The limit argument is just for testing with a smaller number of files, normally I would read all the files in the folder.

The first time I run the script with a fairly large limit (90,000), it takes ~68 s. If I immediately re-run it, it finishes in ~1.2 s. The cProfile output for the two runs is:

>>> cProfile.run('readfiles(90000)')

90005 function calls in 68.768 seconds
Ordered by: standard name
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.284    0.284   68.690   68.690 <ipython-input-57-939c6a92cd68>:1(readfiles)
    1    0.079    0.079   68.768   68.768 <string>:1(<module>)
    1    0.000    0.000   68.768   68.768 {built-in method builtins.exec}
90000   68.313    0.001   68.313    0.001 {built-in method numpy.core.multiarray.fromfile}
    1    0.093    0.093    0.093    0.093 {built-in method posix.listdir}
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


>>> cProfile.run('readfiles(90000)')

90005 function calls in 1.970 seconds
Ordered by: standard name
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.137    0.137    1.900    1.900 <ipython-input-57-939c6a92cd68>:1(readfiles)
    1    0.070    0.070    1.970    1.970 <string>:1(<module>)
    1    0.000    0.000    1.970    1.970 {built-in method builtins.exec}
90000    1.673    0.000    1.673    0.000 {built-in method numpy.core.multiarray.fromfile}
    1    0.090    0.090    0.090    0.090 {built-in method posix.listdir}
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Subsequently, when I rerun the script in a completely different session, I still get ~1.2 s. This seems rather strange to me: it looks as though np.fromfile is not truly re-reading the files once it has read them, but is reading from some cache the second time. But I have not heard of cached data surviving into another session in a situation like this. Is that right? If yes, how do I change this so that the code actually re-reads the files? If not, why does the first run take so long?

I am using Python 3.5.1 with NumPy 1.11.2.

Edit: Restarting the system brings the longer runtime back, so this must be OS-level caching, as pointed out in the comments. Is there any way around it without rebooting the system?

    How completely different is your new session? Because I think there is some caching going on on the OS / filesystem level which would survive, say, your just starting a new python interpreter. No expert, though. – Paul Panzer Feb 07 '17 at 12:35
  • I only closed all open terminals and terminated all interactive python sessions. I didn't reboot the system. The code is supposed to run on a cluster so rebooting is not an option, although I can try and see if that helps on my machine. – krm Feb 07 '17 at 12:46
  • Yup! Restarting seems to clear whatever the cache was. I will edit the question with this information – krm Feb 07 '17 at 12:55
  • Which operating system are you running this on? – Warren Weckesser Feb 07 '17 at 15:05
  • I am using Ubuntu 14.04 – krm Feb 07 '17 at 15:55
  • In that case, the second paragraph in [this answer](http://stackoverflow.com/questions/15096269/the-fastest-way-to-read-input-in-python/15097561#15097561) shows how you can clear the disk cache. – Warren Weckesser Feb 07 '17 at 16:03
  • That works perfectly! Thanks. Will you be posting this as an answer? – krm Feb 08 '17 at 08:15

1 Answer


As mentioned in the comments, the following command, run as root, clears the Linux disk caches (the page cache plus dentries and inodes):

sync; echo 3 > /proc/sys/vm/drop_caches
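
Note that the > redirection is performed by the calling shell, not by echo, so a plain sudo echo 3 > /proc/sys/vm/drop_caches fails with a permission error; either run the command in a root shell or wrap the whole thing in sh -c. If you want to drop the cache between benchmark runs from within Python, a minimal sketch could look like this (the drop_caches helper name is just for illustration, and it assumes sudo can run non-interactively):

import subprocess
import time

def drop_caches():
    # Run sync and the drop_caches write inside a root shell; the
    # redirection must happen in the root shell, not in the caller's.
    subprocess.run(['sudo', 'sh', '-c',
                    'sync; echo 3 > /proc/sys/vm/drop_caches'],
                   check=True)

drop_caches()                      # start with a cold page cache
t0 = time.time()
readfiles(90000)                   # readfiles as defined in the question
print('cold cache: {:.1f} s'.format(time.time() - t0))

Keep in mind this flushes the page cache for the whole machine, so every other process will also see cold-cache I/O until the cache warms up again.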