
I need to load 1460 files into a list, from a folder containing 163,360 files.

I use the following Python code to do this:

import os
import glob

Directory = 'C:\\Users\\Nicolai\\Desktop\\sealev\\dkss_all'
stationName = '20002'
filenames = glob.glob("dkss."+stationName+"*")

This has been running fine so far, but today when I booted my machine and ran the code, it just got stuck on the last line. I tried rebooting, which didn't help, so in the end I just let it run, went for a lunch break, and came back to find it had finished. It took 45 minutes. Now when I run it, it takes less than a second. What is going on? Is this a caching thing? How can I avoid having to wait 45 minutes again? Any explanations would be much appreciated.

NicolaiF
  • Can you change filesystems? Some might do better than others here... but that's nothing you'll be able to fix from within your Python code. – Charles Duffy Mar 11 '15 at 12:20
  • http://stackoverflow.com/questions/5090418/is-there-a-way-to-efficiently-yield-every-file-in-a-directory-containing-million – RvdK Mar 11 '15 at 12:26
  • @RvdK, nice -- that's both better-written and more on-point than http://stackoverflow.com/questions/4403598/list-files-in-a-folder-as-a-stream-to-begin-process-immediately. – Charles Duffy Mar 11 '15 at 12:27
  • BTW, if you could move each station into its own subdirectory, that would make this much more efficient. – Charles Duffy Mar 11 '15 at 12:29
  • If you look at how NNTP spools are designed -- with IDs hashed into small directories... well, now you know why. (Granted, that's mostly to improve lookup of a _known_ ID, which filesystems with indexed directories -- which is an optional feature in Linux's ext3 and ext4 -- also solve). – Charles Duffy Mar 11 '15 at 12:33
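
A minimal sketch of the per-station reorganisation suggested in the last two comments, assuming it is run from the dkss_all folder and that files follow the dkss.<station>.* naming from the question; the target layout (one subfolder per station) is an assumption, not part of the original code:

import glob
import os
import shutil

for name in glob.glob('dkss.*'):
    station = name.split('.')[1]            # e.g. '20002', assuming dkss.<station>.<rest> names
    if not os.path.isdir(station):
        os.makedirs(station)                # per-station subfolder
    shutil.move(name, os.path.join(station, name))

After this one-off reorganisation, listing a single station's folder only touches that station's files instead of all 163,360 directory entries.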

2 Answers


Yes, it is a caching thing. Your hard disk is a slow peripheral, and reading 163,360 filenames from it can take some time. Yes, your operating system caches that kind of information for you. Python has to wait for that information to be loaded before it can filter out the matching filenames.

You won't have to wait all that time again until your operating system decides to reuse the memory holding the cached directory information for something else, or until you restart the computer. Since you had rebooted your computer, the information was no longer cached and the listing had to be read from disk again.
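
To see the effect for yourself, you could time the glob call twice in a row; a minimal sketch, assuming the script is run from inside the dkss_all folder (as the glob pattern in the question implies):

import glob
import time

start = time.time()
filenames = glob.glob('dkss.20002*')  # cold cache after a reboot: directory entries are read from disk
print('%d matches in %.2f s' % (len(filenames), time.time() - start))

start = time.time()
filenames = glob.glob('dkss.20002*')  # warm cache: the directory entries are already in memory
print('%d matches in %.2f s' % (len(filenames), time.time() - start))

The first call pays the full cost of reading the directory; the second is served from the operating system's cache and returns almost immediately.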

Martijn Pieters

Presuming that ls on that same directory is just as slow, you can't reduce the total time needed for the directory listing operation. Filesystems are slow sometimes (which is why, yes, the operating system does cache directory entries).

However, there actually is something you can do in your Python code: you can operate on filenames as they come in, rather than waiting for the entire result to finish before the rest of your code even starts. Unfortunately, this functionality is not present in the standard library, meaning you need to call C functions.

See Ben Hoyt's scandir module for an implementation of this. See also this StackOverflow question, describing the problem.

Using scandir might look something like the following:

from scandir import scandir  # Ben Hoyt's scandir module (pip install scandir)

prefix = 'dkss.%s.' % stationName
for direntry in scandir(path='.'):
    if direntry.name.startswith(prefix):
        pass  # do whatever work you want with this file here
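
On Python 3.5 and later, the same lazy directory iteration is available in the standard library as os.scandir, so no third-party module is needed; an equivalent sketch, using the stationName variable from the question:

import os

prefix = 'dkss.%s.' % stationName
for direntry in os.scandir('.'):
    if direntry.name.startswith(prefix):
        pass  # do whatever work you want with this file here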
Charles Duffy