> Is there a faster alternative to `os.walk`?
Yes. In fact, multiple.
- `scandir` (which will be in the stdlib in 3.5) is significantly faster than `walk`; a minimal walk built on it is sketched just after this list.
- The C function `fts` is significantly faster than `scandir`. I'm pretty sure there are wrappers on PyPI, although I don't know one off-hand to recommend, and it's not that hard to use via `ctypes` or `cffi` if you know any C.
- The `find` tool uses `fts`, and you can always `subprocess` to it if you can't use `fts` directly (second sketch below).
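Here's a rough sketch of the first option. On 3.5+ `scandir` is `os.scandir`; before that, the `scandir` package on PyPI provides the same API. The `fast_walk` name and its behavior are my own simplification, not a drop-in replacement for everything `os.walk` does (no `onerror`, no `topdown=False`):

```python
import os

try:
    from os import scandir        # Python 3.5+
except ImportError:
    from scandir import scandir   # pip install scandir

def fast_walk(top):
    """Stripped-down os.walk built on scandir: the entry type comes from
    the directory listing itself, avoiding a separate stat() call per
    entry (which is the main cost in the classic os.walk)."""
    dirs, files = [], []
    for entry in scandir(top):
        if entry.is_dir(follow_symlinks=False):
            dirs.append(entry.name)
        else:
            files.append(entry.name)
    yield top, dirs, files
    for name in dirs:
        # Top-down recursion, like the os.walk default.
        for result in fast_walk(os.path.join(top, name)):
            yield result
```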
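And the last option, shelling out to `find` (this assumes a `find` that supports `-print0`, which both GNU and BSD do):

```python
import subprocess

def find_files(top):
    """Return all regular-file paths under top, as bytes, using find(1)."""
    out = subprocess.check_output(['find', top, '-type', 'f', '-print0'])
    # -print0 separates paths with NUL bytes, so names containing spaces
    # or newlines survive; the trailing NUL leaves one empty entry to drop.
    return out.split(b'\0')[:-1]
```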
> Would threading speed things up?
That depends on details of your system that we don't have, but… you're spending all of your time waiting on the filesystem. Unless you have multiple independent drives that are only bound together at user level (that is, not LVM or something below it like RAID) or not at all (e.g., one is just mounted under the other's filesystem), issuing multiple requests in parallel will probably not speed things up.
Still, this is pretty easy to test; why not try it and see?
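If you do want to test it, something like this is enough for a rough answer; `scan_tree` is a hypothetical stand-in for whatever per-directory work you're actually doing:

```python
import os
import sys
import time
from concurrent.futures import ThreadPoolExecutor

def scan_tree(top):
    # Placeholder workload: just count the files under top.
    return sum(len(filenames) for _, _, filenames in os.walk(top))

def subdirs(root):
    return [os.path.join(root, d) for d in os.listdir(root)
            if os.path.isdir(os.path.join(root, d))]

def serial(root):
    return sum(scan_tree(d) for d in subdirs(root))

def parallel(root, workers=4):
    # One thread per top-level subdirectory, up to `workers` at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(scan_tree, subdirs(root)))

if __name__ == '__main__':
    root = sys.argv[1]
    for label, fn in (('serial', serial), ('parallel', parallel)):
        start = time.time()
        total = fn(root)
        print('%-9s %d files in %.2fs' % (label, total, time.time() - start))
```

If the parallel run isn't noticeably faster on your hardware, threading isn't going to save you here.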
One more idea: you may be spending a lot of time spawning and communicating with those `file` processes. There are multiple Python libraries that wrap the same `libmagic` that `file` itself uses. I don't want to recommend one in particular over the others, so have a look at the PyPI search results and compare them yourself.
As monkut suggests, make sure you're doing bulk commits, not autocommitting each insert with sqlite. As the FAQ explains, sqlite can do ~50000 inserts per second, but only a few dozen transactions per second.
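A minimal sketch of what that looks like (the table and column names here are made up):

```python
import sqlite3

conn = sqlite3.connect('files.db')
conn.execute('CREATE TABLE IF NOT EXISTS files (path TEXT, filetype TEXT)')

rows = [('/tmp/a.txt', 'ASCII text'),
        ('/tmp/b',     'ELF 64-bit LSB executable')]

# One transaction for the whole batch: the connection-as-context-manager
# commits on success and rolls back on error, instead of paying for a
# commit after every single row.
with conn:
    conn.executemany('INSERT INTO files (path, filetype) VALUES (?, ?)', rows)
```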
While we're at it, if you can put the sqlite file on a different filesystem than the one you're scanning (or keep it in memory until you're done, then write it to disk all at once), that might be worth trying.
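And the in-memory variant; `Connection.backup()` needs Python 3.7+, so on older versions you'd replay `conn.iterdump()` into a file-backed connection instead:

```python
import sqlite3

mem = sqlite3.connect(':memory:')
mem.execute('CREATE TABLE files (path TEXT, filetype TEXT)')
# ... run the whole scan, inserting into `mem` ...

# Once the scan is finished, write the completed database to disk in one go.
disk = sqlite3.connect('files.db')
mem.backup(disk)
disk.close()
```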
Finally, but most importantly:
- Profile your code to see where the hotspots are, instead of guessing (a minimal example follows this list).
- Create small data sets and benchmark different alternatives to see how much benefit you get.
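For the first point, even the stdlib profiler will tell you whether the time is going to the filesystem walk, to `file`/`libmagic`, or to sqlite. `scan_tree` here is again just a stand-in for your real entry point, and `/usr/share` is an arbitrary example path:

```python
import cProfile
import os
import pstats

def scan_tree(top):
    # Stand-in for your real top-level function.
    return sum(len(filenames) for _, _, filenames in os.walk(top))

profiler = cProfile.Profile()
profiler.enable()
scan_tree('/usr/share')
profiler.disable()

# Sort by cumulative time so the expensive call chains float to the top.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)
```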