
The task:

I am working with 4 TB of data/files, stored on an external USB disk: images, HTML, videos, executables and so on.

I want to index all those files in a sqlite3 database with the following schema:

path TEXT, mimetype TEXT, filetype TEXT, size INT

So far:

I walk recursively through the mounted directory with `os.walk`, run the Linux `file` command on each file via Python's `subprocess`, and get the size with `os.path.getsize()`. Finally, the results are written into the database, which is stored on my computer; the USB disk is mounted with `-o ro`, of course. No threading, by the way.

You can see the full code here http://hub.darcs.net/ampoffcom/smtid/browse/smtid.py
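
For reference, here is a minimal sketch of that pipeline (the table name, database file, and mount point are placeholders; the real code is at the link above):

```python
import os
import sqlite3
import subprocess

# Illustrative sketch only: table name, database path and mount point are placeholders.
conn = sqlite3.connect("index.db")
conn.execute("CREATE TABLE IF NOT EXISTS files "
             "(path TEXT, mimetype TEXT, filetype TEXT, size INT)")

for root, dirs, files in os.walk("/mnt/usb"):
    for name in files:
        path = os.path.join(root, name)
        # One `file` subprocess per file, for the MIME type and the description.
        mimetype = subprocess.check_output(
            ["file", "--brief", "--mime-type", path]).decode().strip()
        filetype = subprocess.check_output(
            ["file", "--brief", path]).decode().strip()
        size = os.path.getsize(path)
        conn.execute("INSERT INTO files VALUES (?, ?, ?, ?)",
                     (path, mimetype, filetype, size))

conn.commit()
conn.close()
```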

The problem:

The code is really slow. I noticed that the deeper the directory structure, the slower the code gets. I suspect `os.walk` might be the problem.

The questions:

  1. Is there a faster alternative to `os.walk`?
  2. Would threading speed things up?
Steffen
  • Make sure you're doing bulk commits and not using autocommit. – monkut May 07 '15 at 06:03
  • If `os.walk` is your bottleneck, the forthcoming Python 3.5 has [`os.scandir`](https://www.python.org/dev/peps/pep-0471/), which is basically an optimized `os.walk`. – Steven Rumbalski May 07 '15 at 06:05
  • I doubt threading would help much here, since it sounds like your code is primarily I/O-bound. – Lukas Graf May 07 '15 at 06:10
  • You may want to consider using a Python library like `filemagic` instead of spawning a subprocess for each file. – abarnert May 07 '15 at 06:12
  • If you really really want to know for sure, then you should benchmark it and find out where the time is going. If I were to simply speculate, I'd guess that spawning new processes is no small overhead, and look at [guessing the file type using a libmagic binding for Python](http://stackoverflow.com/a/1974737/391161). – merlin2011 May 07 '15 at 06:12
  • Analyzing 4 TB of data will simply not be fast. Run the program and leave for the night/weekend. – TigerhawkT3 May 07 '15 at 06:13
  • Also, I'm pretty sure there are libraries that wrap `fts` on PyPI (I was writing one myself a few years ago, until I found one already done…), which will be significantly faster than `scandir` on most filesystems (which is already faster than `walk`). Or, if you can't find one, you may want to subprocess out to `find`. (Yes, the exact opposite of the advice I gave you with `file`… but here, it's one subprocess that does a lot of work, not millions of subprocesses that each do very little work.) – abarnert May 07 '15 at 06:14
  • @LukasGraf I also used the Python modules you suggest, but the `file` command on Arch is much more verbose: exact PDF and docx versions, image size and resolution, line endings, ... – Steffen May 07 '15 at 06:17
  • You should profile the code to see what is so slow. Alternatively, you can just let the program walk through the files without identifying the MIME type or saving the data, just to see if `os.walk()` is the problem. I could imagine it is the process spawning or the I/O instead. – Klaus D. May 07 '15 at 06:18
  • FYI: @monkut's bulk-commit hint improved the code: 500 GB in 10 minutes. – Steffen May 07 '15 at 07:19
  • And how long was it taking before? – TigerhawkT3 May 07 '15 at 07:29
  • 500 GB in 10 minutes sounds reasonable. BTW, I took a peek at the code; it's probably worth your time to get familiar with `argparse`, which takes care of most of the crap when building command line tools. – monkut May 07 '15 at 09:08
  • @StevenRumbalski `scandir()` is not faster than `os.fwalk` (note: `f`). – jfs May 09 '15 at 15:11
  • @monkut: for small personal projects (when I can install stuff), the [`docopt`](http://docopt.org/) module allows you to create a command-line parser from a usage message. – jfs May 09 '15 at 15:16

1 Answer


Is there a faster alternative to `os.walk`?

Yes. In fact, multiple.

  • `scandir` (which will be in the stdlib in 3.5) is significantly faster than `walk`; see the sketch after this list.
  • The C function `fts` is significantly faster than `scandir`. I'm pretty sure there are wrappers on PyPI, although I don't know one off-hand to recommend, and it's not that hard to use via `ctypes` or `cffi` if you know any C.
  • The `find` tool uses `fts`, and you can always subprocess out to it if you can't use `fts` directly.
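
As a sketch of the first bullet, assuming Python 3.5+ (or the `scandir` package on PyPI, which provides the same API), here is a recursive walk that also picks up file sizes from the directory entries:

```python
import os

def iter_files(top):
    """Yield (path, size) for every regular file under top."""
    for entry in os.scandir(top):
        if entry.is_dir(follow_symlinks=False):
            # Recurse into subdirectories without following symlinks.
            yield from iter_files(entry.path)
        elif entry.is_file(follow_symlinks=False):
            # DirEntry caches its stat result, so this costs at most one
            # stat() per file (and none at all on Windows).
            yield entry.path, entry.stat(follow_symlinks=False).st_size

# Example usage (the mount point is a placeholder):
count = total = 0
for path, size in iter_files("/mnt/usb"):
    count += 1
    total += size
print(count, "files,", total, "bytes")
```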

Would threading speed things up?

That depends on details of your system that we don't have, but… you're spending all of your time waiting on the filesystem. Unless you have multiple independent drives that are only bound together at user level (that is, not with LVM or something below it like RAID), or not bound at all (e.g., one is just mounted under the other's filesystem), issuing multiple requests in parallel will probably not speed things up.

Still, this is pretty easy to test; why not try it and see?
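
For example, a rough way to measure it (a sketch only; the subtree path, sample size, and worker counts are arbitrary placeholders): gather a sample of paths, then time the same `file` calls with different numbers of threads.

```python
import os
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

def mime_of(path):
    # The same per-file `file` call the indexer makes.
    return subprocess.check_output(
        ["file", "--brief", "--mime-type", path]).decode().strip()

# Gather a sample of paths from one subtree (placeholder path).
sample = []
for root, dirs, files in os.walk("/mnt/usb/some/subtree"):
    sample.extend(os.path.join(root, f) for f in files)
    if len(sample) >= 2000:
        break

# Time the same work with different thread counts and compare.
for workers in (1, 2, 4, 8):
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(mime_of, sample))
    print(workers, "workers:", round(time.time() - start, 1), "s")
```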


One more idea: you may be spending a lot of time spawning and communicating with those `file` processes. There are multiple Python libraries that use the same libmagic library that `file` does. I don't want to recommend one in particular over the others, so search PyPI for libmagic bindings and pick one.
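
Purely as an illustration (not an endorsement of any one binding), this is roughly what using one of them, python-magic, looks like; be aware that several packages install a module named `magic` with slightly different APIs.

```python
import magic  # the python-magic package, one of several libmagic bindings

# Creating the Magic objects once avoids reloading the magic database per file.
mime_detector = magic.Magic(mime=True)
desc_detector = magic.Magic()

path = "/mnt/usb/some/file.pdf"            # placeholder path
mimetype = mime_detector.from_file(path)   # e.g. 'application/pdf'
filetype = desc_detector.from_file(path)   # e.g. 'PDF document, version 1.4'
```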


As monkut suggests, make sure you're doing bulk commits, not autocommitting each insert with sqlite. As the SQLite FAQ explains, sqlite can do ~50,000 inserts per second, but only a few dozen transactions per second.
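
A sketch of what batched commits can look like (the batch size and the `generate_records()` helper are placeholders):

```python
import sqlite3

def generate_records():
    """Placeholder: yield (path, mimetype, filetype, size) tuples from the scan."""
    yield ("/mnt/usb/example.txt", "text/plain", "ASCII text", 123)

conn = sqlite3.connect("index.db")
conn.execute("CREATE TABLE IF NOT EXISTS files "
             "(path TEXT, mimetype TEXT, filetype TEXT, size INT)")

batch = []
for record in generate_records():
    batch.append(record)
    if len(batch) >= 10000:
        conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", batch)
        conn.commit()    # one transaction per 10,000 rows, not one per row
        batch = []

if batch:                # flush whatever is left over
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", batch)
    conn.commit()
conn.close()
```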

While we're at it, if you can put the sqlite file on a different filesystem than the one you're scanning (or keep it in memory until you're done, then write it to disk all at once), that might be worth trying.
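
For the in-memory variant, one possible sketch (the file names are placeholders): build the index in a `:memory:` database and dump it to disk in one go at the end with `iterdump()`.

```python
import sqlite3

mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE files (path TEXT, mimetype TEXT, filetype TEXT, size INT)")

# ... perform all the inserts against `mem` while scanning ...

# When the scan is finished, write everything out as SQL in one pass.
with open("/home/user/index_dump.sql", "w") as f:
    for statement in mem.iterdump():   # yields the CREATE TABLE and INSERT statements
        f.write(statement + "\n")
mem.close()
```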


Finally, but most importantly:

  • Profile your code to see where the hotspots are, instead of guessing (a minimal profiling sketch follows after this list).
  • Create small data sets and benchmark different alternatives to see how much benefit you get.
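
For the profiling bullet, a minimal sketch; `index_tree()` stands in for whatever function does the actual scanning of one small subtree.

```python
import cProfile
import pstats

def index_tree(top):
    """Placeholder for the real scanning/indexing function."""
    pass

# Profile a limited run (one small subtree), then look at cumulative time.
cProfile.run("index_tree('/mnt/usb/some/subtree')", "index.prof")
pstats.Stats("index.prof").sort_stats("cumulative").print_stats(20)
```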
abarnert