32

I need to walk through a folder with approximately ten thousand files. My old vbscript is very slow in handling this. Since I've started using Ruby and Python since then, I made a benchmark between the three scripting languages to see which would be the best fit for this job.

The results of the tests below on a subset of 4500 files on a shared network are

Python: 106 seconds
Ruby: 5 seconds
Vbscript: 124 seconds

That Vbscript would be slowest was no surprise but I can't explain the difference between Ruby and Python. Is my test for Python not optimal? Is there a faster way to do this in Python?

The test for thumbs.db is just for the test, in reality there are more tests to do.

I needed something that checks every file on the path and doesn't produce too much output to not disturb the timing. The results are a bit different each run but not by much.

#python2.7.0
import os

def recurse(path):
  for (path, dirs, files) in os.walk(path):
    for file in files:
      if file.lower() == "thumbs.db":
        print (path+'/'+file)

if __name__ == '__main__':
  import timeit
  path = '//server/share/folder/'
  print(timeit.timeit('recurse("'+path+'")', setup="from __main__ import recurse", number=1))
'vbscript5.7
set oFso = CreateObject("Scripting.FileSystemObject")
const path = "\\server\share\folder"
start = Timer
myLCfilename="thumbs.db"

sub recurse(folder)
  for each file in folder.Files
    if lCase(file.name) = myLCfilename then
      wscript.echo file
    end if
  next
  for each subfolder in folder.SubFolders
    call Recurse(subfolder)
  next
end Sub

set folder = oFso.getFolder(path)
recurse(folder)
wscript.echo Timer-start
#ruby1.9.3
require 'benchmark'

def recursive(path, bench)
  bench.report(path) do
    Dir["#{path}/**/**"].each{|file| puts file if File.basename(file).downcase == "thumbs.db"}
  end
end

path = '//server/share/folder/'
Benchmark.bm {|bench| recursive(path, bench)}

EDIT: since i suspected the print caused a delay i tested the scripts with printing all 4500 files and also printing none, the difference remains, R:5 P:107 in the first case and R:4.5 P:107 in the latter

EDIT2: based on the answers and comments here a Python version that in some cases could run faster by skipping folders

import os

def recurse(path):
  for (path, dirs, files) in os.walk(path):
    for file in files:
      if file.lower() == "thumbs.db":
        print (path+'/'+file)

def recurse2(path):
    for (path, dirs, files) in os.walk(path):
        for dir in dirs:
            if dir in ('comics'):
                dirs.remove(dir)
        for file in files:
            if file.lower() == "thumbs.db":
                print (path+'/'+file)


if __name__ == '__main__':
  import timeit
  path = 'f:/'
  print(timeit.timeit('recurse("'+path+'")', setup="from __main__ import recurse", number=1)) 
#6.20102692
  print(timeit.timeit('recurse2("'+path+'")', setup="from __main__ import recurse2", number=1)) 
#2.73848228
#ruby 5.7
peter
  • 41,770
  • 5
  • 64
  • 108
  • Bloody good question, this. I can't see any obvious improvements that can be made to the Python code. I'm not near a decent internet connection, and so can't download Ruby and test this at the moment. – Chinmay Kanchi Oct 30 '12 at 11:56
  • Hmmm... having said that, I wonder if the `for` loop is the culprit here. Try `fnames = [name for name in files if name.lower()=='thumbs.db']`, followed by `if fnames: print fnames[0]`. – Chinmay Kanchi Oct 30 '12 at 12:01
  • 1
    Are you sure that this behaviour is not connected to some network cache policy? The speed differences between Python and Ruby shouldn't be so massive... – Paolo Moretti Oct 30 '12 at 12:03
  • 1
    This might be silly, but have you thought of trying to walk through the output of `glob.glob` (using `os.path.basename`)? I doubt it would be faster, but it might be. As far as I can tell, that seems a little closer to what you're doing in `ruby`. With the python script, you probably have a decent amount of overhead inquiring the filesystem whether a give name is a directory, or whether it's a file ... – mgilson Oct 30 '12 at 12:16
  • @Paolo: yes i'm sure, i rerun the scripts several times in different order – peter Oct 30 '12 at 12:17
  • @Chinmay: i adapted my python code as such and it became a bit better, the timing is now an average of 98 seconds, still a huge difference – peter Oct 30 '12 at 12:18
  • @Peter: Ok, that isn't the problem then. Just out of curiosity, is there a Linux machine you can test this on? @mgilson, how would you change into a directory unless you know it is one. You'd still have to check the each file in the output of `glob.glob()` with `os.path.isdir()`. – Chinmay Kanchi Oct 30 '12 at 12:25
  • @ChinmayKanchi -- I don't know how `glob` is implemented exactly -- It seems feasible to me that it *could* have a more direct interface to the filesystem (though that may not be the case). – mgilson Oct 30 '12 at 12:35
  • @mgilson: I just looked at `glob.py` in the Python standard library. It doesn't do anything special, just uses a combination of `fnmatch`, regex and `os.listdir` to do its job. – Chinmay Kanchi Oct 30 '12 at 12:41
  • 1
    @mgilson, @peter: The `glob.glob` thing was a good idea, since it gave me a possible hint. The `/**/**` wildcard in `glob.glob()` only seems to go two directory levels deep, while `os.walk()` will enter directories up to arbitrary levels. Are you sure the Ruby wildcard pattern is right. A count of the number of results will probably give you a good idea if the results are identical or not. – Chinmay Kanchi Oct 30 '12 at 12:48
  • 1
    @ChinmayKanchi See http://ideone.com/jsj2lq for a proof that with the pattern `/**/**` you get deeper than 2 levels. – halex Oct 30 '12 at 12:55
  • @Chinmay: in ruby /**/** goes all levels deep, i tested it and got all files, can one of yoy publish a python version using glob ? then i can test it and we continue the discussion about that aspect under that answer – peter Oct 30 '12 at 13:01
  • 1
    The big variable nobody is stalking about is that this is a network folder. Can somebody prove this is true on a local folder? – Doug T. Oct 30 '12 at 13:06
  • @Doug: Even on a network, the effects shouldn't be so great. I can reproduce this discrepancy (for a network folder of ~2000 files, several levels deep) with Python 2.6, 2.7 and 3.2, vs. Ruby 1.9.3. My guess is, the Ruby walker is implemented in C and is less flexible (it's just an iterator over all the files, whereas in Python you can control the recursion by e.g. changing the dirs in-place). – Vinay Sajip Oct 30 '12 at 13:11
  • @VinaySajip yeah I mean there could be any number of library implementation reasons (network caching being an obvious one). Whoever answers this probably needs to be well versed in both the Ruby/Python std lib implementations. – Doug T. Oct 30 '12 at 13:16
  • I'd be curious to run Wireshark with these two implementations against a live network share and compare the traffic (assuming its not encrypted and can be easily analyzed) – Doug T. Oct 30 '12 at 13:17
  • 2
    @Doug: tested it on a local disk with +166000 files and here the difference is less: R:12 and P:16 – peter Oct 30 '12 at 13:29
  • Your Ruby code doesn't work for me. On a local folder, Ruby fails to find any `thumbs.db` files, while Python finds a fair number. – Chinmay Kanchi Oct 30 '12 at 13:36
  • @Chinmay: just tested again and Ruby found them all, must be an OS thing, i'm on Windows, can anyone test this on a network with a nix or mac ? – peter Oct 30 '12 at 13:42
  • I'm on Windows 7 as well. This is bizarre. If I understand the Ruby code correctly, it is supposed to print out the name of every `thumbs.db` file, right? – Chinmay Kanchi Oct 30 '12 at 13:48
  • @Chinmay: yes, just tested my published version to be sure, works, only changed path to 'e:/' for a local search – peter Oct 30 '12 at 14:18
  • Weirder and weirder, the Ruby code works with "C:/" but not with "C:/Users/" (with or without the trailing /). – Chinmay Kanchi Oct 30 '12 at 14:25
  • works with 'c:/users' on my vistabox at work where i have local admin rights, i'll try it at home on my Windows 7 box – peter Oct 30 '12 at 14:33
  • 2
    Can you add to your question the comment that you addressed to @Doug when you said: **tested it on a local disk with +166000 files and here the difference is less: R:12 and P:16** because i believe that with what you just found you can narrow your question to the difference between Python's ``os.walk('//network-filepath')`` and Ruby ``Dir["//network-filepath"]``. – mouad Oct 30 '12 at 15:15
  • @mouad: i believe the title of this question is clear enough that it is about a network crawl, and all be the difference less, even for local access is big enough, python has a reputation of being fast. I believe my question is as good as answered, there are no faster ways and in this case Ruby is just faster i guess. It would be interesting to get some results from other OS. And someone should wraps this all up in an answer. I believe it will be worth while considering the interest this question gets. – peter Oct 30 '12 at 16:03
  • I updated my answer in response to your comment. – Vinay Sajip Nov 01 '12 at 00:43
  • I updated my answer again in response to your next comment. – Vinay Sajip Nov 01 '12 at 15:13
  • Thanks for accepting my answer. There's a mistake in your updated code: to have a one-element tuple, you have to have a trailing comma thus: `('comics',)`. – Vinay Sajip Nov 01 '12 at 21:54
  • Since the slow operation in the original VBscript is creating 2 string objects every time around the loop, you can make it approx twice as fast by taking one string creation out of the loop *myLCfilename="thumbs.db"* and comparing file.name to the variable: *if lCase(file.name) = sMyLCfilename*. You can pick up another 10% by removing the line *set folder = oFso.getFolder(path)* and fixing up. The variable called Path is already a folder object, except at the first call. Existing code only works because you can call getfolder on a folder object. ~on my test, I go from 55sec to 33sec – david Aug 13 '13 at 11:54
  • thanks for the suggestion david and you are right of course but with your adapted code the difference was neglectible on my system, i suppose you tested this on a LOCAL drive and not on a networkshare, as indicated in the comments above and this makes a huge difference.. In any way i'll adapt the vbscript part of the question with your improvements – peter Aug 13 '13 at 14:46

2 Answers2

9

The Ruby implementation for Dir is in C (the file dir.c, according to this documentation). However, the Python equivalent is implemented in Python.

It's not surprising that Python is less performant than C, but the approach used in Python gives a little more flexibility - for example, you could skip entire subtrees named e.g. '.svn', '.git', '.hg' while traversing a directory hierarchy.

Most of the time, the Python implementation is fast enough.

Update: The skipping of files/subdirs doesn't affect the traversal rate at all, but the overall time taken to process a directory tree could certainly be reduced because you avoid having to traverse potentially large subtrees of the main tree. The time saved is of course proportional to how much you skip. In your case, which looks like folders of images, it's unlikely you would save much time (unless the images were under revision control, when skipping subtrees owned by the revision control system might have some impact).

Additional update: Skipping folders is done by changing the dirs value in place:

for root, dirs, files in os.walk(path):
    for skip in ('.hg', '.git', '.svn', '.bzr'):
        if skip in dirs:
            dirs.remove(skip)
        # Now process other stuff at this level, i.e.
        # in directory "root". The skipped folders
        # won't be recursed into.
Vinay Sajip
  • 95,872
  • 14
  • 179
  • 191
  • 1
    I think that was the point. He _wants_ it to go through all the files and perform some operation while producing minimal output. He does the same thing in the Ruby code, as far as I can tell. – Chinmay Kanchi Oct 30 '12 at 11:47
  • 1
    @Chinmay: Thanks for pointing out my reading deficiency. I updated my answer after reading more carefully :-) – Vinay Sajip Oct 31 '12 at 17:46
  • @Vinay, and does this skipping affect the speed ? does the search go faster when you skip the subtrees ? In Ruby you can also .reject some of the results but your search won't go faster – peter Oct 31 '12 at 20:27
  • @Vinay, could you show how to skip folders in Python so that i can test if they are processed or not ? In Ruby they are processed anyway and a'm not that far in Python already – peter Nov 01 '12 at 11:03
  • Vinay, had to adapt the code, see my edited question, thanks for the help, in my case on a networkshare and without folders that i can exclude i'm better of taking Ruby, hope this helps a lot of Ruby AND Python beginners – peter Nov 01 '12 at 16:42
2

I setup directory structure with the following locally:

for i in $(seq 1 4500); do
    if [[ $i -lt 100 ]]; then
        dir="$(for j in $(seq 1 $i); do echo -n $i/;done)"
        mkdir -p "$dir"
        touch ${dir}$i
    else
        touch $i
    fi
done

This creates 99 files with paths that are 1-99 levels deep and 4401 files in the root of the directory structure.

I used the following ruby script:

#!/usr/bin/env ruby
require 'benchmark'

def recursive(path, bench)
  bench.report(path) do
    Dir["#{path}/**/**"]
  end
end

path = 'files'
Benchmark.bm {|bench| recursive(path, bench)}

I got the following result:

           user     system      total        real
    files/  0.030000   0.090000   0.120000 (  0.108562)

I use the following python script using os.walk:

#!/usr/bin/env python

import os
import timeit

def path_recurse(path):
    for (path, dirs, files) in os.walk(path):
      for folder in dirs:
          yield '{}/{}'.format(path, folder)
      for filename in files:
          yield '{}/{}'.format(path, filename)

if __name__ == '__main__':
    path = 'files'
    print(timeit.timeit('[i for i in path_recurse("'+path+'")]', setup="from __main__ import path_recurse", number=1))

I got the following result:

    0.250478029251

So, it looks like ruby is still performing better. It'd be interesting to see how this one performs on your fileset on the network share.

It would probably also be interesting to see this script run on python3 and with jython and maybe even with pypy.

Wren T.
  • 596
  • 1
  • 5
  • 17