Update: marked this as answered and started a simpler topic around where the speed issue really appears to be.
Python slow read performance issue
Thanks for all the comments to date, very useful.
I have around 40M XML files spread (not evenly) across approximately 60K subdirectories; the structure is based on a 10-digit number split like so:
12/34/56/78/90/files.xml
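For clarity, the mapping from a 10-digit id to its leaf directory is just the number split into five 2-digit chunks. A minimal illustration in Python (the id here is made up):

doc_id = '1234567890'                                       # example 10-digit id
path = '/'.join(doc_id[i:i + 2] for i in range(0, 10, 2))
print(path)                                                 # -> 12/34/56/78/90, the .xml files sit in that directory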
I have a Perl script which runs against the files, pulling the value of a single field out of each and printing the value and the filename. The Perl script is wrapped in a bash script which runs a maximum of 12 parallel instances across a list of all the directories at depth 2, then walks down each one and processes the files at the bottom level as it finds them.
Taking disk caching out of the equation across multiple runs, a unix time of the process returns approximately:
real 37m47.993s
user 49m50.143s
sys 54m57.570s
I wanted to migrate this to a Python script (as a learning exercise and test), so I created the following (after a lot of reading up on Python methods for various things):
import glob, os, re
from multiprocessing import Pool

regex = re.compile(r'<field name="FIELDNAME">([^<]+)<', re.S)

def extractField(root, dataFile):
    line = ''
    filesGlob = root + '/*.xml'
    global regex
    for file in glob.glob(filesGlob):
        with open(file) as x:
            f = x.read()
        match = regex.search(f)
        line += file + '\t' + match.group(1) + '\n'
    dataFile.write(line)

def processDir(top):
    topName = top.replace("/", "")
    dataFile = open('data/' + topName + '.data', 'w')
    extractField(top, dataFile)
    dataFile.close()

filesDepth5 = glob.glob('??/??/??/??/??')
dirsDepth5 = filter(lambda f: os.path.isdir(f), filesDepth5)
processPool = Pool(12)
processPool.map(processDir, dirsDepth5)
processPool.close()
processPool.join()
But no matter how I slice the content, when I run it unix time gives me this kind of result:
real 131m48.731s
user 35m37.102s
sys 48m11.797s
If I run both the Python and Perl scripts in a single thread against a small subset (which ends up getting fully cached), so there is no disk IO (according to iotop), then the scripts run in almost identical times.
The only conclusion I can come to so far is that the file IO is much less efficient in the Python script than it is in the Perl script, as it seems to be the IO that is causing the issue.
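To show what I mean by the IO being the suspect, something along these lines would separate the read cost from the regex cost on a single leaf directory (illustrative only; 12/34/56/78/90 is a placeholder path, not a benchmark I'm quoting numbers from):

import glob, re, time

regex = re.compile(r'<field name="FIELDNAME">([^<]+)<', re.S)
files = glob.glob('12/34/56/78/90/*.xml')        # placeholder leaf directory

t0 = time.time()
contents = [open(f).read() for f in files]       # read cost only
t1 = time.time()
matches = [regex.search(c) for c in contents]    # regex cost only
t2 = time.time()
print('read: %.2fs  regex: %.2fs' % (t1 - t0, t2 - t1))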
So hopefully that's enough background. My question is: am I doing something stupid or missing a trick? I'm running out of ideas, but can't believe the IO is causing such a difference in processing times.
Appreciate any pointers and will provide more info as/if required.
Thanks
Si
For reference, the Perl script is below:
use File::Find;

my $cwd = `pwd`;
chomp $cwd;

find( \&hasxml, shift );

sub hasxml {
    if (-d) {
        my @files = <$_/*.xml>;
        if ( scalar(@files) > 0 ) {
            process("$cwd/${File::Find::dir}/$_");
        }
    }
}

sub process {
    my $dir = shift;

    my @files = <$dir/*.xml>;

    foreach my $file (@files) {
        my $fh;
        open( $fh, "< $file" ) or die "Could not read file <$file>";
        my $contents = do { local $/; <$fh> };
        close($fh);
        my ($id) = $contents =~ /<field name="FIELDNAME">([^<]+)<\/field>/s;
        print "$file\t<$id>\n";
    }
}