
Marked this as answered and started a simpler topic around where the speed issue really appears to be:

Python slow read performance issue

Thanks for all the comments to date, very useful.

I have around 40M XML files spread (not evenly) across approx. 60K subdirectories; the structure is based on a 10-digit number split like so:

12/34/56/78/90/files.xml

I have a Perl script which runs against the files, pulling out the value of a single field and printing the value and the filename. The Perl script is wrapped in a bash script which runs a maximum of 12 parallel instances across a list of all the directories at depth 2, walking down each one and processing the files at the bottom level as it finds them.

Taking disk caching out of the picture, across multiple runs a unix time of the process returns approximately:

real    37m47.993s
user    49m50.143s
sys     54m57.570s

I wanted to migrate this to a Python script (as a learning exercise and test), so I created the following (after a lot of reading up on Python methods for various things):

    import glob, os, re
    from multiprocessing import Pool

    regex = re.compile(r'<field name="FIELDNAME">([^<]+)<', re.S)

    def extractField(root, dataFile):
        line = ''
        filesGlob = root + '/*.xml'
        global regex
        for file in glob.glob(filesGlob):
            with open(file) as x:
                f = x.read()
            match = regex.search(f)
            line += file + '\t' + match.group(1) + '\n'

        dataFile.write(line)

    def processDir(top):
        topName = top.replace("/", "")
        dataFile = open('data/' + topName + '.data', 'w')
        extractField(top, dataFile)
        dataFile.close()

    filesDepth5 = glob.glob('??/??/??/??/??')
    dirsDepth5 = filter(lambda f: os.path.isdir(f), filesDepth5)
    processPool = Pool(12)
    processPool.map(processDir, dirsDepth5)
    processPool.close()
    processPool.join()

But no matter how I slice it, when I run the script unix time gives me this kind of result:

real    131m48.731s
user    35m37.102s
sys     48m11.797s

If I run both the Python and Perl scripts single-threaded against a small subset (which ends up fully cached) so there is no disk IO (according to iotop), the scripts run in almost identical times.

The only conclusion I can come to so far is that file IO is much less efficient in the Python script than in the Perl script, as it seems to be IO that is causing the issue.
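To sanity-check that in isolation, a minimal single-threaded sketch like the one below could be used to time just the reads, taking the regex and the pool out of the picture (the `12/34` path is just a placeholder for any depth-2 directory):

    import glob, os, time

    SAMPLE_ROOT = '12/34'  # placeholder: any depth-2 directory from the real layout

    start = time.time()
    total_bytes = 0
    # Files live three levels below a depth-2 directory: 12/34/56/78/90/*.xml
    for path in glob.iglob(os.path.join(SAMPLE_ROOT, '*', '*', '*', '*.xml')):
        with open(path, 'rb') as fh:  # binary read, no decoding overhead
            total_bytes += len(fh.read())
    elapsed = time.time() - start

    print('read %d bytes in %.2fs (%.1f MB/s)'
          % (total_bytes, elapsed, total_bytes / max(elapsed, 1e-9) / 1e6))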

So hopefully that's enough background. My question is: am I doing something stupid or missing a trick? I'm running out of ideas, but I can't believe the IO is causing such a difference in processing times.

Appreciate any pointers and will provide more info as/if required.

Thanks

Si

For reference, the Perl script is below:

use File::Find;

my $cwd = `pwd`;
chomp $cwd;
find( \&hasxml, shift );

sub hasxml {
    if (-d) {
        my @files = <$_/*.xml>;
        if ( scalar(@files) > 0 ) {
            process("$cwd/${File::Find::dir}/$_");
        }
    }
}

sub process {
    my $dir = shift;

    my @files = <$dir/*.xml>;

    foreach my $file (@files) {
        my $fh;
        open( $fh, "< $file" ) or die "Could not read file <$file>";
        my $contents = do { local $/; <$fh> };
        close($fh);
        my ($id) = $contents =~ /<field name="FIELDNAME">([^<]+)<\/field>/s;
        print "$file\t<$id>\n";
    }
}
Simon
  • maybe this will help you: http://stackoverflow.com/questions/14863224/efficient-reading-of-800-gb-xml-file-in-python-2-7 I think you should profile your application, to see where it spends the most time, otherwise it is a bit of guessing... – m.wasowski Oct 01 '14 at 22:44
  • I will take a look thanks, ah should have added the XML files I'm reading are at most 4K each in size but I will try the buffering mentioned to see. I tried profiling but when I run - python -m cProfile script.py it barfs, I suspect it doesn't know how to handle the multiprocessing element but I confess I've not looked at the error in detail. – Simon Oct 01 '14 at 22:47
  • user/sys is lower (hinting python is faster) but real is much higher (hinting we are not getting parallelization)... The parent/child protocol is significant. Try `processPool.map(processDir, dirsDepth5, chunksize=16)` which sends jobs in larger batches, to see if it makes a difference. – tdelaney Oct 01 '14 at 23:02
  • Interesting hadn't seen that parameter, will give it a try along with working out how to profile a multiprocess script having read a few posts on how to do it in code rather than from the command line which I've read doesn't play nice with multiprocess scripts! – Simon Oct 01 '14 at 23:06
  • Another thought, map() builds up a list of results - 40 million in your case. Try `for _ in processPool.imap_unordered(processDir, dirsDepth5, chunksize=16): pass`. – tdelaney Oct 01 '14 at 23:21
  • @tdelaney not sure I follow on the last comment, the size of dirsDepth5 is approx 60K then under those 60K dirs are the 40M files. – Simon Oct 01 '14 at 23:31
  • Oops, steered you wrong there... I was thinking about how much time the glob/isdir part of the code was taking and kinda jumped the gun on the map part. It would be interesting to time the globbing and see if using iglob and filtering before building the list - `[path for path in glob.iglob('??/??/??/??/??') if os.path.isdir(path)]` - is faster. – tdelaney Oct 01 '14 at 23:53
  • @m.wasowski managed to get profiling working on the multiprocessing, the bottleneck is read(). Running iotop the Perl script with 6 processes maintains a DISK READ in the 90M/s range, the Python with 6 processes never gets much above 30M/s. So seems Perl is reading from disk 3 times faster. I tried adding "buffering=(200*1024) + 8" to make sure it always got the file in one go but that made no difference. – Simon Oct 02 '14 at 10:36
  • @tdelaney thanks for that tip, seems it's the disk read() that's the issue, see previous comment, trying to work out why! – Simon Oct 02 '14 at 10:37

2 Answers


Depending on the structure of your XML files, you may be able to get some time savings by using mmap. Currently, you're reading in the entire file even though you're only interested in one entry. If your data tends to occur near the top of your files, you can map the file into memory instead of actually reading it in, perform your regex search exactly the same as you already are, and be done with it.

Here is a comparison of the two methods:

I have a text file called "tmp_large.txt" which has 1,000,000 lines in it. Each line has the alphabet in lowercase. In just one line about halfway through the file, I've replaced the letter 'm' with 'x', and I'm searching for that string:

import re
import mmap

from timeit import timeit
from datetime import timedelta

c_regex = re.compile(b'defghijklxnopqrstuvwx') # bytes pattern so the same regex can search both the bytes read in and the mmap

def read_file():
    with open('tmp_large.txt', 'rb') as fi: # read as bytes to match the bytes pattern
        f = fi.read()
        match = c_regex.search(f)

def mmap_file():
    with open('tmp_large.txt', 'r+b') as fi: # must open as binary for mmap
        mm = mmap.mmap(fi.fileno(), 0)
        match = c_regex.search(mm)
        mm.close()

t1 = timedelta(seconds=timeit(read_file, setup='gc.enable()', number=1))
t2 = timedelta(seconds=timeit(mmap_file, setup='gc.enable()', number=1))

print(t1)
print(t2)

This scenario produces this output:

0:00:00.036021
0:00:00.028974

We see an execution time savings of roughly 20%. However, if I place the string I'm looking for at the top of the input file, we see this result:

0:00:00.009327
0:00:00.000338

Obviously both methods are faster, but the time savings are much more significant for the memory-mapped method.

Since I don't know the structure of your data or how large your files are, you may see less dramatic results from this. But as long as the data you're seeking isn't at the end of your target files, you'll probably see a bit of an improvement from memory mapping your files, since it will avoid bringing data into memory that you don't actually end up using.

As a side note, I also tried iterating over lines in the file until I hit a line that matched the regex, but it was too slow to bother including here. Also, I did confirm that the regex was actually matching in my example, but I removed the print code and results for brevity.
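Applied to the extractField loop from your question, the memory-mapped variant might look roughly like this (a sketch only; like your original, it assumes every file contains a match, and it decodes the captured group as UTF-8):

    import glob, mmap, os, re

    regex = re.compile(b'<field name="FIELDNAME">([^<]+)<')  # bytes pattern so it can search the mmap

    def extractField(root, dataFile):
        lines = []
        for path in glob.glob(os.path.join(root, '*.xml')):
            with open(path, 'rb') as fh:
                mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
                try:
                    match = regex.search(mm)
                    lines.append(path + '\t' + match.group(1).decode('utf-8') + '\n')
                finally:
                    mm.close()
        dataFile.write(''.join(lines))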

As suggested in the comments, using the iterator iglob and replacing map with something like apply_async may help speed things up too, because they would both help reduce your memory footprint:

processPool = Pool(12)

for dir_or_file in glob.iglob('??/??/??/??/??'):
    if os.path.isdir(dir_or_file):
        processPool.apply_async(processDir, (dir_or_file,))

processPool.close()
processPool.join()

This approach would also allow your subprocesses to get started on processing the first files while you're still identifying the rest.
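An alternative along the same lines, suggested in the comments, is imap_unordered with a chunksize, which batches the directories sent to each worker and likewise avoids building the full list up front (a sketch; the chunksize value is just a starting point to experiment with):

    processPool = Pool(12)

    dirs = (d for d in glob.iglob('??/??/??/??/??') if os.path.isdir(d))
    for _ in processPool.imap_unordered(processDir, dirs, chunksize=16):
        pass  # processDir writes its own output file; we only drain the iterator

    processPool.close()
    processPool.join()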

A few other code notes for you:

  1. You don't need the re.S flag on your regex since you don't actually have any '.' in the regex pattern.
  2. Unless you have some compelling reason not to, you should open your output file with the with open() construct the same way you're opening your input files to prevent stray open file descriptors if you have an exception.
  3. When you're calculating the dataFile and filesGlob, consider using os.path.join() instead of manually adding in the path separators. It'll be less error-prone in the long run.
  4. You don't need your global regex line. You can always read and call methods of global objects without it, just like my example does. You only need it when you are going to modify a global. (A combined sketch applying these notes follows this list.)
  5. Just in case you're not aware of it, multiprocessing pools will, by default, only launch as many workers as you have CPU cores. If you already knew that, please ignore this comment. Specifying 12 processes for your pool just seemed a bit odd to me.
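Putting notes 1-4 together, the two functions might be reworked roughly like this (a sketch that keeps your original logic otherwise):

    import glob, os, re

    regex = re.compile(r'<field name="FIELDNAME">([^<]+)<')  # note 1: re.S dropped

    def extractField(root, dataFile):
        lines = []
        for path in glob.glob(os.path.join(root, '*.xml')):  # note 3: os.path.join
            with open(path) as fh:
                contents = fh.read()
            match = regex.search(contents)                    # note 4: no "global regex" needed just to read it
            lines.append(path + '\t' + match.group(1) + '\n')
        dataFile.write(''.join(lines))

    def processDir(top):
        outName = os.path.join('data', top.replace('/', '') + '.data')
        with open(outName, 'w') as dataFile:                  # note 2: with open() for the output file too
            extractField(top, dataFile)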
skrrgwasme
  • That's some great info, thanks. I've done what you mentioned and it has improved things somewhat, but it seems the disk read speed is still an issue: even with the changes the Perl version vastly outperforms the Python one; on a set of around 250k files totalling approx. 1GB, Python took 35s and Perl 17s. I've started a new cut-down question regarding this https://stackoverflow.com/questions/26178038/python-slow-read-performance-issue . Thanks for your suggestions, I learnt a lot in the process :-) – Simon Oct 03 '14 at 11:23
  • @Simon: you could use `iglob()` with `for result in pool.imap_unordered(process_dir, (d for d in iglob(...) if isdir(d)), chunksize=1000): ...` to avoid queuing all files at once. Make sure that you catch and log (or return) all exceptions in `process_dir()`. – jfs Oct 03 '14 at 16:48
  • @Simon: as far as I know `mmap()` doesn't improve performance on Linux but it might on Windows. But `mmap()` is convenient if you need to treat potentially large file as a bytestring as @SLawson showed with the regex. If it is xml file, you could use `etree.iterparse()` to [avoid using regexes to parse xml/html](http://stackoverflow.com/a/1732454/4279) :) – jfs Oct 03 '14 at 16:56
  • @J.F.Sebastian mmap certainly *can* provide speed improvements over `file.read()` on Linux, as the above test shows (I'm using RHEL 6). In fact, I just ran it again with the target line placed in the very last line of the file, and memory mapping was still consistently (very) slightly faster than `file.read()`. – skrrgwasme Oct 03 '14 at 17:09
  • @SLawson: I think it is a benchmark artifact. See [map vs reading blocks](http://stackoverflow.com/q/45972/4279) – jfs Oct 03 '14 at 17:36
  • @J.F.Sebastian What do you mean by benchmark artifacts? I can't think of any reason timeit shouldn't be accurately reflecting the amount of time it takes to execute the command. Your link has some great information about mmap vs normal read, but I don't really see how it refutes the timing results. Can you explain that a bit? If that's too big a question for comments, I'll happily go ask a new question about it or move to chat instead. – skrrgwasme Oct 03 '14 at 18:24
  • Your benchmark (if correct) measures a slightly different thing: it shows that mmap might be faster than reading the whole file if we can stop reading the mmap sooner (the regex found a match); it is like reading line by line, but mmap allows a multiline regex. In general, if [the link](http://stackoverflow.com/q/45972/4279) hasn't convinced you then I won't either. – jfs Oct 03 '14 at 18:39
  • @J.F.Sebastian I understand your comment now, and I wholeheartedly agree. My example is contrived, and the performance differences are definitely dependent on the location of the target data within the files. But I thought I emphasized that point well enough in my answer, because I mentioned it twice. I wasn't trying to make a claim about mmap io vs block io in general, just that *if* the OPs files fit this pattern, then in this *specific* case, he could see some benefit from using mmap. I'll look over my answer and consider editing to clarify this. – skrrgwasme Oct 03 '14 at 18:49
  • For the record and clarification, I agree mmap will be slightly faster, as the pattern I want is 90% of the time in the first third of the file; however, the files are on average only 10kb in size so it doesn't net me much gain. But from an "I prefer to write scripts that are as fast/efficient as they can be, as you never know when the data may change" point of view, the info above has been of great help. – Simon Oct 03 '14 at 19:47

EDIT

Forgot to add thanks to the contributors on this thread:

Python slow read performance issue

who helped me solve this.

EDIT

It all boiled down in the end to the order of the directory reads; this applied to my main application as well as the tests.

Basically, Perl's glob returns files sorted lexicographically by default (i.e. 1, 11, 2, 22), Python's glob returns them in directory order (like ls -U), and the files were created in natural order (1, 2, 3, 4), so I took the original Python slurp and created a slurpNatural after searching Stack Overflow for a simple natural sort:

import glob, sys, re

def natural_sort_key(s, _nsre=re.compile('([0-9]+)')):
    return [int(text) if text.isdigit() else text.lower()
            for text in re.split(_nsre, s)]

for file in sorted(glob.iglob(sys.argv[1] + '/*.xml'), key=natural_sort_key):
    with open(file) as x:
        f = x.read()
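For illustration (with made-up file names), the difference between the default lexicographic sort and the natural key:

    names = ['1.xml', '2.xml', '11.xml', '22.xml']

    print(sorted(names))                        # lexicographic: ['1.xml', '11.xml', '2.xml', '22.xml']
    print(sorted(names, key=natural_sort_key))  # natural:       ['1.xml', '2.xml', '11.xml', '22.xml']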

I then ran all 3 against the 50K docs and got:

$ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ /usr/bin/time perl slurp.pl 1
1.21user 2.17system 0:12.70elapsed 26%CPU (0avgtext+0avgdata 9140maxresident)k
1234192inputs+0outputs (22major+2466minor)pagefaults 0swaps

$ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ /usr/bin/time python slurp.py 1
2.88user 6.13system 4:48.00elapsed 3%CPU (0avgtext+0avgdata 8020maxresident)k
1236304inputs+0outputs (35major+52252minor)pagefaults 0swaps

$ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ /usr/bin/time python slurpNatural.py 1
1.25user 2.82system 0:10.70elapsed 38%CPU (0avgtext+0avgdata 22408maxresident)k
1237360inputs+0outputs (35major+56531minor)pagefaults 0swaps

The natural sort, which mirrors the creation order, is clearly the fastest; in this case it matches how my actual data is created, so I have now changed the Python to sort the directory contents before processing.
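In the original extractField loop that change is just a sorted() around the glob, using natural_sort_key from above (sketch):

    for file in sorted(glob.glob(filesGlob), key=natural_sort_key):
        with open(file) as x:
            f = x.read()
        match = regex.search(f)
        line += file + '\t' + match.group(1) + '\n'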

Thanks for all the help, I honestly didn't think the order of reading the files would make such a big difference!

Si

Simon
  • I know this answer has been here for a while, but I just now noticed it. I think you should consider accepting your own answer instead of mine. It looks like it does a better job of addressing the real issue in your question. I definitely learned something new from reading it, and think others will as well. Accepting it will bump it up to the top to guarantee it is seen (it might not right now, since mine is accepted and has more votes). – skrrgwasme Aug 12 '15 at 17:46