Python slow read performance issue

Question

Following an earlier thread I boiled down my problem to it's bare bones, in migrating from a Perl script to a Python one I found a huge performance issue with slurping files in Python. Running this on Ubuntu Server.

NB: this is not a X vs. Y thread I need to know fundamentally if this is how it is or if I'm doing something stupid.

I created my test data, 50,000 10kb files (this mirrors the avg file size of what I'm processing):

mkdir 1
cd 1
for i in {1..50000}; do dd if=/dev/zero of=$i.xml bs=1 count=10000; done
cd ..
cp -r 1 2

Created my 2 scripts as simply as possible:

Perl

foreach my $file (<$ARGV[0]/*.xml>){
    my $fh;
    open($fh, "< $file");
    my $contents = do { local $/; <$fh> };
    close($fh);
}

Python

import glob, sys
for file in glob.iglob(sys.argv[1] + '/*.xml'):
    with open(file) as x:
        f = x.read()

I then cleared the caches and ran my 2 slurp scripts, between each run I cleaned the caches again using:

sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Then monitored to ensure it was reading everything from disk each time:

sudo iotop -a -u me

I tried this on a physical machine with RAID 10 disks and on a brand new VM I setup where the VM is on RAID 1 SSDs, have just included the test runs from my VM as the physical server was much the same just faster.

$ time python readFiles.py 1
    real    5m2.493s
    user    0m1.783s
    sys     0m5.013s

$ time perl readFiles.pl 2
    real    0m13.059s
    user    0m1.690s
    sys     0m2.471s

$ time perl readFiles.pl 2
    real    0m13.313s
    user    0m1.670s
    sys     0m2.579s

$ time python readFiles.py 1
    real    4m43.378s
    user    0m1.772s
    sys     0m4.731s

I noticed on iotop when Perl was running DISK READ was around 45 M/s and IOWAIT approx 70%, when running Python DISK READ was 2M/s and IOWAIT 97%. I'm not sure where to go from here having boiled them down to as simple as I can get.

In case it is relevant

$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2

$ perl -v
This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi

FURTHER INFO AS REQUESTED

I ran strace and grabbed the info for file 1000.xml but all seem to do the same things:

Perl

$strace -f -T -o trace.perl.1 perl readFiles.pl 2

32303 open("2/1000.xml", O_RDONLY)      = 3 <0.000020>
32303 ioctl(3, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7fff7f6f7b90) = -1 ENOTTY (Inappropriate ioctl for device) <0.000016>
32303 lseek(3, 0, SEEK_CUR)             = 0 <0.000016>
32303 fstat(3, {st_mode=S_IFREG|0664, st_size=10000, ...}) = 0 <0.000016>
32303 fcntl(3, F_SETFD, FD_CLOEXEC)     = 0 <0.000017>
32303 fstat(3, {st_mode=S_IFREG|0664, st_size=10000, ...}) = 0 <0.000030>
32303 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192 <0.005323>
32303 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 1808 <0.000022>
32303 read(3, "", 8192)                 = 0 <0.000019>
32303 close(3)                          = 0 <0.000017>

Python

$strace -f -T -o trace.python.1 python readFiles.py 1

32313 open("1/1000.xml", O_RDONLY)      = 3 <0.000021>
32313 fstat(3, {st_mode=S_IFREG|0664, st_size=10000, ...}) = 0 <0.000017>
32313 fstat(3, {st_mode=S_IFREG|0664, st_size=10000, ...}) = 0 <0.000019>
32313 lseek(3, 0, SEEK_CUR)             = 0 <0.000018>
32313 fstat(3, {st_mode=S_IFREG|0664, st_size=10000, ...}) = 0 <0.000018>
32313 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa18820a000 <0.000019>
32313 lseek(3, 0, SEEK_CUR)             = 0 <0.000018>
32313 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192 <0.006795>
32313 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 1808 <0.000031>
32313 read(3, "", 4096)                 = 0 <0.000018>
32313 close(3)                          = 0 <0.000027>
32313 munmap(0x7fa18820a000, 4096)      = 0 <0.000022>

One difference I noticed, not sure if it's relevant, is that Perl appears to run this against all files before it starts opening them whereas python doesn't:

32303 lstat("2/1000.xml", {st_mode=S_IFREG|0664, st_size=10000, ...}) = 0 <0.000022>

Also ran strace with -c (just took top few calls):

Perl

$ time strace -f -c perl readFiles.pl 2
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 44.07    3.501471          23    150018           read
 12.54    0.996490          10    100011           fstat
  9.47    0.752552          15     50000           lstat
  7.99    0.634904          13     50016           open
  6.89    0.547016          11     50017           close
  6.19    0.491944          10     50008     50005 ioctl
  6.12    0.486208          10     50014         3 lseek
  6.10    0.484374          10     50001           fcntl

real    0m37.829s
user    0m6.373s
sys     0m25.042s

Python

$ time strace -f -c python readFiles.py 1
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 42.97    4.186173          28    150104           read
 15.58    1.518304          10    150103           fstat
 10.51    1.023681          20     50242       174 open
 10.12    0.986350          10    100003           lseek
  7.69    0.749387          15     50047           munmap
  6.85    0.667576          13     50071           close
  5.90    0.574888          11     50073           mmap

real    5m5.237s
user    0m7.278s
sys     0m30.736s

Did some parsing of the strace output with -T turned on and counted the first 8192 byte read for each file and it's clear this is where the time is going, below is the total time spent for the 50000 first reads of a file followed by the average time for each read.

300.247128000002 (0.00600446220302379)   - Python
11.6845620000003 (0.000233681892724297)  - Perl

Not sure if that helps!

UPDATE 2 Updated code in Python to use os.open and os.read and just do a single read of first 4096 bytes (that would work for me as info I want is in top part of file), also eliminates all the other calls in strace:

18346 open("1/1000.xml", O_RDONLY)      = 3 <0.000026>
18346 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 <0.007206>
18346 close(3)                          = 0 <0.000024>

$ time strace -f -c python readFiles.py 1
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 55.39    2.388932          48     50104           read
 22.86    0.986096          20     50242       174 open
 20.72    0.893579          18     50071           close

real    4m48.751s
user    0m3.078s
sys     0m12.360s

Total Time (avg read call)
282.28626 (0.00564290374812595)

Still no better...next up I'm going to create a VM on Azure and try there for another example!!

UPDATE 3 - Apologies for the size of this!!

Ok some interesting results using your (@J.F.Sebastian) script on 3 setups, stripped the output at start for brevity and also removed all the tests which just run super fast from cache and look like:

0.23user 0.26system 0:00.50elapsed 99%CPU (0avgtext+0avgdata 9140maxresident)k
0inputs+0outputs (0major+2479minor)pagefaults 0swaps

Azure A2 Standard VM (2 cores 3.5GB RAM Disk Unknown but slow)

$ uname -a
Linux servername 3.13.0-35-generic #62-Ubuntu SMP Fri Aug 15 01:58:42 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
$ perl -v
This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi
(with 41 registered patches, see perl -V for more detail)

+ /usr/bin/time perl slurp.pl 1
1.81user 2.95system 3:11.28elapsed 2%CPU (0avgtext+0avgdata 9144maxresident)k
1233840inputs+0outputs (20major+2461minor)pagefaults 0swaps
+ clearcache
+ sync
+ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
+ /usr/bin/time python slurp.py 1
1.56user 3.76system 3:06.05elapsed 2%CPU (0avgtext+0avgdata 8024maxresident)k
1232232inputs+0outputs (14major+52273minor)pagefaults 0swaps
+ /usr/bin/time perl slurp.pl 2
1.90user 3.11system 6:02.17elapsed 1%CPU (0avgtext+0avgdata 9144maxresident)k
1233776inputs+0outputs (16major+2465minor)pagefaults 0swaps

Comparable first slurp results for both, not sure what was going on during the 2nd Perl slurp?

My VMWare Linux VM (2 cores 8GB RAM Disk RAID1 SSD)

$ uname -a
Linux servername 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
$ perl -v
This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi
(with 41 registered patches, see perl -V for more detail)

+ /usr/bin/time perl slurp.pl 1
1.66user 2.55system 0:13.28elapsed 31%CPU (0avgtext+0avgdata 9136maxresident)k
1233152inputs+0outputs (20major+2460minor)pagefaults 0swaps
+ clearcache
+ sync
+ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
+ /usr/bin/time python slurp.py 1
2.10user 4.67system 4:45.65elapsed 2%CPU (0avgtext+0avgdata 8012maxresident)k
1232056inputs+0outputs (14major+52269minor)pagefaults 0swaps
+ /usr/bin/time perl slurp.pl 2
2.13user 4.11system 5:01.40elapsed 2%CPU (0avgtext+0avgdata 9140maxresident)k
1233264inputs+0outputs (16major+2463minor)pagefaults 0swaps

This time, as before, Perl is way faster on first slurp, unsure what is happening on second Perl slurp though not seen this behaviour before. Ran measure.sh again and result was exactly the same give or take a few seconds. I then did what any normal person would do and updated the kernel to match the Azure machine 3.13.0-35-generic and ran measure.sh again and made no difference to results.

Out of curiosity I then swapped the 1 and 2 parameter in measure.sh and something strange happened..Perl slowed down and Python sped up!

+ /usr/bin/time perl slurp.pl 2
1.78user 3.46system 4:43.90elapsed 1%CPU (0avgtext+0avgdata 9140maxresident)k
1234952inputs+0outputs (21major+2458minor)pagefaults 0swaps
+ clearcache
+ sync
+ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
+ /usr/bin/time python slurp.py 2
1.19user 3.09system 0:10.67elapsed 40%CPU (0avgtext+0avgdata 8012maxresident)k
1233632inputs+0outputs (14major+52269minor)pagefaults 0swaps
+ /usr/bin/time perl slurp.pl 1
1.36user 2.32system 0:13.40elapsed 27%CPU (0avgtext+0avgdata 9136maxresident)k
1232032inputs+0outputs (17major+2465minor)pagefaults 0swaps

This has just confused me even further :-(

Physical Server (32 cores 132 GB RAM Disk RAID10 SAS)

$ uname -a
Linux servername 3.5.0-23-generic #35~precise1-Ubuntu SMP Fri Jan 25 17:13:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$ python
Python 2.7.3 (default, Aug  1 2012, 05:14:39)
[GCC 4.6.3] on linux2
$ perl -v
This is perl 5, version 14, subversion 2 (v5.14.2) built for x86_64-linux-gnu-thread-multi
(with 55 registered patches, see perl -V for more detail)

+ /usr/bin/time perl slurp.pl 1
2.22user 2.60system 0:15.78elapsed 30%CPU (0avgtext+0avgdata 43728maxresident)k
1233264inputs+0outputs (15major+2984minor)pagefaults 0swaps
+ clearcache
+ sync
+ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
+ /usr/bin/time python slurp.py 1
2.51user 4.79system 1:58.53elapsed 6%CPU (0avgtext+0avgdata 34256maxresident)k
1234752inputs+0outputs (16major+52385minor)pagefaults 0swaps
+ /usr/bin/time perl slurp.pl 2
2.17user 2.95system 0:06.96elapsed 73%CPU (0avgtext+0avgdata 43744maxresident)k
1232008inputs+0outputs (14major+2987minor)pagefaults 0swaps

Here Perl seems to win every time.

baffled

Given the oddity on my local VM, when I swapped directories, which is the machine I have most control over I'm going to try a binary approach on all the possible options of running python vs perl using 1 or 2 as the data directory and try to run them multiple times for consistency but it'll take a while and I'm going a little crazy so break may be required first! All I want is consistency :-(

UPDATE 4 - Consistency

(Below is run on an ubuntu-14.04.1-server VM, Kernel is 3.13.0-35-generic #62-Ubuntu)

I think I've found some consistency, running the tests every way possible for Python/Perl slurp on data dir 1/2 I found the following:

Python is always slow on created files (i.e. created by dd)
Python is always fast on copied files (i.e. created by cp -r)
Perl is always fast on created files (i.e. created by dd)
Perl is always slow on copied files (i.e. created by cp -r)

So I looked at OS level copying and it seems like on Ubuntu 'cp' behaves in the same way as Python, i.e. slow on original files and fast on copied files.

This is what I ran and the results, I did this a few times on a machine with a single SATA HD and on a RAID10 system, results:

$ mkdir 1
$ cd 1
$ for i in {1..50000}; do dd if=/dev/urandom of=$i.xml bs=1K count=10; done
$ cd ..
$ cp -r 1 2
$ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time strace -f -c -o trace.copy2c cp -r 2 2copy
    real    0m28.624s
    user    0m1.429s
    sys     0m27.558s
$ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time strace -f -c -o trace.copy1c cp -r 1 1copy
    real    5m21.166s
    user    0m1.348s
    sys     0m30.717s

Trace results show where time is being spent

$ head trace.copy1c trace.copy2c
==> trace.copy1c <==
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 60.09    2.541250          25    100008           read
 12.22    0.516799          10     50000           write
  9.62    0.406904           4    100009           open
  5.59    0.236274           2    100013           close
  4.80    0.203114           4     50004         1 lstat
  4.71    0.199211           2    100009           fstat
  2.19    0.092662           2     50000           fadvise64
  0.72    0.030418         608        50           getdents
==> trace.copy2c <==
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 47.86    0.802376           8    100008           read
 13.55    0.227108           5     50000           write
 13.02    0.218312           2    100009           open
  7.36    0.123364           1    100013           close
  6.83    0.114589           1    100009           fstat
  6.31    0.105742           2     50004         1 lstat
  3.38    0.056634           1     50000           fadvise64
  1.62    0.027191         544        50           getdents

So it seems copying copies is much faster than copying original files, my current guess is that when copied the files get aligned on disk better than when they were originally created making them more efficient to read?

Interestingly 'rsyn' and 'cp' seem to work in opposite ways speedwise, much like Perl and Python!

$ rm -rf 1copy 2copy; sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'; echo "Rsync 1"; /usr/bin/time rsync -a 1 1copy; sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'; echo "Rsync 2"; /usr/bin/time rsync -a 2 2copy
Rsync 1
    3.62user 3.76system 0:13.00elapsed 56%CPU (0avgtext+0avgdata 5072maxresident)k
    1230600inputs+1200000outputs (13major+2684minor)pagefaults 0swaps
Rsync 2
    4.87user 6.52system 5:06.24elapsed 3%CPU (0avgtext+0avgdata 5076maxresident)k
    1231832inputs+1200000outputs (13major+2689minor)pagefaults 0swaps

$ rm -rf 1copy 2copy; sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'; echo "Copy 1"; /usr/bin/time cp -r 1 1copy; sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'; echo "Copy 2"; /usr/bin/time cp -r 2 2copy
Copy 1
    0.48user 6.42system 5:05.30elapsed 2%CPU (0avgtext+0avgdata 1212maxresident)k
    1229432inputs+1200000outputs (6major+415minor)pagefaults 0swaps
Copy 2
    0.33user 4.17system 0:11.13elapsed 40%CPU (0avgtext+0avgdata 1212maxresident)k
    1230416inputs+1200000outputs (6major+414minor)pagefaults 0swaps

Doesn't seem to be a CPU-bound issue. Can you `strace` both to compare how they are reading the file? — nneonneo, Oct 03 '14 at 11:37
@nneonneo I was thinking abut that but there was a lot of data, I've run some tests and included what I hope is the right level of info, only obvious difference I can see is the use of mmap and munmap although not sure that accounts for all the lost time? — Simon, Oct 03 '14 at 12:10
Interesting. I'd like to see how the times compare if the files are full of random bytes (from /dev/urandom) rather than zeroes, to see if it has any impact. AFAIK, your current `dd` command won't create [sparse files](https://wiki.archlinux.org/index.php/sparse_file), but I'm not an expert on that topic. Also, the Python `with` context manager stuff adds some overhead, although I wouldn't expect it to be _that_ big. But I'd still like to see if it affects the times if you use "traditional" file opening and explicit closing rather than using `with`. — PM 2Ring, Oct 03 '14 at 12:16
This is just an incredibly weird problem. There's no CPU usage, there's not any more disk I/O usage. The only difference seems to be the mmap calls, which appear simply to be temporary buffers or something. — nneonneo, Oct 03 '14 at 12:18
@PM2Ring Interesting I shall try the random content after I try losing the 'with' approach. — Simon, Oct 03 '14 at 12:22
@nneonneo Tell me about it been banging my head against the wall with this for 3-4 days now! I just can't believe there would be such a difference in such a simple operation — Simon, Oct 03 '14 at 12:22
I know SFA about Perl, but I gather that the mmap stuff is to do with file buffering, so **maybe** that buffering in Perl is taking advantage of the fact that the files are full of zeroes. And perhaps the Perl interpreter (or the Linux kernel) can buffer large amounts of those zero files into a tiny amount of RAM, which would considerably speed up reading by the Perl script. — PM 2Ring, Oct 03 '14 at 12:24
this is so weird! Have you tried this on another os? Do you tried different HDs? — Hrabal, Oct 03 '14 at 12:45
@Hrabal tried it on 2 setups, physical and virtual, one is using a huge raid 10 array of SAS drives the other my raid 1 SSDs both behave in a similar fashion :-( — Simon, Oct 03 '14 at 12:48
@PM2Ring Oh and losing the 'with' syntax made no difference either — Simon, Oct 03 '14 at 13:32
I ran these on my SSD and got times of 4.7s, 8.8s and 8.2s for Perl, CPython 2 and PyPy respectively. Can you try PyPy to see whether it's an implementation bug? — Veedrac, Oct 03 '14 at 13:40
The only difference I see is the opening mode of the files. In perl you use the read mode ('<') in Python you don't specify and I think default is rW... I tried here at work on my early 00's (sigh) WinXP (sigh sigh) machine the same python code and a variation with read only mode specified over 1000 10kb .txt files..: `open(file)` -> 2.852 s exec time, `open(file,'r')` -> 0.680 s exec time... difference is huge! — Hrabal, Oct 03 '14 at 13:50
@Hrabal updated the open to open(file, 'r+b') but made no noticeable difference — Simon, Oct 03 '14 at 14:04
@Veedrac installed PyPy (PyPy 2.2.1 with GCC 4.8.2) and ran the python script again but made no real difference 5m7s vs 38s for Perl against the random 50000 10kb files — Simon, Oct 03 '14 at 14:06
nevermind, run the scripts again and had completely different results.. guess my work laptop is just too old for perofrmance metering :) — Hrabal, Oct 03 '14 at 14:12
@Hrabal, text mode, binary mode make different only in windows. (assuming Python 2.x) — falsetru, Oct 03 '14 at 14:36
@hrabal I ran with 'r' only just for your sanity (I'll try anything at this point) and made no difference. One of the reasons I fired up a brand new VM on my VM server and shutdown all other VMs was to eliminate any contamination of the results — Simon, Oct 03 '14 at 14:40
I've tried to [run the test](https://gist.github.com/429bf0249b4467574da7) and the result is 19 vs 23 seconds on an ssd disk (25M/s) and 0.8 vs 1 for in memory fs. I have the same perl, python versions as yours. — jfs, Oct 03 '14 at 16:13
@Simon: have you tried to run [measure.sh](https://gist.github.com/zed/429bf0249b4467574da7#file-measure-sh)? I see **no** significant difference between perl and python on Ubuntu 14.04 desktop (I've tested both zero-filled files (yours dd command) and /dev/urandom files ([dd command from the gist](https://gist.github.com/zed/429bf0249b4467574da7#file-generate-input-sh))). — jfs, Oct 03 '14 at 17:11
@J.F.Sebastian I see what you did there with measure.sh, nice, that's pretty much exactly what I'm doing, I'm off to fire up an Azure VM now and will run it and see. *finally realised why J F Sebastian sounded so familiar :-)* — Simon, Oct 03 '14 at 19:43
@J.F.Sebastian Updated with data from tests using your script — Simon, Oct 03 '14 at 21:50
While I understand the concern with performance, what is the actual question here? — Tymoteusz Paul, Oct 03 '14 at 23:12
@Puciek The question is on the production server why is Python so slow compared to Perl when reading a file and why are results not consistent. I plan to do a lot of work on files of this size in volumes of 40M+ and I need to know if I'm going to have to give up on Python and revert to Perl because every time I run a Python script the reading of the files takes orders of magnitude longer. I want to use Python more in day to day work but I don't want to always be wondering if my script is simply slow and my work delayed because of an anomaly I never got to the bottom of. Hope that makes sense? — Simon, Oct 03 '14 at 23:20
@Simon: that doesn't really seem like a question to be asked here. Primarily because as it is you are using it properly and it simply is that python implementation for that use is slower than perl. It may be worthy submitting to python bug tracker though, because such loss in performance is quite major. — Tymoteusz Paul, Oct 03 '14 at 23:25
@Puciek The evidence suggests there is something odd going on with the combination of Python, Perl and Ubuntu on different hardware platforms making it a bit more than just a Python bug? I'm clearly not doing anything stupid in terms of code in either Perl or Python but maybe there is some low level thing that is related to either Ubuntu or even disk configurations that might be causing this and the help here so far has been very useful in trying to narrow down what the issue may be. If I find the answer I'd like to think it being here will help someone in the future. — Simon, Oct 03 '14 at 23:34
Could you measure the timings for 100, 1000, 10000? It would be nice to know whether what you are measuring is linear w.r.t. the number of files read. — Fred Mitchell, Oct 04 '14 at 02:35
I am also wondering how glob.glob and glob.iglob compare in this measurement. — Fred Mitchell, Oct 04 '14 at 03:03
@Simon try hardcoded list of files to see whether it's just `open()`s performance or `iglob`. Also trying different version of python could by usefull. Also check (for example using `htop`) whether the processes execute with the same priority/nice levels. — Vyktor, Oct 04 '14 at 09:52
@Puciek: [measurements on my machine](http://stackoverflow.com/questions/26178038/python-slow-read-performance-issue#comment41052983_26178038) show no significant difference between Perl and Python here. The results in the question show that the roles may reverse depending on how the tests are run i.e., Perl might execute much slower in some cases. The question definitely belongs here. — jfs, Oct 04 '14 at 11:13
@J.F.Sebastian Still running tests but the one thing that is consistent across all servers and all tests is that Perl slurp always has around 2K minor page faults but Python slurp always has around 50k minor page faults. No matter the actual speed they run the numbers are always those orders of magnitude. Read http://unixhelp.ed.ac.uk/CGI/man-cgi?time but not sure if this is significant? — Simon, Oct 04 '14 at 12:55
@J.F.Sebastian See update 4. Some rather interesting results thanks to your test and repeated runs using different orders I can now highlight the issue with nothing but a 'cp' command. Going to do more investigation when I can think of what to try next. Also confirmed this by running my original Perl/Python scripts on 5M XML files where I did a 'cp -r' of the top level and ran both scripts on both directories, clearing the cache between each run. Performance for the Perl/Python scripts followed characteristics described in the bullet list in my update. — Simon, Oct 04 '14 at 16:42
BTW: I recently learned that [strace 4.9 can summarise wall-clock times with -wc](http://stackoverflow.com/a/26195382/1026), I think using it would make strace results much more readable. — Nickolay, Oct 04 '14 at 17:40
Interesting rsync works in the opposite way to cp in terms of performance, updated post. — Simon, Oct 04 '14 at 20:55
@FredMitchell Sorry finally got round to testing with different numbers, problem is the same from 50K-25k-12.5k-5k-2.5k-1k the gap between performance drops off as the volume count gets lower but it's still clearly visible — Simon, Oct 04 '14 at 22:49
I expected the differences to drop. I did not even care if you reran the Perl. I was just wondering if the elapsed time was linearly related to the number of files that you read - to eliminate effects like what glob.iglob might be doing as well as some strange garbage collection. For example, when the iglob iterates, does it have to access the file system each time - after all, many things might have happened between when you got the first one and executed next (). — Fred Mitchell, Oct 05 '14 at 00:27

score 25 · Accepted Answer · edited Oct 13 '14 at 08:16

I will focus on only one of your examples, because rest things should be analogical:

What I think, may matter in this situation is Read-Ahead (or maybe another technique related to this) feature:

Let consider such example:

I have created 1000 xml files in "1" dir (names 1.xml to 1000.xml) as you did by dd command and then I copied orginal dir 1 to dir 2

$ mkdir 1
$ cd 1
$ for i in {1..1000}; do dd if=/dev/urandom of=$i.xml bs=1K count=10; done
$ cd ..
$ cp -r 1 2
$ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time strace -f -c -o trace.copy2c cp -r 2 2copy
$ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time strace -f -c -o trace.copy1c cp -r 1 1copy

In the next step I debugged cp command (by strace) to found out in what order data are copied:

So cp does it in following order (only first 4 files, because I saw that the second read from original directory is more time consuming that second read from copied directory)

100.xml 150.xml 58.xml 64.xml ... * in my example

Now, take a look on filesystem blocks which are used by these files (debugfs output - ext3 fs):

Original directory:

BLOCKS:
(0-9):63038-63047 100.xml
(0-9):64091-64100 150.xml
(0-9):57926-57935 58.xml
(0-9):60959-60968 64.xml
....


Copied directory:
BLOCKS:
(0-9):65791-65800 100.xml
(0-9):65801-65810 150.xml
(0-9):65811-65820 58.xml
(0-9):65821-65830 64.xml

....

As you can see, in the "Copied directory" the block are adjacent, so it means that during reading of the first file 100.xml the "Read Ahead" technique (controller or system settings) can increase performance.

dd create file in order 1.xml to 1000.xml, but cp command copies it in another order (100.xml, 150.xml, 58.xml,64.xml). So when you execute:

cp -r 1 1copy

to copy this dir to another, the blocks of files which you are copied are not adjacent, so read of such files take more time.

When you copy dir which you copied by cp command (so files are not created by dd command), then file are adjacent so creating:

cp -r 2 2copy

copy of the copy is faster.

Summary: So to test performance python/perl you should use the same dir (or two dirs copied by cp command) and also you can use option O_DIRECT to read bypassing all kernel buffers and read data from disk directly.

Please remember, that results can be different on different type of kernel, system, disk controller, system settings, fs and so on.

Additions:

 [debugfs] 
[root@dhcppc3 test]# debugfs /dev/sda1 
debugfs 1.39 (29-May-2006)
debugfs:  cd test
debugfs:  stat test.xml
Inode: 24102   Type: regular    Mode:  0644   Flags: 0x0   Generation: 3385884179
User:     0   Group:     0   Size: 4
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 2
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x543274bf -- Mon Oct  6 06:53:51 2014
atime: 0x543274be -- Mon Oct  6 06:53:50 2014
mtime: 0x543274bf -- Mon Oct  6 06:53:51 2014
BLOCKS:
(0):29935
TOTAL: 1

debugfs:

That's great info, I was starting to come to a similar conclusion in terms of the way the data is stored on disk between "original" files and copies. I'm going to have a play and see if this potentially boils down to the different order Perl and Python read from a directory tree by default. As I know even if I test both python and perl on the same directory the results will differ regardless of whether it's original or copied. debugfs is useful too that's a new tool for me :-) — Simon, Oct 06 '14 at 08:34
Yes:), then you can play around with: How Python/Perl read files (in which order) and how files are placed on the disk. — RaFD, Oct 06 '14 at 08:36
How did you get that BLOCKS output, I'm struggling to get similar from debugfs have found blocks command but can't get a nice output like yours above. — Simon, Oct 06 '14 at 10:44
So I fiddled with sort orders and considered the order files are created and ended up with the following on the 50K of files. Perl (default order) 13s, Python (default order) 4m 48s, Python (natural sort, so 1.xml, 2.xml) 11s. So clearly it's hugely important to consider the creation order and the order you read the files in, I never realised it would make as big a difference as this though so huge thanks for all the help on this thread. — Simon, Oct 06 '14 at 10:48
I added in my post how to debug by debugfs (section Additions) — RaFD, Oct 06 '14 at 10:56
Ah great thanks I see the BLOCKS is part of another command, cheers. — Simon, Oct 06 '14 at 11:12
Great catch! There was a hint in the performance of the programs reading from two different directories. I had noticed that, but was only thinking about things like really fragmented disks. That did not seem to be likely. — Fred Mitchell, Oct 08 '14 at 03:07

Python slow read performance issue

1 Answers1

Linked