
I'm working on optimizing a Python script that needs to parse a huge (12 TB) amount of data. At the moment, it basically looks like:

gzip -d -c big_file.gz | sed -n '/regex|of|interesting|things/p' | script.py

(actually, the piping is being done by subprocess.Popen, but I don't think that's important -- correct me if I'm wrong.)
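Concretely, the subprocess setup looks roughly like this (simplified, with a placeholder file name and regex):

import csv
import subprocess

input_file = 'big_file.gz'  # placeholder name

# gzip decompresses to stdout, sed filters, and the Python code reads sed's output.
unzip = subprocess.Popen(['gzip', '-d', '-c', input_file],
                         stdout=subprocess.PIPE)
sed = subprocess.Popen(['sed', '-n', '/regex|of|interesting|things/p'],
                       stdin=unzip.stdout, stdout=subprocess.PIPE)
unzip.stdout.close()  # so gzip sees SIGPIPE if sed exits early

for fields in csv.reader(iter(sed.stdout.readline, '')):
    pass  # the actual parsing happens here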

It appears that the gzip->sed->python pipes are currently the most time-consuming part of the script. I assume this is because there are three separate processes in play here: since none of them can share an address space, any information that needs to be passed between them has to actually be copied from one to the other, so the two pipes mean that roughly 36 TB in total gets pushed through my RAM rather than just 12.

Am I understanding correctly what's going on?

Patrick Collins
  • There aren't really any answers here, but you may want to look at [this question](http://stackoverflow.com/questions/24199026/how-to-speed-up-communication-with-subprocesses). The OP noticed that using `subprocess.Popen` to pipe data was much slower than using Linux pipes (`|`), so it may actually be important if you're using the former. – dano Jul 17 '14 at 15:34
  • `gzip big_file.gz` ? Shouldn't it be `gunzip` or `gzip -d` ? – bruno desthuilliers Jul 17 '14 at 15:34
  • @brunodesthuilliers Yes, you're right, I've edited to fix. – Patrick Collins Jul 17 '14 at 15:35
  • @dano I'm really wondering about the total amount of data moved through the system when using pipes: based on what I know about multiprocessing, I assume that there is a pretty huge communication cost in play here, but I can't find any references to confirm it. – Patrick Collins Jul 17 '14 at 15:36
  • @PatrickCollins well, at the very least you're reading 12TB (probably less since it's compressed) from disk into `gunzip`, writing 12TB to the stdin of `sed` which then needs to be read, then writing 12TB more to the stdin of `script.py`, which also needs to be read. – dano Jul 17 '14 at 15:39
  • 1
    @PatrickCollins: `gunzip file.gz` doesn't write to stdout, it inflates the file inplace... – bruno desthuilliers Jul 17 '14 at 15:40
  • @brunodesthuilliers The line in question is actually `unzip = subprocess.Popen(['gzip', '-d', '-c', input_file], stdout=subprocess.PIPE)`, then `sed = subprocess.Popen(['sed', '-n', '/big|regex/p'], stdin=unzip.stdout, stdout=subprocess.PIPE)`, then `for fields in csv.reader(iter(sed.stdout.readline, '')): ...`. – Patrick Collins Jul 17 '14 at 15:43

2 Answers


Let's first run a little test:

time dd if=/dev/zero of=/dev/null bs=2M count=5000
time dd if=/dev/zero bs=2M count=5000 > /dev/null
time dd if=/dev/zero bs=2M count=5000 | cat > /dev/null
time dd if=/dev/zero bs=2M count=5000 | cat | cat | cat | cat | cat > /dev/null

Results:

5000+0 records in
5000+0 records out
10485760000 bytes (10 GB) copied, 0.651287 s, 16.1 GB/s

real    0m0.653s
user    0m0.000s
sys     0m0.650s
5000+0 records in
5000+0 records out
10485760000 bytes (10 GB) copied, 0.585887 s, 17.9 GB/s

real    0m0.587s
user    0m0.007s
sys     0m0.580s
5000+0 records in
5000+0 records out
10485760000 bytes (10 GB) copied, 8.55412 s, 1.2 GB/s

real    0m8.556s
user    0m0.077s
sys     0m9.267s
5000+0 records in
5000+0 records out
10485760000 bytes (10 GB) copied, 9.69067 s, 1.1 GB/s

real    0m9.692s
user    0m0.397s
sys     0m25.617s

Adding a single pipe decreases throughput by more than an order of magnitude; adding several more pipes decreases it only slightly further. The results are consistent across multiple runs.

I'd need to investigate the why further when I have more time, but my guess is that the `cat` processes read the data with a small buffer, so the `dd` process ends up writing more slowly.
There's a program called `bfr` which aims to solve this; I have never tried it, and its last update is from 2004...

You could also try implementing the gzip decompression and the string/regex filtering in Python itself, avoiding the pipes entirely. It's difficult to predict whether the performance gain will be worth the time, though...
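A rough sketch of what that might look like, using the standard `gzip` and `re` modules (the pattern and file name here are just placeholders):

import gzip
import re

pattern = re.compile(br'regex|of|interesting|things')  # placeholder pattern

# Decompress and filter in one process instead of piping through sed.
with gzip.open('big_file.gz', 'rb') as f:
    for line in f:
        if pattern.search(line):
            pass  # hand the matching line to the parsing code here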

Martin Tournoij
  • Interesting. The pipe adds a 4KB buffer between the processes and `gzip` needs to have an output buffer. It's odd that `cat` causes such a huge slowdown (one order of magnitude). – Aaron Digulla Jul 17 '14 at 15:48
  • @AaronDigulla Yeah, it's larger than I expected, but it's well known that block size can have a *huge* effect on performance. Here's a related example I created: http://tmp.arp242.net/bs.txt – Martin Tournoij Jul 17 '14 at 15:52
  • Perhaps messing with the [`bufsize`](https://docs.python.org/2/library/subprocess.html#subprocess.Popen) in the `Popen` constructor will help the OP here? By default it's unbuffered. – dano Jul 17 '14 at 15:53

The pipes probably aren't your problem. Modern PCs can copy memory at rates of around 70 GB/s.

If you want to know how much time the first stage takes, run:

 time gunzip -c big_file.gz | sed -n '/regex|of|interesting|things/p' > /dev/null

That unpacks and filters the data, and tells you how long those two stages take on their own.

My guess is that the poor Python script simply gets too much data; processing huge amounts of data in Python takes time.

[EDIT] I just noticed something: The Python docs say:

bufsize, if given, has the same meaning as the corresponding argument to the built-in open() function: 0 means unbuffered [...] The default value for bufsize is 0 (unbuffered).

Try with bufsize=-1 or bufsize=102400 when you create the pipes.
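Something like this, for example (a sketch based on the pipeline from the comments; the file name and regex are placeholders):

import subprocess

# bufsize=-1 asks for the system default (fully buffered) file objects
# instead of Python 2's default of 0, which is unbuffered.
unzip = subprocess.Popen(['gzip', '-d', '-c', 'big_file.gz'],
                         stdout=subprocess.PIPE, bufsize=-1)
sed = subprocess.Popen(['sed', '-n', '/regex|of|interesting|things/p'],
                       stdin=unzip.stdout, stdout=subprocess.PIPE, bufsize=-1)

Note that bufsize only affects the file objects on the Python side (here, the object you read sed's output from), not the kernel pipe between gzip and sed, but the readline loop is exactly where the Python side spends its time.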

Lesson to take from this: Buffered pipes are fast, unbuffered pipes are slow.

Aaron Digulla
  • I'm trying to convince my boss to let me port it to Java for performance. I assumed the same thing, but I see that the function with the calls to `sed.stdout.readline` (which does very little parsing) is taking up 60% of the run time -- I can't think of why this would be the case unless `subprocess.PIPE` was causing the slowdown. – Patrick Collins Jul 17 '14 at 15:42
  • I'm 90% sure your assumption is wrong. Use the `time` command and a profiler to see which parts of the pipeline take how much time. Also note that running this with three processes means each process can run in parallel ([which isn't true for Python](https://wiki.python.org/moin/GlobalInterpreterLock)). – Aaron Digulla Jul 17 '14 at 15:46
  • I'm talking about output from `cProfile` when I talk about runtime. – Patrick Collins Jul 17 '14 at 15:46
  • @AaronDigulla Actually, all three run in parallel in Python, too. He's not using threads here, he's using the `subprocess` module. – dano Jul 17 '14 at 15:47
  • @dano: I meant the code in the Python script which reads the result can't run in parallel. The processes always run in parallel :-) – Aaron Digulla Jul 17 '14 at 15:49
  • Well, it's simple to find out: create a subset of the test data, unpack it somewhere, and then feed Python the file directly (i.e. avoid the pipes) and measure how long that takes. – Aaron Digulla Jul 17 '14 at 15:50
  • @PatrickCollins: I just saw that the default for pipes in Python is "unbuffered" which is very, very wrong. See my edits. – Aaron Digulla Jul 17 '14 at 15:58
  • @AaronDigulla I'll do another run now with `bufsize` set, and I'll report back with the results (the last run took about an hour, so it will be a while). – Patrick Collins Jul 17 '14 at 16:01
  • @PatrickCollins: For the performance tests, create a subset of data which you can process in 2-3 minutes. – Aaron Digulla Jul 17 '14 at 16:03
  • 1
    @AaronDigulla That cut down runtime by about 20 minutes, and the bottleneck is now inside of Python parsing code (as it should be). – Patrick Collins Jul 17 '14 at 16:49