
This program prints 65k bytes per line.

I measure the throughput with `./a.out | pv >/dev/null` and get around 3 GB/s.

As soon as I change the line length to 70k, the throughput drops to ~1 GB/s.

Which bottleneck (CPU cache, libc idiosyncrasy, etc.) am I hitting here?

#include <stdio.h>
#include <string.h>

#define LEN 65000     // high throughput
// #define LEN 70000  // low throughput

int main ()
{
  char s[LEN]; memset(s, 'a', LEN-1); s[LEN-1] = '\0';

  while (1)
    printf ("%s\n", s);
}

Update: I'm running this on Ubuntu 12.04 64-bit, which has EGLIBC 2.15, on a Core i5-2520M.

Update: puts (s) has the same problem.
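
For reference, a variant that takes stdio out of the picture entirely by calling write(2) directly would look like the sketch below (a test sketch of mine, assuming ordinary POSIX pipe semantics; the inner loop guards against short writes). If the 65k-vs-70k gap shows up here too, libc buffering can be ruled out as the cause.

#include <string.h>
#include <unistd.h>

#define LEN 65000                     /* switch to 70000 for the slow case */

static char s[LEN];

int main(void)
{
  memset(s, 'a', LEN - 1);
  s[LEN - 1] = '\n';                  /* newline replaces printf's "%s\n" */

  for (;;) {
    ssize_t left = LEN;
    ssize_t off  = 0;
    while (left > 0) {                /* write() on a pipe may return short */
      ssize_t n = write(STDOUT_FILENO, s + off, (size_t)left);
      if (n < 0)
        return 1;                     /* e.g. EPIPE once the reader exits */
      off  += n;
      left -= n;
    }
  }
}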

nh2
  • Also significant - what libc are you using, and what version? –  Aug 07 '12 at 19:07
  • Probably internal buffer of libc, especially inside `printf`. How about `puts`? – Todd Li Aug 07 '12 at 19:07
  • I'm not sure if this is useful information, but `65000` fits in two bytes, whereas `70000` needs three. Maybe that could be a jumping off point for someone more knowledgeable than me? – Gordon Bailey Aug 07 '12 at 19:08
  • You need to run your program under the profiler to answer this question. – rob mayoff Aug 07 '12 at 19:12
  • Is the bottleneck in your a.out, or is it in pv? – Eric Postpischil Aug 07 '12 at 19:12
  • better question: why are you printing 64k at a time? – tbert Aug 07 '12 at 19:16
  • @duskwuff I added libc version and machine specs. – nh2 Aug 07 '12 at 19:20
  • @Gordon Bailey I recognize the 16 bit boundary, but I am not aware of a cache of this size in my system. – nh2 Aug 07 '12 at 19:22
  • There's probably a 64k buffer within libc that's used for stdio output. A 65,000 byte print fits within the buffer (and goes out in a single write call), whereas a 70,000 byte print requires two writes. –  Aug 07 '12 at 19:25
  • @duskwuff That's what I thought. However I tested it and at least with my libc, it uses 2 writes for everything larger than `4096`. – cnicutar Aug 07 '12 at 19:27
  • @forcey Good idea, but puts has the same phenomenon. I'll update the question about this. – nh2 Aug 07 '12 at 19:32
  • FWIW, I can repro this on GCC 4.4.3 x86. – Mysticial Aug 07 '12 at 19:36
  • I just tried this on a different machine (x64) which has a significantly larger L2 cache size. The difference between them is much smaller... (2.13 GBs / 1.72 GB/s) – Mysticial Aug 07 '12 at 19:57
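
The comments above suggest a ~64 KB stdio buffer inside libc as one possible culprit. One way to test that hypothesis (a sketch of my own, not something posted in the thread) is to hand stdout an explicitly larger buffer with setvbuf before the first output: if a libc-internal buffer were the limit, the 70000-byte case should recover; if the limit is on the kernel side, the effect may well persist.

#include <stdio.h>
#include <string.h>

#define LEN 70000                    /* the "slow" line length */

static char line[LEN];
static char iobuf[256 * 1024];       /* stdio buffer, well above LEN */

int main(void)
{
  /* Must be done before the first output operation on stdout. */
  if (setvbuf(stdout, iobuf, _IOFBF, sizeof iobuf) != 0)
    return 1;

  memset(line, 'a', LEN - 1);
  line[LEN - 1] = '\0';

  for (;;)
    printf("%s\n", line);
}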

1 Answer


You are suffering from under-utilization of the kernel I/O buffer (the pipe buffer) in your data transfer. If we assume the kernel I/O buffer is 64 KB, then a 70000-byte write blocks after 64 KB has been written. Once the buffer is drained, the remaining 4 KB and change is written into it. pv therefore ends up doing two reads for each 70000 bytes transferred, resulting in roughly half your normal throughput due to poor buffer utilization. The stall while the write is blocked probably accounts for the rest.
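
On Linux (2.6.35 and later) you don't have to guess the pipe buffer size: fcntl exposes it via F_GETPIPE_SZ. A minimal, Linux-specific sketch of mine (the default capacity on a stock kernel is typically 16 pages, i.e. 64 KiB with 4 KiB pages, which matches the assumption above):

#define _GNU_SOURCE                  /* F_GETPIPE_SZ needs this with glibc */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
  int size = fcntl(STDOUT_FILENO, F_GETPIPE_SZ);
  if (size < 0) {
    perror("F_GETPIPE_SZ");          /* stdout not a pipe, or kernel too old */
    return 1;
  }
  fprintf(stderr, "pipe buffer: %d bytes\n", size);
  return 0;
}

Run the compiled binary as, say, `./pipesize | cat` so that stdout really is a pipe; the size is reported on stderr.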

You can give pv a smaller transfer buffer size (its `-B` option), which increases your throughput by raising the average number of bytes transferred per time slice: writes become more efficient on average, and the read buffers stay full.

$ ./a.out | pv -B 70000 > /dev/null
9.25GB 0:00:09 [1.01GB/s] [        <=>                                        ]

$ ./a.out | pv -B 30k > /dev/null
9.01GB 0:00:05 [1.83GB/s] [    <=>                                            ]

Edit: Three more runs (2.7GHz core i7)

$ ./a.out | pv -B 16k > /dev/null
  15GB 0:00:08 [1.95GB/s] [       <=>                                         ]

$ ./a.out | pv -B 16k > /dev/null
 9.3GB 0:00:05 [1.85GB/s] [    <=>                                            ]

$ ./a.out | pv -B 16k > /dev/null
19.2GB 0:00:11 [1.82GB/s] [          <=>                                      ]
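
A different experiment along the same lines (not part of the analysis above, Linux 2.6.35+ only): instead of shrinking pv's buffer, enlarge the pipe itself with F_SETPIPE_SZ so that a whole 70000-byte line fits into one kernel transfer. Whether this actually recovers the 65000-byte throughput is something to measure, not something I can promise; a hedged sketch of the producer side:

#define _GNU_SOURCE                  /* F_SETPIPE_SZ needs this with glibc */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define LEN 70000

static char s[LEN];

int main(void)
{
  /* Ask the kernel for a pipe buffer of at least 2*LEN bytes; the kernel
   * rounds this up and caps it at /proc/sys/fs/pipe-max-size. */
  if (fcntl(STDOUT_FILENO, F_SETPIPE_SZ, 2 * LEN) < 0)
    perror("F_SETPIPE_SZ");          /* non-fatal: keep the default size */

  memset(s, 'a', LEN - 1);
  s[LEN - 1] = '\0';

  for (;;)
    printf("%s\n", s);
}
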
jxh
  • The difference in block size seems to be negligible on my machine. – Mysticial Aug 07 '12 at 19:38
  • I tried 30k, 80k, and 500k. All within 15% of each other. The `65000` and `70000` in the code makes a 2x difference though. – Mysticial Aug 07 '12 at 19:44
  • @Mysticial: Bigger won't improve. Try 4k, 8k, and 16k. – jxh Aug 07 '12 at 19:45
  • @user315052: Please try three more times and let us know if the average holds. – perilbrain Aug 07 '12 at 19:46
  • @user315052 Going to 16k significantly hurts it. But no matter what block size I choose, it never gets even half the speed of `LEN = 65000`. – Mysticial Aug 07 '12 at 19:47
  • Nice explanation. Where can I look up what my kernel I/O buffer size is? – nh2 Aug 07 '12 at 19:48
  • @nh2: There is [this question](http://stackoverflow.com/questions/4624071/pipe-buffer-size-is-4k-or-64k) that found the limit experimentally. – jxh Aug 07 '12 at 19:58
  • I just tried this on a different machine with a massive L2 cache. The difference (with default block sizes) is much less. While I don't want to jump to conclusions, I don't want to entirely rule out cache effects. – Mysticial Aug 07 '12 at 20:00
  • @Mysticial Concerning caches, my processor has 128k L1 cache, which would fit for 64k source and 64k target on a memory copy, but this might be negligible compared to the IO buffer. – nh2 Aug 07 '12 at 20:04
  • @nh2 It's actually only 32k *per core*. But you're right that both test cases fit comfortably into the L2 caches. – Mysticial Aug 07 '12 at 20:10
  • @nh2 Here's some more numbers to look at: http://pastebin.com/WyEeN5Lp With the exception of the 200k block size, there's basically no difference. (edit: updated link with more detail) – Mysticial Aug 07 '12 at 20:16
  • @Mysticial: Your massive L2 cache greatly reduces the expense of system call and context switching overhead. This accounts for why your observations are different, but it doesn't change the analysis of the OP's problem. I only have a 256KB L2, and 8MB L3 on my machine. Pastebin [here](http://pastebin.com/5tg2WvQj). – jxh Aug 07 '12 at 21:01