29

I have a small C program to calculate hashes (for hash tables). The code looks quite clean I hope, but there's something unrelated to it that's bugging me.

I can easily generate about one million hashes in about 0.2-0.3 seconds (benchmarked with /usr/bin/time). However, when I'm printf()inging them in the for loop, the program slows down to about 5 seconds.

  1. Why is this?
  2. How to make it faster? mmapp()ing stdout maybe?
  3. How is stdlibc designed in regards to this, and how may it be improved?
  4. How could the kernel support it better? How would it need to be modified to make the throughput on local "files" (sockets,pipes,etc) REALLY fast?

I'm looking forward for interesting and detailed replies. Thanks.

PS: this is for a compiler construction toolset, so don't by shy to get into details. While that has nothing to do with the problem itself, I just wanted to point out that details interest me.

Addendum

I'm looking for more programatic approaches for solutions and explanations. Indeed, piping does the job, but I don't have control over what the "user" does.

Of course, I'm doing a testing right now, which wouldn't be done by "normal users". BUT that doesn't change the fact that a simple printf() slows down a process, which is the problem I'm trying to find an optimal programmatic solution for.


Addendum - Astonishing results

The reference time is for plain printf() calls inside a TTY and takes about 4 mins 20 secs.

Testing under a /dev/pts (e.g. Konsole) speeds up the output to about 5 seconds.

It takes about the same amount of time when using setbuffer() in my testing code to a size of 16384, almost the same for 8192: about 6 seconds.

setbuffer() has apparently no effect when using it: it takes the same amount of time (on a TTY about 4 mins, on a PTS about 5 seconds).

The astonishing thing is, if I'm starting the test on TTY1 and then switch to another TTY, it does take just the same as on a PTS: about 5 seconds.

Conclusion: the kernel does something which has to do with accessibility and user friendliness. HUH!

Normally, it should be equally slow no matter if you stare at the TTY while its active, or you switch over to another TTY.


Lesson: when running output-intensive programs, switch to another TTY!

Flavius
  • 13,566
  • 13
  • 80
  • 126
  • 2
    If you redirect the output to /dev/null, how fast is your program then? – Erich Kitzmueller Dec 02 '09 at 12:22
  • @ammoQ: Just as fast as when redirecting to any regular file: about 0.5 seconds. – Flavius Dec 02 '09 at 12:41
  • 1
    It's not a "simple" matter. I/O is generally orders of magnitude slower than straight up CPU computations and bus operations, it should not be that astonishing to realise it. – Michael Foukarakis Dec 02 '09 at 13:33
  • It's astonishing that if you look at the TTY while the process executes and displays something, it will take 4 minutes to execute. If you don't look at the TTY, it takes 5 secs. – Flavius Dec 02 '09 at 14:17
  • Normally, it should be equally slow no matter if you stare at the TTY while its active, or you switch over to another TTY. – Flavius Dec 02 '09 at 14:23
  • Nobody's going to read that much output anyway, so just output every 100th result. :) – UncleBens Dec 02 '09 at 15:09
  • 4
    Flavius: That's because when the TTY is displayed, each new line requires scrolling up the entire screen. Each character cell on screen is mapped to a specific location in the screen buffer - so moving characters around means moving bytes around in the screen buffer. On an 80 column console, that means moving 24 lines upwards is essentially a `memmove` of almost 2k - which is done *for each line you output*. – caf Dec 02 '09 at 23:04
  • +1 for updating with your results. Thanks. – edoloughlin Dec 03 '09 at 09:36
  • @caf: Although I knew that, I didn't realize it. Thanks for the great explanation :) – Flavius Dec 03 '09 at 16:37
  • It really sounds like it the graphics update of your terminal that is responsible for the extra time. – stsquad Dec 04 '09 at 09:05
  • When you run against your TTY it is both an output device AND an Input device. While you are outputting chars it is looking to see if you type a key like CTRL C, or CTRL S/Q to stop the display. This maybe why it takes so long.... or least contribute to the slowness. – AnthonyLambert Nov 06 '12 at 13:01

9 Answers9

35

Unbuffered output is very slow.

By default stdout is fully-buffered, however when attached to terminal, stdout is either unbuffered or line-buffered.

Try to switch on buffering for stdout using setvbuf(), like this:

char buffer[8192];

setvbuf(stdout, buffer, _IOFBF, sizeof(buffer));
qrdl
  • 34,062
  • 14
  • 56
  • 86
15

You could store your strings in a buffer and output them to a file (or console) at the end or periodically, when your buffer is full.

If outputting to a console, scrolling is usually a killer.

edoloughlin
  • 5,821
  • 4
  • 32
  • 61
  • 4
    +1, especially for scrolling. Just imagine all the blitting and bitmap copying involved in scrolling... – sleske Dec 02 '09 at 12:24
  • 2
    Your response made me test the program under a clean TTY, and under a managed PTS of Konsole. The result: Konsole speeds things up quite a bit! It took 4 mins 20 secs when run from TTY (which should be used as the true reference for the test, I think), 5 secs from PTY. – Flavius Dec 02 '09 at 12:27
  • 1
    another +1 for scrolling. Simply running some chatty program in GNU screen (then detach it) would speed things up a lot! – Lester Cheung Jan 07 '11 at 04:03
9

If you are printf()ing to the console it's usually extremely slow. I'm not sure why but I believe it doesn't return until the console graphically shows the outputted string. Additionally you can't mmap() to stdout.

Writing to a file should be much faster (but still orders of magnitude slower than computing a hash, all I/O is slow).

Andreas Bonini
  • 44,018
  • 30
  • 122
  • 156
7

You can try to redirect output in shell from console to a file. Using this, logs with gigabytes in size can be created in just seconds.

corvus
  • 2,330
  • 2
  • 18
  • 16
7
  1. I/O is always slow in comparison to straight computation. The system has to wait for more components to be available in order to use them. It then has to wait for the response before it can carry on. Conversely if it's simply computing, then it's only really moving data between the RAM and CPU registers.

  2. I've not tested this, but it may be quicker to append your hashes onto a string, and then just print the string at the end. Although if you're using C, not C++, this may prove to be a pain!

3 and 4 are beyond me I'm afraid.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
n00dle
  • 5,949
  • 2
  • 35
  • 48
4
  1. Why not create the strings on demand rather that at the point of construction? There is no point in outputting 40 screens of data in one second how can you possibly read it? Why not create the output as required and just display the last screen full and then as required it the user scrolls???

  2. Why not use sprintf to print to a string and then build a concatenated string of all the results in memory and print at the end?

  3. By switching to sprintf you can clearly see how much time is spent in the format conversion and how much is spent displaying the result to the console and change the code appropriately.

  4. Console output is by definition slow, creating a hash is only manipulating a few bytes of memory. Console output needs to go through many layers of the operating system, which will have code to handle thread/process locking etc. once it eventually gets to the display driver which maybe a 9600 baud device! or large bitmap display, simple functions like scrolling the screen may involve manipulating megabytes of memory.

AnthonyLambert
  • 8,768
  • 4
  • 37
  • 72
  • 1
    Regarding (4): I realize that, BUT if I were a OS writer, would it be possible to copy the output from a location to another location/process? If yes, how would I go about it, in your opinion, so that the things speed up? – Flavius Dec 02 '09 at 12:31
  • 1
    In the good old days games programmers used to address the output device directly so for instance actually write the characters into the display memory - Today even they in the most part use libraries to talk to the hardware so they can be device independent and take advantage of hardware acceleration. It is rarely worth circumventing these layers today. – AnthonyLambert Dec 02 '09 at 13:40
4

As I/O is always much slower than CPU computation, you might store all values in fastest possible I/O first. So use RAM if you have enough, use Files if not, but it is much slower than RAM.

Printing out the values can now be done afterwards or in parallel by another thread. So the calculation thread(s) may not need to wait until printf has returned.

Marco
  • 2,773
  • 1
  • 14
  • 8
4

I discovered long ago using this technique something that should have been obvious. Not only is I/O slow, especially to the console, but formatting decimal numbers is not fast either. If you can put the numbers in binary into big buffers, and write those to a file, you'll find it's a lot faster.

Besides, who's going to read them? There's no point printing them all in a human-readable format if nobody needs to read all of them.

Community
  • 1
  • 1
Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
2

I guess the terminal type is using some buffered output operations, so when you do a printf it does not happen to output in split micro-seconds, it is stored in the buffer memory of the terminal subsystem.

This could be impacted by other things that could cause a slow down, perhaps there's a memory intensive operation running on it other than your program. In short there's far too many things that could all be happening at the same time, paging, swapping, heavy i/o by another process, configuration of memory utilized, maybe memory upgrade, and so on.

It might be better to concatenate the strings until a certain limit is reached, then when it is, write it all out at once. Or even using pthreads to carry out the desired process execution.

Edited: As for 2,3 it is beyond me. For 4, I am not familiar with Sun, but do know of and have messed with Solaris, There may be a kernel option to use a virtual tty.. i'll admit its been a while since messing with the kernel configs and recompiling it. As such my memory may not be great on this, have a root around with the options to see.

user@host:/usr/src/linux $ make; make menuconfig **OR kconfig if from X**

This will fire up the kernel menu, have a dig around in to see the video settings section under the devices sub-tree..

Edited: but there's a tweak you put into the kernel by adding a file into the proc filesystem (if a such thing does exist), or possibly a switch passed into the kernel, something like this (this is imaginative and does not imply it actually exists), fastio

Hope this helps, Best regards, Tom.

t0mm13b
  • 34,087
  • 8
  • 78
  • 110
  • Thanks for your answer. It's a linux machine, as you can see in the tags of the question. – Flavius Dec 02 '09 at 12:43
  • @Flavius: Whoops, sorry there for the Sun and solaris bit, I was sure I saw it in there a while ago when I was editing the answer. Must have got mixed up with some other thread here on SO...Apologies – t0mm13b Dec 02 '09 at 13:13