0

If I want to pipe bytes of data into a C/C++ program on Linux like this:

cat my_file | ./my_app

but:

  1. We cannot assume the piped data is going to originate from a file
  2. We wish to interpret the data as raw bytes (as opposed to strings)

what would be the fastest technique to read the pipe from the C/C++ application?

I have done a little research and found:

  • read()
  • std::cin.read()
  • popen()

but I am not sure if there is a better way, or which of the above would be better.

EDIT: There is a performance requirement on this, which is why I am asking for the technique with the smallest overhead.

user997112
  • Note `popen()` doesn't read anything from the pipe, but just gives you the necessary file descriptors you can use to call `read()`. – πάντα ῥεῖ Feb 21 '14 at 19:02
  • Not `popen()`. In C, use `read()`; in C++, probably `read()` again, but using `std::cin` would also work. – Jonathan Leffler Feb 21 '14 at 19:03
  • And why are you bothering about performance? Using the `|` in the shell and reading from `std::cin` would be fairly OK, flexible and robust. – πάντα ῥεῖ Feb 21 '14 at 19:04
  • FYI, `cat my_file | ./my_app` is slower than `./my_app – Charles Duffy Feb 21 '14 at 19:07
  • @CharlesDuffy simply my misunderstanding that there were two ways and one is faster... – user997112 Feb 21 '14 at 19:08
  • Also, minimizing overhead and maximizing throughput are completely different goals. Please be specific about which one you care about. – Charles Duffy Feb 21 '14 at 19:10
  • If you *do* get a proper file descriptor on stdin, you might actually be better off mmap'ing it. If you *don't* get a real file descriptor, and your real goal is to minimize overhead, you'd use the `read()` syscall directly. If you *don't* get a real file descriptor, and your real goal is to maximize throughput, then you'll likely do well to use the buffered calls exposed by the standard library. – Charles Duffy Feb 21 '14 at 19:13
  • @CharlesDuffy what determines whether I get a proper file descriptor? – user997112 Feb 21 '14 at 19:15
  • I meant a _seekable_ file descriptor. Which is to say, not attached to a pipeline or socket but directly to a file. – Charles Duffy Feb 21 '14 at 19:16
  • @CharlesDuffy It sounds like read() would be better then. Could you suggest this in an answer and I will accept – user997112 Feb 21 '14 at 19:20
  • **What is your application?** Why do you care that much about performance? Please edit the question to improve it. – Basile Starynkevitch Feb 21 '14 at 19:56
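The seekable-versus-pipe distinction raised in the comments above can be checked at run time. The following is a minimal sketch, assuming a POSIX system; `is_seekable` is a hypothetical helper name, not something from the thread. It relies on `lseek()` failing (with `ESPIPE`) on pipes, FIFOs and sockets.

```cpp
#include <unistd.h>     // lseek(), SEEK_CUR, STDIN_FILENO (POSIX)

// Hypothetical helper: returns true if fd supports seeking (for example a
// regular file redirected with `<`), false if lseek() fails, as it does
// with ESPIPE for pipes, FIFOs and sockets.
bool is_seekable(int fd) {
    // Asking for the current offset does not move the file position.
    return lseek(fd, 0, SEEK_CUR) != static_cast<off_t>(-1);
}
```

A program could call `is_seekable(STDIN_FILENO)` at startup and only consider `mmap` when it returns true, falling back to `read()` or buffered I/O otherwise.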

2 Answers

4

Why do you care that much about performance?

1 gigabyte from /dev/urandom can be piped into wc in 1 minute (and wc is running only 15% of that time; the rest is spent waiting for data)! Just try time (head -1000000000c /dev/urandom|wc)

But the fastest way would be to use the read(2) syscall with a fairly large buffer (e.g. 64 KiB to 256 KiB).
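As an illustration of that approach, here is a minimal sketch of a `read(2)` loop with a large buffer. The helper name `drain_fd` and the 128 KiB default are assumptions made for this example; the right buffer size is something to benchmark. The loop ends when `read()` returns 0, which is how end of input is detected without any `eof()` check.

```cpp
#include <cerrno>
#include <cstddef>
#include <vector>
#include <unistd.h>   // read(), STDIN_FILENO (POSIX)

// Hypothetical helper: reads fd until EOF using one large buffer.
// Returns the total number of bytes read, or -1 on a read error.
long long drain_fd(int fd, std::size_t buf_size = 128 * 1024) {
    std::vector<unsigned char> buf(buf_size);
    long long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf.data(), buf.size());
        if (n > 0) {
            // Process buf[0 .. n) here; this sketch only counts bytes.
            // A short read (n < buf.size()) is normal on a pipe.
            total += n;
        } else if (n == 0) {
            return total;   // EOF: every writer has closed the pipe
        } else if (errno != EINTR) {
            return -1;      // real error; errno holds the reason
        }
        // n < 0 with EINTR: interrupted by a signal, so retry
    }
}
```

Calling `drain_fd(STDIN_FILENO)` from `main` would consume everything piped into the program. Do not mix this with `std::cin` or `stdin` calls on the same descriptor.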

Of course, read Advanced Linux Programming, and carefully read the man pages related to syscalls(2).

Study for inspiration the source code of the Linux kernel, of GNU libc, of musl-libc. They all are open source projects, so feel free to contribute to them and to improve them.

But I bet that in practice using popen, or stdin, or reading from std::cin won't add much overhead.

You could also increase the stdio buffer with setvbuf(3).

See also this question.

(If you read from stdin the file descriptor is STDIN_FILENO which is 0)

You might be interested in time(7), vdso(7), syscalls(2)

You certainly should read documentation of GCC and this draft report.

You could use machine learning techniques to optimize performance.

Look into the MILEPOST GCC and Ctuning projects. Consider joining the RefPerSys one. Read of course Understanding machine learning: From theory to algorithms ISBN 978-1-107-05713-5

Basile Starynkevitch
  • May I ask, how would I get the file descriptor for the second parameter? – user997112 Feb 21 '14 at 19:22
  • Sorry one last question- I am about to replace my std::cin.read() with the system call read() you suggested. However, I also have a while loop checking for !std::cin.eof() so that I grab all bytes. To see the full benefits of the read() system call is there a way I can replace the check for while(!std::cin.eof()) ? – user997112 Feb 21 '14 at 19:29
  • I don't understand why you care that much about performance. But please read the man page of `read(2)`. **What is your application?** Please edit your question to tell much more about it. – Basile Starynkevitch Feb 21 '14 at 19:42
  • @user997112, don't mix buffered and unbuffered calls regarding the same file descriptor -- if you're using `read()` (which is unbuffered), you shouldn't be using **any** `std::cin` calls. – Charles Duffy Feb 21 '14 at 20:36
2

When you pipe data in like that, the piped input is the standard input. Just read from cin (or stdin) like a normal console program.

Just use std::cin.read(). There's no reason to deal with popen() or its ilk.


Just to clarify... there is no pipe-specific way to read the input. As far as your program is concerned, there's cin and that's it.

This question might help you out on the speed front though... Why is reading lines from stdin much slower in C++ than Python?

QuestionC
  • This has to be written with high performance in mind. That's why I am asking: what is the method with the smallest overhead? – user997112 Feb 21 '14 at 19:08
  • Minimizing overhead is very situationally dependent -- sometimes standard-library-provided buffering helps things, sometimes it hurts. I don't know that you could get a generic answer that's always going to be right; you'd probably be better off benchmarking. – Charles Duffy Feb 21 '14 at 19:09
  • Besides using std::cin::read() what other possibilities do I have to benchmark against? – user997112 Feb 21 '14 at 19:13
  • Well, `std::cin::read()` is the C++ way. There's also the C standard-library calls, and direct invocation of your OS syscalls. And your operating system and standard library will typically provide a lot of flags for tuning. – Charles Duffy Feb 21 '14 at 19:15
  • But unless you've already measured enough to know you have a problem, and you know it's the read process that's causing the problem, why are you here to start with? :) – Charles Duffy Feb 21 '14 at 19:15