Program that routes stdin to other files/programs based on prefix?

Question

I am looking for a program (or code, probably in C), which I will refer to as `route`, that takes a byte string/stream on stdin and routes it to one of n programs based on the initial few bytes (the prefix). The prefix is not forwarded/piped to the destination program. I imagine it would have a command-line interface like:

echo -n 'for-p1.perform-operation-A' | route 'for-p1.' >(program1) 'for-p2.' >(program2)
echo -n 'for-p2.perform-operation-B' | route 'for-p1.' >(program1) 'for-p2.' >(program2)

In the example above, program1 would receive 'perform-operation-A' (as if I had executed `echo -n 'perform-operation-A' | program1`) and program2 would receive 'perform-operation-B'. On the command line, each prefix is immediately followed by the (virtual) filename of its destination.

Is there any existing solution for doing this or do I have to roll my own?

EDIT: By popular request, here is my 30min attempt at a solution, but I would vastly prefer existing solutions, or at least recommendations for libraries for the steps:

/*
build:
    sudo apt install -y build-essential && g++ main.cpp

usage example:

(
    rm output* || true
    echo -n "foo-hello" | ./a.out 2>/dev/null 'foo-' >(cat | tee -a output-foo) 'bar-' >(cat | tee -a output-bar) 1>/dev/null
    echo -n "bar-world" | ./a.out 2>/dev/null 'foo-' >(cat | tee -a output-foo) 'bar-' >(cat | tee -a output-bar) 1>/dev/null
    echo
    echo -n "output-foo: " ; cat output-foo ; echo
    echo -n "output-bar: " ; cat output-bar ; echo
)
*/

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>  // exit
#include <map>
#include <string>
using namespace std;

typedef FILE* File;

#define hasKey(map, key) (map.find((key)) != map.end())

// I assume this is pretty inefficient, looking for a good solution
void pipeRest(File fromFile, File toFile) {
    size_t bytesRead = 0;
    
    do {
        char data;
        bytesRead = fread(&data, 1, 1, fromFile);
        if (bytesRead) fwrite(&data, 1, 1, toFile);
    } while (bytesRead > 0);
}

int main(
    int const argc,
    char const * const * const argv
) {
    if (argc <= 1) {
        fprintf(stderr, "usage:\n\n\troute prefix1 >(output-program-1) prefix2 >(output-program-2) ...\n\nreads from stdin, recognizes any of n prefixes and pipes the rest to the filename following the prefix\n");
        exit(1);
    }

    // Parse [prefix outputFile] pairs from commandline args
    map<string, string> prefixesToOutputFileNames; 
    map<string, File> prefixesToOutputFiles; 
    for (int i = 0; i < argc; i++) {
        fprintf(stderr, "argv[%d] = '%s'\n", i, argv[i]);

        if (i > 0 && i % 2 == 0) {
            auto const prefix = argv[i - 1];
            fprintf(stderr, "prefix = '%s'\n", prefix);
            auto const outputFileName = argv[i];
            fprintf(stderr, "outputFileName = '%s'\n", outputFileName);
            prefixesToOutputFileNames[prefix] = outputFileName;

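            auto const outputFile = fopen(outputFileName, "wb");
            if (!outputFile) {
                // Fail early instead of writing to a null stream later
                perror(outputFileName);
                exit(1);
            }
            prefixesToOutputFiles[prefix] = outputFile;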
        }
    }

    // Start reading bytes from stdin, collect the prefix
    freopen(0, "rb", stdin); 

    string prefix = "";
    char nextPrefixChar[2] = { 0 };
    size_t bytesRead = 0;
    
    do {
        bytesRead = fread(nextPrefixChar, 1, 1, stdin);
        if (bytesRead == 0) break;  // EOF or read error: do not append a stale byte
        prefix += nextPrefixChar;
        fprintf(stderr, "read %zu bytes = '%s', prefix = '%s'\n", bytesRead, nextPrefixChar, prefix.c_str());

        if (hasKey(prefixesToOutputFiles, prefix)) {
            // Prefix found -> pipe to corresponding output file

            auto const outputFileName = prefixesToOutputFileNames[prefix];
            fprintf(stderr, "prefix '%s' was recognized, will pipe rest to '%s'\n", prefix.c_str(), outputFileName.c_str());

            auto const outputFile = prefixesToOutputFiles[prefix];
            pipeRest(stdin, outputFile);
            exit(0);
        }
    } while (bytesRead > 0);
    
    fprintf(stderr, "input ends and did not recognize any prefix, rest will be piped to stdout\n");
    
    pipeRest(stdin, stdout);

    return 0;
}
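
Regarding the byte-by-byte `pipeRest` above: a common way to cut the per-call overhead is to copy in larger chunks. The following is only a sketch under assumptions, not a measured improvement. The name `copyRest` and the 64 KiB buffer size are hypothetical choices, and it relies solely on standard `fread`/`fwrite`/`fflush`.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical replacement for pipeRest: copies everything that remains on
// `from` to `to` in 64 KiB chunks (the size is an arbitrary assumption).
// Returns true on success, false if a read or write error occurred.
bool copyRest(FILE* from, FILE* to) {
    assert(from != nullptr && to != nullptr);
    static char buffer[1 << 16];
    size_t n;
    while ((n = fread(buffer, 1, sizeof buffer, from)) > 0) {
        // A short write means the destination failed (e.g. closed pipe)
        if (fwrite(buffer, 1, n, to) != n) return false;
    }
    if (ferror(from)) return false;
    // Push buffered output to the destination before reporting success
    return fflush(to) == 0;
}
```

In `main` this would be called as `copyRest(stdin, outputFile)`. One caveat: `fread` returns only once the buffer is full or the input ends, so for interactive or low-latency streams unbuffered `read`/`write` on the file descriptors may be the better fit.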

`routes it to one of n programs` - `The prefix always precedes the (virtual) filename` - so route to filenames or to programs? Routing to files/pipes is simpler. But why not just write it in shell? `How do you copy stdin most efficiently to an output file?` Please one question per post. Have you started writing such C program? What research did you do? What part of that C program are you having problem with? — KamilCuk, Mar 11 '21 at 09:59
Does this answer your question? [Read a file by bytes in BASH](https://stackoverflow.com/questions/13889659/read-a-file-by-bytes-in-bash) — , Mar 11 '21 at 10:05
@KamilCuk thanks for your feedback. In bash, using `>(program)` turns a program into a file that can be written to. That will be my primary use case, but `route prefix my-output-file.txt` would also be admissible. I have written such a program, but I am unsure about its efficiency (to redirect the output after detecting the prefix, I read and write byte by byte).... — masterxilo, Mar 11 '21 at 10:06
Why do you care about efficiency? Have you followed [the rules of optimization](https://wiki.c2.com/?RulesOfOptimization)? Have you profiled the code? Are you sure that I/O operations on the stream are the bottleneck? Why not just `tee >(sed -n 's/^for.p1.//p' | program1) | sed -n 's/^for.p2.//p' | program2` and forget about it? `I have written such a program` Then please post it. — KamilCuk, Mar 11 '21 at 10:07
@DavidCullen no, the question does not address "routing" the rest of the file based on the initial content. — masterxilo, Mar 11 '21 at 10:08
@KamilCuk I did not come up with this formulation/expression of a possible solution with `sed`, that's exactly the kind of answer/suggestion I am hoping for. It is important to me that the rest of the stream is preserved byte by byte (not interpreted as characters or anything), since I want to redirect the data to `curl` or `md5sum` or `gzip` etc. in the end. — masterxilo, Mar 11 '21 at 10:11
I care about performance because I hope to use this program as an intermediary for one-sided program-to-program communication/data piping, which I hope would operate at RAM/CPU speed, and the streams would never hit the disk. — masterxilo, Mar 11 '21 at 10:15
@masterxilo The routing would be handled by a bash `case` statement. Once you have the prefix, you can then use other bash commands to achieve the desired effect. Since you haven't provided your existing code, or any sample data, without changes, this question will most likely be closed. — , Mar 11 '21 at 10:19
@DavidCullen, @KamilCuk I have added my solution. I would be interested to see a fully working implementation of this `route` program using `sed` and also using `case`... — masterxilo, Mar 11 '21 at 10:28
My solution does not support 0-bytes in the prefix, but it could easily be changed to read the prefix bytestrings from files instead. But again, I would assume this problem must have been solved properly before... — masterxilo, Mar 11 '21 at 10:38

KamilCuk · Answer 1 · 2021-03-11T10:50:55.023

1

In bash shell, you would write just:

tee >(sed -n 's/^for\.p1\.//p' | program1) | sed -n 's/^for\.p2\.//p' | program2

The following program:

filter() {
    keyword=$1
    shift
    # https://stackoverflow.com/questions/407523/escape-a-string-for-a-sed-replace-pattern
    keyword=$(printf '%s\n' "$keyword" | sed -e 's/[]\/$*.^[]/\\&/g');
    LC_ALL=C sed -n "s/^$keyword//p" | "$@"
}
hexfilter() {
    keyword=$1
    shift
    keyword=$(printf '%s\n' "$keyword" | sed -e 's/../[\\x&]/g');
    LC_ALL=C sed -n "s/^$keyword//p" | "$@"
}
program1() { :; }
program2() { :; }

echo -e '\x00\xca\xfe\x2a world!' |
{ tee >(filter 'for.p1.' program1 >&3) >(hexfilter '00cafe2a' sed 's/^/Hello/' >&3) | filter 'for.p2.' program2 >&3; } 3>&1 |
cat

filters prefix 0x00 0xca 0xfe 0x2a and outputs Hello world!.

edited Mar 11 '21 at 10:50

answered Mar 11 '21 at 10:26

KamilCuk


Will sed interpret the bytes as characters? I would like to be able to use any bytestring as a prefix, if necessary I would need an automatic solution for escaping the bytes accordingly for this `sed` specification. In particular, a 0-byte should be possible in the prefix, and I know that bash/linux in general does not support 0-bytes in cli-arguments, so maybe a file would have to be used to specify the prefix... – masterxilo Mar 11 '21 at 10:31
How would this generalize to 3 or more prefixes & programs? – masterxilo Mar 11 '21 at 10:33
`Will sed interpret the bytes as characters?` Unspecified in POSIX, GNU sed has some unicode support. With `LC_ALL=C sed` then bytes are bytes - it only cares about newlines, as they end lines. `In particular, a 0-byte should be possible in the prefix` Don't trust me - test it. `echo abcd00ef0a | xxd -r -p | LC_ALL=C sed 's/^.*\x00//' | xxd -p` works fine here. `How would this generalize to 3 or more prefixes & programs?` `tee >(prog1) >(prog2) >(prog3) | prog4` etc. With bash `eval` you can generate it - remember to use `printf "%q"` to properly escape – KamilCuk Mar 11 '21 at 10:38
But indeed - handling of zero bytes is hard, because arguments themselves end with zero bytes (no matter the language). Store strings in their hex form and convert to/from with `xxd`. – KamilCuk Mar 11 '21 at 10:41
I like the extended version with filter & hexfilter! I can see how this would generalize arbitrarily. I assume like this there should not be a problem with newlines/cr either (0x0a/0x0d)? – masterxilo Mar 11 '21 at 11:00
Newlines end the line. `sed` works with lines. There will be a "problem" with newlines. – KamilCuk Mar 11 '21 at 11:03
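
The two points raised in these comments (prefixes given in hex so they can contain zero bytes, and matching that is not line-oriented) can also be sketched outside of `sed`. The C++ below is an illustration under assumptions, not part of the answer. `decodeHex` and `consumePrefix` are hypothetical helper names, and the sketch handles a single prefix only.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: decodes a hex string such as "00cafe2a" into raw bytes.
// Returns false if the input has odd length or contains a non-hex digit.
bool decodeHex(const std::string& hex, std::string& bytes) {
    auto nibble = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    };
    if (hex.size() % 2 != 0) return false;
    bytes.clear();
    for (size_t i = 0; i < hex.size(); i += 2) {
        int hi = nibble(hex[i]), lo = nibble(hex[i + 1]);
        if (hi < 0 || lo < 0) return false;
        bytes.push_back(static_cast<char>(hi * 16 + lo));
    }
    return true;
}

// Hypothetical helper: reads bytes from `in` while they match `prefix`.
// Returns true once the whole prefix has been consumed. On a mismatch the
// bytes read so far are gone, so a real router would have to buffer them.
bool consumePrefix(FILE* in, const std::string& prefix) {
    assert(in != nullptr);
    for (unsigned char expected : prefix) {
        int c = fgetc(in);
        if (c == EOF || c != expected) return false;
    }
    return true;
}
```

Because the comparison is done byte by byte on the stream, 0x00, 0x0a and 0x0d are treated like any other value, both in the prefix and in the data that follows it.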
