
I would like to substitute a set of single-byte characters with a set of literal strings in a stream, without any constraint on the line length.

#!/bin/bash

for (( i = 1; i <= 0x7FFFFFFFFFFFFFFF; i++ ))
do
    printf '\a,\b,\t,\v'
done |
chars_to_strings $'\a\b\t\v' '<bell>' '<backspace>' '<horizontal-tab>' '<vertical-tab>'

The expected output would be:

<bell>,<backspace>,<horizontal-tab>,<vertical-tab><bell>,<backspace>,<horizontal-tab>,<vertical-tab><bell>...

I can think of a bash function that would do that, something like:

chars_to_strings() {
    local delim buffer
    while true
    do
        delim=''
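        # read at most 4096 chars, stopping early if the '.' delimiter shows up;
        # success with fewer than 4096 chars means a '.' was consumed and must be re-emitted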
        IFS='' read -r -d '.' -n 4096 buffer && (( ${#buffer} != 4096 )) && delim='.'

        if [[ -n "${delim:+_}" ]] || [[ -n "${buffer:+_}" ]]
        then
            # Do the replacements in "$buffer"
            # ...
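            # For example (a sketch of my own, not from the question), pure-bash
            # parameter expansion could do the substitutions:
            #   buffer=${buffer//$'\a'/<bell>}
            #   buffer=${buffer//$'\b'/<backspace>}
            #   buffer=${buffer//$'\t'/<horizontal-tab>}
            #   buffer=${buffer//$'\v'/<vertical-tab>}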

            printf "%s%s" "$buffer" "$delim"
        else
            break
        fi
    done
}

But I'm looking for a more efficient way. Any thoughts?

Fravadona

5 Answers


Since you seem to be okay with using ANSI-C quoting via $'...' strings, maybe use sed?

sed $'s/\a/<bell>/g; s/\b/<backspace>/g; s/\t/<horizontal-tab>/g; s/\v/<vertical-tab>/g'

Or, via separate commands:

sed -e $'s/\a/<bell>/g' \
    -e $'s/\b/<backspace>/g' \
    -e $'s/\t/<horizontal-tab>/g' \
    -e $'s/\v/<vertical-tab>/g'

Or, using awk, which replaces newline characters too (by customizing the Output Record Separator, i.e., the ORS variable):

$ printf '\a,\b,\t,\v\n' | awk -vORS='<newline>' '
  {
    gsub(/\a/, "<bell>")
    gsub(/\b/, "<backspace>")
    gsub(/\t/, "<horizontal-tab>")
    gsub(/\v/, "<vertical-tab>")
    print $0
  }
'
<bell>,<backspace>,<horizontal-tab>,<vertical-tab><newline>
Ionuț G. Stan
  • Whether or not `sed` tolerates those bytes in its input is another matter. Maybe try Perl instead if you are on a platform with a very traditional `sed`. – tripleee Sep 22 '22 at 14:27
  • I thought of it because most `sed` implementations dynamically allocate their input buffer, but it crashes when you don't encounter any newline character and don't have enough RAM to fit the input. Also, it will be tricky to replace a newline character when it is in the list of characters to replace – Fravadona Sep 22 '22 at 14:27
  • @tripleee you're right. It seems to work as expected with macOS's built-in sed, but the output seems confused with GNU sed. – Ionuț G. Stan Sep 22 '22 at 14:31
  • @Fravadona I've added an AWK version too, which seems to handle your large sample input quite well. – Ionuț G. Stan Sep 22 '22 at 14:50
  • Trad Awk (Debian package `original-awk`) does not seem to be able to recognize `\t` or `\v`. I would also expect it to have issues with completely unbounded input. – tripleee Sep 22 '22 at 15:10
  • @tripleee huh, thanks! That's too bad then. – Ionuț G. Stan Sep 22 '22 at 15:55

For a simple one-liner with reasonable portability, try Perl.

for (( i = 1; i <= 0x7FFFFFFFFFFFFFFF; i++ ))
do
    printf '\a,\b,\t,\v'
done |
perl -pe 's/\a/<bell>/g;
  s/[\b]/<backspace>/g;   # a bare \b in a regex is a word boundary, [\b] is a backspace
  s/\t/<horizontal-tab>/g;
  s/\x0B/<vertical-tab>/g # \v in a Perl regex is any vertical whitespace, \x0B is the VT byte
'

Perl manages its input buffer dynamically, so it is not encumbered by lines longer than some fixed-size buffer.
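
That said, -p still reads one "line" at a time, so a stream with no newlines at all is buffered until EOF. One way to keep memory bounded in that case is to read fixed-size records instead of lines; a sketch of my own, not part of the original answer (4096 is an arbitrary chunk size):

perl -pe 'BEGIN { $/ = \4096 }  # read records of at most 4096 bytes instead of lines
  s/\a/<bell>/g;
  s/[\b]/<backspace>/g;
  s/\t/<horizontal-tab>/g;
  s/\x0B/<vertical-tab>/g'

Since every pattern here is a single byte, a record boundary can never split a match, so chunking is safe.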

Perl by itself is not POSIX, of course; but it can be expected to be installed on any even remotely modern platform (short of perhaps embedded systems etc).

tripleee

Assuming the overall objective is to provide the ability to process a stream of data in real time without having to wait for an EOL/end-of-buffer occurrence to trigger processing ...

A few items:

  • continue to use the while/read -n loop to read a chunk of data from the incoming stream and store it in a buffer variable
  • push the conversion code into something that's better suited to string manipulation (ie, something other than bash); for the sake of discussion we'll choose awk
  • within the while/read -n loop, printf "%s\n" "${buffer}" and pipe the output from the while loop into awk; NOTE: the key item is to introduce an explicit \n into the stream so as to trigger awk processing for each new 'line' of input; OP can decide if this additional \n must be distinguished from a \n occurring in the original stream of data
  • awk then parses each line of input as per the replacement logic, making sure to prepend anything left over to the next line of input (ie, for when the while/read -n breaks an item in the 'middle')

General idea:

chars_to_strings() {
    while read -r -n 15 buffer               # using '15' for demo purposes otherwise replace with '4096' or whatever OP wants
    do
        printf "%s\n" "${buffer}"
    done | awk '{print NR,FNR,length($0)}'   # replace 'print ...' with OP's replacement logic
}

Take for a test drive:

for (( i = 1; i <= 20; i++ ))
do  
    printf '\a,\b,\t,\v'
    sleep 0.1                 # add some delay to data being streamed to chars_to_strings()
done | chars_to_strings 

1 1 15                        # output starts printing right away
2 2 15                        # instead of waiting for the 'for'
3 3 15                        # loop to complete
4 4 15
5 5 13
6 6 15
7 7 15
8 8 15
9 9 15

A variation on this idea using a named pipe:

mkfifo /tmp/pipeX

sleep infinity > /tmp/pipeX                        # keep pipe open so awk does not exit

awk '{print NR,FNR,length($0)}' < /tmp/pipeX &

chars_to_strings() {
    while read -r -n 15 buffer
    do
        printf "%s\n" "${buffer}"
    done > /tmp/pipeX
}

Take for a test drive:

for (( i = 1; i <= 20; i++ ))
do
    printf '\a,\b,\t,\v'
    sleep 0.1
done | chars_to_strings

1 1 15                        # output starts printing right away
2 2 15                        # instead of waiting for the 'for'
3 3 15                        # loop to complete
4 4 15
5 5 13
6 6 15
7 7 15
8 8 15
9 9 15

# kill background 'awk' and/or 'sleep infinity' when no longer needed
markp-fuso
  • Nice. It forces the input to be processed by chunks for working around the regex engine limitation of not starting before encountering the record separator, and it should accelerate my implementation of `chars_to_strings`. What I don't know is how to accurately add or not add a last `\n` at the end of the output of `awk`: – Fravadona Sep 23 '22 at 07:18
  • 1
    one kludge would be to terminate `${buffer}` with a nonsensical sequence + `\n`; since you're dealing with binary/non-printing characters I'm guessing you may have some ideas on a binary sequence you wouldn't expect to see in your input stream which in turn could be tacked on the end of `${buffer}`; then in the `awk` code you just look for that sequence on the end of `$0` when determing if you should (not) add a `\n` ... ??? see comments to [this answer](https://stackoverflow.com/a/73252784) re: suggestions on said binary sequences – markp-fuso Sep 23 '22 at 13:29
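
Building on that comment, here is a hedged sketch of the sentinel idea (my own illustration, not part of the answer; \1 is an arbitrary sentinel byte assumed never to appear in the data, and the substitutions reuse the gsub() calls from the other answers). The shell loop knows whether the \n it appends is artificial, flags those with the sentinel, and awk drops the flagged newlines after doing the replacements:

chars_to_strings() {
    local buffer eof='' sentinel=$'\1'          # \1 assumed absent from the data
    while [[ -z "$eof" ]]
    do
        IFS='' read -r -n 4096 buffer || eof=1
        [[ -n "$eof" && -z "$buffer" ]] && break
        if [[ -n "$eof" ]] || (( ${#buffer} == 4096 ))
        then
            # this read did not consume a newline: mark the \n we add as artificial
            printf '%s%s\n' "$buffer" "$sentinel"
        else
            # read stopped on a genuine newline from the input: reproduce it
            printf '%s\n' "$buffer"
        fi
    done |
    awk '{
        artificial = sub("\001$", "")           # strip the sentinel; remember whether the newline was ours
        gsub(/\a/, "<bell>")
        gsub(/\b/, "<backspace>")
        gsub(/\t/, "<horizontal-tab>")
        gsub(/\v/, "<vertical-tab>")
        printf "%s", $0
        if (!artificial) printf "\n"
    }'
}

Like the read -n loop above, this relies on bash variables, which cannot carry NUL bytes.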

To have NO constraint on the line length you could do something like this with GNU awk:

awk -v RS='.{1,100}' -v ORS= '{
    $0 = RT
    gsub(foo,bar)
    print
}'

That will read and process the input 100 chars at a time no matter which chars are present, whether it has newlines or not, and even if the input was one multi-terabyte line.

Replace gsub(foo,bar) with whatever substitution(s) you have in mind, e.g.:

$ printf '\a,\b,\t,\v' |
    awk -v RS='.{1,100}' -v ORS= '{
        $0 = RT
        gsub(/\a/,"<bell>")
        gsub(/\b/,"<backspace>")
        gsub(/\t/,"<horizontal-tab>")
        gsub(/\v/,"<vertical-tab>")
        print
    }'
<bell>,<backspace>,<horizontal-tab>,<vertical-tab>

And of course it'd be trivial to pass a list of old and new strings to awk rather than hardcoding them; you'd just have to sanitize any regexp or backreference metachars before calling gsub().
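
A hedged sketch of that idea (the argument-passing convention, variable names, and escaping rules are my own, not part of the answer above): pass the old/new strings as trailing awk arguments, sanitize them in a BEGIN block, and loop over the pairs for each chunk:

printf '\a,\b,\t,\v' |
awk -v RS='.{1,100}' -v ORS= '
    BEGIN {
        n = (ARGC - 1) / 2                            # old/new pairs passed as trailing arguments
        for (i = 1; i <= n; i++) {
            old[i] = ARGV[2*i - 1]
            new[i] = ARGV[2*i]
            gsub(/\\/, "\\\\&", old[i])               # escape backslashes first...
            gsub(/[][^$.*+?(){}|]/, "\\\\&", old[i])  # ...then the other ERE metacharacters
            gsub(/\\/, "\\\\&", new[i])               # in replacements, escape backslashes...
            gsub(/&/, "\\\\&", new[i])                # ...and ampersands
            delete ARGV[2*i - 1]
            delete ARGV[2*i]                          # deleted so awk still reads stdin
        }
    }
    {
        $0 = RT
        for (i = 1; i <= n; i++) gsub(old[i], new[i])
        print
    }
' "$(printf '\a')" '<bell>' "$(printf '\b')" '<backspace>' \
  "$(printf '\t')" '<horizontal-tab>' "$(printf '\v')" '<vertical-tab>'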

Ed Morton

Don't waste FS/OFS: use those built-in variables to take care of 2 of the 5 replacements needed:

echo $'   \t   abc xyz    \t  \a   \n\n ' | 
mawk 'gsub(/\7/,  "<bell>", $!(NF = NF)) + gsub(/\10/,"<bs>") +\
      gsub(/\11/,"<h-tab>")^_' OFS='<v-tab>'  FS='\13'  ORS='<newline>'
   <h-tab>   abc xyz    <h-tab>  <bell>   <newline><newline> <newline>
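
For readability, a more conventional spelling of the same trick (my own rewrite, assumed equivalent; rebuilding the record turns every FS, i.e. \v, into OFS, and ORS handles the newlines):

echo $'   \t   abc xyz    \t  \a   \n\n ' |
mawk '{
    $1 = $1                       # force a record rebuild: every FS (\v) becomes OFS (<v-tab>)
    gsub(/\7/,  "<bell>")
    gsub(/\10/, "<bs>")
    gsub(/\11/, "<h-tab>")
    print                         # ORS turns the record-ending newline into <newline>
}' FS='\13' OFS='<v-tab>' ORS='<newline>'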
RARE Kpop Manifesto