3

So, I am processing a lot of data and writing the output to a file. Sed doesn't report progress directly, but I know that for every line of input there must be a line of output. With that in mind, I have devised a way to track progress indirectly with sed (here's a simple example):

printf 'line: %d\n' {1..1000} | sed -un -e '0~10p' | sed -un = | xargs -d '\n' -n 1 printf '%d%%\r'

This works, but I wonder if it could be simpler.

Here is the more complicated (actual production) problem:

    local one_percent
    one_percent=$(bc -l <<< "$(wc -l "${original}"| cut -d' ' -f1) / 100")

    sed  -E -n -f "${remapper}"  "${original}" | tee "${mapping_file}" | sed -un "0~${one_percent%%.*}p"  | sed -un = | xargs -d '\n' -n 1 printf '%02d%%\r' > /dev/stderr

Can this be improved?

Update: Unfortunately, pv isn't available on the final VM and I don't have the privileges to add it.

Christian Bongiorno
  • Maybe with [`pv`](https://man7.org/linux/man-pages/man1/pv.1.html)? – Benjamin W. Aug 15 '23 at 02:39
  • [How to add a progress bar to a shell script?](https://stackoverflow.com/q/238073/4154375) might be useful. – pjh Aug 15 '23 at 11:56
  • _for every line of input there must be a line of output_: If you intend to run one or more child processes for each input line, make sure you have no more than a few dozen input lines; otherwise you will need a lot of patience while waiting for your script to finish. – user1934428 Aug 15 '23 at 12:12

6 Answers

4

If your sed is GNU sed you can add this progression to your sed script, something like:

sed -e '0~100w /dev/stderr' -f "${remapper}" ...

Every 100 lines, starting from line 100, we write the current line to /dev/stderr. You can then post-process /dev/stderr with, e.g., awk to count the lines and produce a progress bar.
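For readers unfamiliar with the `first~step` address, here is a minimal sketch of what it selects. It is a GNU sed extension, and `every_nth` is a hypothetical helper name used only for this illustration:

```shell
# every_nth N: print every Nth line of stdin, starting at line N.
# Relies on the GNU sed first~step address, so it is not portable to a
# POSIX-only sed.
every_nth() {
    sed -n "0~${1}p"
}

seq 1 10 | every_nth 3   # selects lines 3, 6 and 9
```

Swapping the `p` command for `w /dev/stderr` gives the sampling used for the progress output here.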

Here is an easy-to-test dummy example, with a 1M-line input generated by seq. The main sed script just replaces every character with an x (s/./x/g), and the standard output is discarded to /dev/null; replace these with your actual processing and redirection. The awk script prints a carriage return (0x0d), the input line number (NR) padded to 3 digits (%03d), and a %. Its standard output is redirected to standard error. If the progress goes by too fast to see, try n = 10M.

echo 's/./x/g' > remapper
n=1000000
((m=n/100))
seq 1 "$n" |
  sed -e '0~'"$m"'w /dev/stderr' -f remapper > /dev/null 2> \
  >( awk '{printf("\x0d%03d%%", NR)} END {print ""}' 1>&2 )
Renaud Pacalet
2

If you're happy reading the input multiple times, as you seem to be, then with pv:

sed -E -n -f "$remapper" "$original" |
pv -l -s $(wc -l "$original" | awk NF=1) > "$mapping_file"

To avoid re-reading, you can reorder:

pv "$original" |
sed -E -n -f "$remapper" > "$mapping_file"
jhnc
  • I started to go down this route until I determined that `pv` isn't available on the system I am running on :( – Christian Bongiorno Aug 15 '23 at 20:22
  • @ChristianBongiorno then you could perhaps update your question to make that point clear to every potential helpers? And also you could add information about your OS. – Grobu Aug 16 '23 at 14:38
1

I also think it's better to use pv (pipe viewer): pv -l -s 1000 tells pv to count lines (-l) and that the expected total is 1000 lines (-s 1000), so pv can display the progress as a percentage.

Something like this:

printf 'line: %d\n' {1..1000} | pv -l -s 1000 > output.txt
Freeman
1

For what it's worth, here's a suggestion using tail and awk.

#! /bin/sh

set -o errexit -o nounset

InputFile='./src'
OutputFile='./output'
SED_Script='./cmd.sed'

Total="$(wc -l "$InputFile")"
Total="${Total%% *}"

sed -f "$SED_Script" "$InputFile" >"$OutputFile" &

tail --pid="$!" -n +1 -f "$OutputFile" \
  | awk -v Mod="$((Total / 100))" -- 'BEGIN { FS = "\n"; Count = 0;  }
                                       !(NR % Mod) { printf "%3u%%\r", ++ Count; fflush() }'

echo 'Done!'

The idea is to let sed work in the background, building its output file as fast as possible, unimpeded by any additional pipelines.

Progress monitoring is achieved with awk taking its input from tail (which will terminate when the SED background process has ended).

Tested with a 980,000 line long input file, with mawk and gawk under Debian 11.
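One edge case worth guarding against, sketched here under the assumption that some inputs may be shorter than 100 lines: `Total / 100` is then 0, and `NR % Mod` becomes a modulo by zero in awk (gawk treats that as a fatal error). `progress_step` is a hypothetical helper, not part of the script above:

```shell
# progress_step FILE: print how many lines make up one percent of FILE,
# never less than 1, so the result is always safe to use as a modulus.
progress_step() {
    total=$(wc -l < "$1")        # line count only, no file name
    step=$(( total / 100 ))      # integer division
    [ "$step" -gt 0 ] || step=1  # files under 100 lines would give 0
    printf '%s\n' "$step"
}
```

It could be used as `awk -v Mod="$(progress_step "$InputFile")" ...`. With a step of 1, a short file simply reports one tick per line rather than a true percentage.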


Update

Ditched pv in favour of awk since the former isn't available in OP's environment.

Grobu
1

I decided not to add sed commands because I don't know what you might be doing in your $remapper, and didn't want to alter your possible logging.

awk handles record counts, math and output control better than sed, so I used that for a simple single-process wrapper.

$: awk 'NR==FNR{onepct=int(NR/100)} # whole number of records in one percent
        NR>FNR && onepct && 0==FNR%onepct { printf "\r%02d%%", ++pctcnt } 
        END{printf "\n"}       # add a clean newline at the end
       ' "$original" <( sed -un -f "$remapper" "$original" )

It could certainly be improved, but the awk silently scans through your original file once to count the lines and work out how many make up one percent. It then reads through the output of your sed command as-is, and uses modulo division to print only once on each percent-without-remainder line. I used a simple increment for percentage output.

Keeping the operations in memory and generating fewer processes and less overall IO should speed things up if it's a sizeable dataset.

A simple pipe works just as well; just use a dash as the second input:

$: sed -un -f "$remapper" "$original" |
   awk 'NR==FNR{onepct=int(FNR/100)} # whole number of records in one percent
        NR>FNR && onepct && 0==FNR%onepct { printf "\r%02d%%", ++pctcnt } 
        END{printf "\n"}       # add a clean newline at the end
       ' "$original" -

If you prefer to handle the record counting with wc, it's even simpler.

awk -v pct=$(($(wc -l<"$original")/100)) '0==NR%pct{printf "\r%02d%%",++pctcnt} END{printf "\n"}' <(sed -unf "$remapper" "$original")

or

sed -unf "$remapper" "$original" |
 awk -v pct=$(($(wc -l<"$original")/100)) '0==NR%pct{printf "\r%02d%%",++pctcnt} END{printf "\n"}' 

Of course, to preserve your output you either need the tee or could have awk do it.

sed -unf "$remapper" "$original" | tee "${mapping_file}" |
 awk -v pct=$(($(wc -l<"$original")/100)) '0==NR%pct{printf "\r%02d%%",++pctcnt} END{printf "\n"}' 

or

sed -unf "$remapper" "$original" | 
 awk -v pct=$(($(wc -l<"$original")/100)) -v out="$mapping_file" '
   { print > out }
   0==NR%pct { printf "\r%02d%%", ++pctcnt }
   END { printf "\n"; close(out); } ' 
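As a variation on the increment-based counter, here is a rough sketch that derives the percentage from the actual line count, so it cannot drift past 100% when the total is not a multiple of 100. `show_percent` is a hypothetical name, and the sketch assumes an awk that can write to `/dev/stderr` (gawk and mawk on Linux do):

```shell
# show_percent TOTAL: copy stdin to stdout unchanged and report progress
# on stderr, computed from the number of lines seen so far.
show_percent() {
    awk -v total="$1" '
        { print }                                # pass the line through
        total > 0 {
            pct = int(NR * 100 / total)
            if (pct > 100) pct = 100             # clamp if extra lines arrive
            if (pct != last) {                   # only redraw on a change
                printf "\r%3d%%", pct > "/dev/stderr"
                last = pct
            }
        }
        END { printf "\n" > "/dev/stderr" }      # finish with a clean newline
    '
}
```

Possible use: `sed -unf "$remapper" "$original" | show_percent "$(wc -l < "$original")" > "$mapping_file"`.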
Paul Hodges
  • Be aware that you are getting integer math for the percentage, and there may be a few records processed at the end after the `100%` prints but before it finishes. Likewise, saying `%02d` will force *at least* two digits, but the final `100%` message exceeds the spec and is all printed. That's unlikely to be an issue, but better to be aware of such details. – Paul Hodges Aug 15 '23 at 17:28
  • Yeah, I thought about that already and made some adjustments. Thanks – Christian Bongiorno Aug 16 '23 at 16:13
1

Since pv is apparently unavailable, here's an approach using the status=progress feature of GNU dd, which might be more available, and any awk.

It can be dropped into a command to replace the (single) input filename. Probably can't be used sensibly with piped input since total size can't be calculated.

Define a bash function (could also be a bash shell script):

show_progress()(
    # $1 is a file to read

    # add error checking here

    # stat can be approximated by "du -sk" scaled appropriately
    sz=$(stat -c%s "$1")

    dd status=progress if="$1" 2> >(
        awk -v sz=$sz -v RS=$'\r' '
            BEGIN { pc = 100/sz }
            $2=="bytes" {
                printf "%.2f%%\r", $1*pc
                fflush()
            }
        ' 1>&2
    ) &&
    echo "100.00%" 1>&2
)

Use in a bash commandline as:

sed -E -n -f "$remapper" <(show_progress "$original") >"$mapping_file"

or (implemented as a bash script) with any POSIX shell:

show_progress "$original" |
sed -E -n -f "$remapper" >"$mapping_file"

The >(...) is bash process substitution. 2> >(...) redirects stderr to the named pipe (the space is important).

The nice thing is that the input file does not have to be scanned multiple times, nor the output scanned.

Note that there is a potential lag after the final progress message is printed (after the input is consumed) if processing is slow.

jhnc
  • I'm wrong. Using `dd` is obviously an extra pass over the input. – jhnc Aug 16 '23 at 16:45
  • To have just a single pass over the input, on linux one could loop reading `pos` from /proc/`pid`/fdinfo/* in a separate process. But for that, installing [`progress`](https://github.com/Xfennec/progress) probably becomes simpler to maintain, and much more powerful. – jhnc Aug 16 '23 at 17:00
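The `/proc` idea from the last comment can be sketched roughly as follows. This is a Linux-only illustration, not something taken from the answers: `run_with_progress` is a hypothetical name, it assumes GNU `stat` and a `sleep` that accepts fractional seconds, and it assumes the command reads the file on standard input.

```shell
# run_with_progress FILE CMD [ARGS...]
# Run CMD with FILE on stdin and poll the kernel's record of how far
# CMD has read, printing a percentage on stderr.
run_with_progress() {
    local input=$1 size pid pos rc
    shift
    size=$(stat -c %s "$input")          # GNU stat: total bytes to read
    "$@" < "$input" &                    # the command reads the file on fd 0
    pid=$!
    # fdinfo/0 disappears once the process has exited, which ends the loop.
    while [ -r "/proc/$pid/fdinfo/0" ]; do
        pos=$(awk '$1 == "pos:" { print $2 }' "/proc/$pid/fdinfo/0" 2>/dev/null) || break
        if [ -n "$pos" ] && [ "$size" -gt 0 ]; then
            printf '\r%3d%%' "$(( pos * 100 / size ))" >&2
        fi
        sleep 0.2
    done
    wait "$pid"; rc=$?
    [ "$rc" -eq 0 ] && printf '\r100%%\n' >&2
    return "$rc"
}
```

Possible use: `run_with_progress "$original" sed -E -n -f "$remapper" > "$mapping_file"`. The offset reflects what the command has read, including data still sitting in its buffers, so the figure can run slightly ahead of the output actually written.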