
Given this bash script running on macOS Ventura 13.2.1 (note: I can't use the default zsh due to a limitation of other tooling)...

#!/bin/bash

# title   = My Test Script
# include = File1.swift
# include = File2.swift
# include = Source File 3.swift

function extractCommentParams {

    local key=$1
    local soughtRx="^[[:space:]]*#[[:space:]]*$key[[:space:]]*[=:][[:space:]]*(.*[^[:space:]])[[:space:]]*\$"
    local replacementRx="\1"

    while IFS= read -r line; do
    
        local lineResult=$(sed -E "s|"$soughtRx"|"$replacementRx"|i" <<< "$line")

        if [ ! "$lineResult" = "$line" ]; then
            echo "$lineResult"
        fi

    done
}

cat "$0" | extractCommentParams "include"

I get the output I would expect...

File1.swift
File2.swift
Source File 3.swift

However, if I try getting just the first filename by piping those results into the head command...

cat "$0" | extractCommentParams "include" | head -n 1

I get a broken pipe error for all but the first output.

File1.swift
write error: Broken pipe
write error: Broken pipe

In contrast, if I run something like ls -l, or any other command that produces multiple lines, and pipe the results into head, it works. So I must be doing something wrong in my function above, but I'm not seeing what.

I also tried first storing the results in a local variable, then echoing that variable out at the end, thinking it would have closed the pipe, but I couldn't get that to work either.

So how can I suppress the broken pipe error?

Update - Possible solution

This is the solution I came up with. I'm not sure if this is considered 'good practice' or I'm doing something bad by swallowing STDERR on that echo statement.

Likewise, I'm not sure if I should be returning zero, or the return code for the failed echo statement when it happens. (I'm thinking it doesn't actually matter because the only way it can fail is if something like head closes the pipe when it's gotten all it needs, but that means head itself was successful, and that's what's ultimately returned for $? regardless of what I return here.)

function extractCommentParams {

    local key=$1
    local soughtRx="^[[:space:]]*#[[:space:]]*$key[[:space:]]*[=:][[:space:]]*(.*[^[:space:]])[[:space:]]*\$"
    local replacementRx="\1"

    while IFS= read -r line; do
    
        local lineResult=$(sed -E "s|"$soughtRx"|"$replacementRx"|i" <<< "$line")

        if [ ! "$lineResult" = "$line" ]; then
            echo "$lineResult" 2>/dev/null   # <-- Swallow STDERR for the echo
            ECHO_RESULT=$?
            if [ $ECHO_RESULT -ne 0 ]; then  # <-- If the echo failed, then bail out of the function
                exit $ECHO_RESULT            # <-- Not sure if I should be returning 0 or $ECHO_RESULT here
            fi
        fi

    done
}
Mark A. Donohoe
  • How much data is generated by the thing on the left matters. If it's a small enough amount of data that it's able to complete without filling the pipeline and getting stuck, it never sees the error. – Charles Duffy Mar 17 '23 at 01:55
  • And PCRE is not "more robust" than traditional UNIX regex implementations; very much the opposite. See https://swtch.com/~rsc/regexp/regexp1.html for background on how people following Larry Wall's footsteps threw the results of decades of good computer science theory out the window, throwing out tooling with reliable performance guarantees in favor of a few shiny baubles but horrible worst-case behavior. – Charles Duffy Mar 17 '23 at 01:57
  • Does this answer your question? [Bash: Head & Tail behavior with bash script](https://stackoverflow.com/questions/26461014/bash-head-tail-behavior-with-bash-script) – pjh Mar 17 '23 at 02:01
  • (Going back to the first comment to emphasize: We need to know how many bytes of output `ls` generated with your test directory, and how many bytes your `cat` emitted). – Charles Duffy Mar 17 '23 at 02:01
  • The above output shows you exactly how many bytes `cat` outputted before it threw the error (ie after the first line), and that is substantially less than the amount of bytes that `ls -l` outputted. My guess is it’s not actually about the number of bytes but rather the newlines because it throws the error as soon as it attempts to write the second line of output, which is immediately after the pipe is closed by `head -n 1`. I’m pretty sure something in the `ls` source code stops this that I’m not taking into consideration. That’s what I’m trying to figure out. – Mark A. Donohoe Mar 17 '23 at 02:11
  • @CharlesDuffy, admittedly, perhaps I used the wrong term with 'more robust', but put another way, almost every other regex language that I've used understands the basic 'tokens' such as `\s` to represent any whitespace and `\S` to represent any non-whitespace, etc. Being able to use a common, shared, and familiar syntax (yes, potentially even at the expense of speed) is a welcome tradeoff for me. – Mark A. Donohoe Mar 17 '23 at 04:55
  • GNU `sed` probably accepts some Perlisms like `\s` and `\S`; but sticking to the proper portable syntax seems like a better solution here. – tripleee Mar 17 '23 at 06:25
  • Is there a reason you are not accepting the duplicate nomination by pjh? This seems like exactly the same question (bar the meandering into other topics, which should not be in this question anyway)? – tripleee Mar 17 '23 at 06:28
  • It's actually not the same. That person is asking if you are using `head -n5` and have code that echoes out five lines, then also echoes out three more lines, does the code that echoes out the three lines still run. That's a different topic. Additionally, they are talking about the difference between `echo` and `/bin/echo`, one being an internal command, the other being external. Again, not the same thing. I however am specifically asking how do I suppress the error that is showing up when running my command, making it match the behavior of `ls -l` and others when piping through `head`. – Mark A. Donohoe Mar 17 '23 at 06:36
  • Re-titled the question to better reflect what I'm asking. – Mark A. Donohoe Mar 17 '23 at 06:37
  • re: "I’m pretty sure something in the ls source code stops this" This is probably (almost certainly) just `ls` handling SIGPIPE gracefully. You may be using a different implementation of `ls`, but see https://github.com/coreutils/coreutils/blob/master/src/ls.c#L1562 – William Pursell Mar 17 '23 at 11:05
  • This is very close to a duplicate of [`zcat | head` causes pipefail](https://stackoverflow.com/questions/41516177/bash-zcat-head-causes-pipefail) – Charles Duffy Mar 19 '23 at 22:57
  • (BTW, if you read the paper closely enough to notice that the PCRE numbers were in seconds, while the NFA numbers were in microseconds -- a smaller performance difference might be worth the usability cost, but this is well into "prone to a bunch of easy denial-of-service attacks" space; and indeed, we _do_ see a lot of denial-of-service attacks leveraging regexes with poor worst-case behavior; see https://blog.logrocket.com/protect-against-regex-denial-of-service-redos-attacks/ as one introduction to the topic). – Charles Duffy Mar 19 '23 at 23:03
  • Yeah, I actually thought that was a pretty great write-up. I originally was just going to scan it, but it was interesting enough that it made me want to dig deeper. Still, that's about the implementation. I'm saying I just wish they had the simpler tokens so it's not so damn verbose! But hey... it works so there you go! – Mark A. Donohoe Mar 20 '23 at 00:59
  • @CharlesDuffy, I don't think this question is related to [bash zcat head causes pipefail?](https://stackoverflow.com/q/41516177/4154375). `pipefail` is not set in the code, and the problem is unwanted 'broken pipe' messages, not unexpected failures or exit statuses. Since the code produces no messages when I've tested it with multiple Bash versions (3, 4, 5), and comments refer to problems trapping `SIGPIPE` signals, my best guess is that the version of Bash where the problem is occurring has a bug that is causing it to mishandle `SIGPIPE` signals. – pjh Mar 20 '23 at 17:24
  • Related insofar as that link explains why the command on the left-hand side is justified/legitimate/correct to call the case a failure. Yes, the pipefail bits don't apply, but those aren't the parts I meant to draw attention to. – Charles Duffy Mar 24 '23 at 17:35

1 Answer

I can't reproduce the problem using the code in the question (with the modified last line) in Bash version 3, 4, or 5 on Linux or Cygwin. In all tests the code produces a single line of output:

File1.swift

However, I get different output if I add this line at the start of the code:

trap '' SIGPIPE

Then the output becomes:

File1.swift
testprog: line 21: echo: write error: Broken pipe
testprog: line 21: echo: write error: Broken pipe

I think the most likely cause of your problem is that the shell running the program is ignoring the SIGPIPE signal. The default action on receiving the SIGPIPE signal is to exit immediately, and silently. That's why programs like ls, and the unmodified Bash code, behave as they do.
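The two behaviours can be compared side by side with a small sketch. The helper emit_lines is hypothetical (it is not part of the question's code); it writes far more than a pipe buffer holds, so at least one write is attempted after head has exited. Under the assumption that the script itself was started with the default SIGPIPE handling, the first capture should be empty and the second should contain a single 'Broken pipe' message.

```shell
#!/bin/bash
export LC_ALL=C   # keep error messages in English

# Hypothetical helper: write 20000 lines with the echo builtin,
# stopping at the first failed write.
emit_lines() {
    local i
    for (( i = 1; i <= 20000; i++ )); do
        echo "line $i" || return 1
    done
}

# Default handling: the writing subshell is terminated silently by SIGPIPE.
default_err=$( { ( trap - PIPE; emit_lines ) | head -n 1 >/dev/null; } 2>&1 )

# Ignored SIGPIPE: the write fails instead, and echo reports it on stderr.
ignored_err=$( { ( trap '' PIPE; emit_lines ) | head -n 1 >/dev/null; } 2>&1 )

printf 'default: [%s]\n' "$default_err"
printf 'ignored: [%s]\n' "$ignored_err"
```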

Was the code in the question cut down from a larger body of code that includes trap '' SIGPIPE (or similar, e.g. trap "" PIPE) somewhere in it? If it was, you can make the problem go away by simply removing or disabling that line of code.

If you are seeing the problem with exactly the code in the question, then something non-obvious is causing the default SIGPIPE handling to be disabled. One possibility is a configuration file that is being read due to an environment setting. You might be able to prevent that by changing the shebang line to

#!/bin/bash -p

The -p option disables the use of some environment variables that can affect how Bash programs behave. I always use it to reduce the chance of surprises.

If that doesn't help, you could try explicitly restoring the default handling of SIGPIPE by putting this at the start of the code:

trap - SIGPIPE
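One caveat, taken from the Bash manual's description of trap: signals that were ignored upon entry to the shell cannot be trapped or reset. So if whatever launched the script had SIGPIPE ignored, neither trap - SIGPIPE nor a custom handler inside the script will change anything. A minimal sketch of that situation, where the outer subshell stands in for a hypothetical launcher (an assumption for illustration, not something known about the asker's setup):

```shell
#!/bin/bash
export LC_ALL=C   # keep error messages in English

# The outer subshell ignores SIGPIPE, then starts a fresh bash.
# The inner bash inherits the ignored signal, and its own
# "trap - PIPE" cannot restore the default action.
inherited_err=$(
    {
        (
            trap '' PIPE
            bash -c '
                trap - PIPE
                i=0
                while [ "$i" -lt 20000 ] && echo "line $i"; do
                    i=$((i + 1))
                done
            '
        ) | head -n 1 >/dev/null
    } 2>&1
)

printf 'stderr from the inner shell: [%s]\n' "$inherited_err"
```

The inner loop stops at the first failed echo, so the captured text is a single 'write error: Broken pipe' message even though the inner script asked for default handling.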

If that doesn't work, a possible cause of your problem is that your Bash has been built to ignore SIGPIPE signals by default. It seems unlikely that that is normal for Bash on macOS (I'd expect to see your question being asked often if it was) but I don't have access to a macOS system for testing. The easiest workaround would be to create a SIGPIPE handler that explicitly exits:

trap 'exit 1' SIGPIPE

If the code is part of a program that needs to have SIGPIPE ignored elsewhere, then you will need to change the implementation of the function. The simplest option may be to run the body of the function in a subshell and set default SIGPIPE handling in that subshell:

function extractCommentParams {
    (
        trap - SIGPIPE

        local key=$1
        ...
    )
}

Note the parentheses (aka round brackets) inside the braces (aka curly brackets): they make the function body run in a subshell. That creates an extra subprocess each time the function is called, but it shouldn't make it noticeably slower given that the code already runs a subprocess (sed) per line of input.
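A self-contained sketch of the subshell pattern, using a hypothetical function emitQuietly rather than the question's function. The trap '' PIPE at the top stands in for code elsewhere in the program that ignores SIGPIPE. Because that ignore is set by the script itself (not inherited from the process that started it), the subshell is able to restore the default action.

```shell
#!/bin/bash
export LC_ALL=C

trap '' PIPE   # stand-in for code elsewhere that needs SIGPIPE ignored

# Hypothetical function: the inner parentheses run the body in a
# subshell, where the default SIGPIPE action is restored.
function emitQuietly {
    (
        trap - PIPE
        local i
        for (( i = 1; i <= 20000; i++ )); do
            echo "line $i" || exit 1
        done
    )
}

first=$(emitQuietly | head -n 1)
printf 'first line: %s\n' "$first"
```

The rest of the script keeps SIGPIPE ignored; only the subshell that produces the output is affected.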

Another option is to use sed to do all of the work, not run it line-by-line:

#!/bin/bash -p

# title   = My Test Script
# include = File1.swift
# include = File2.swift
# include = Source File 3.swift

function extractCommentParams
{
    local -r key=$1

    local keyEscaped
    keyEscaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$key")
    local -r s='[[:space:]]'
    local -r S='[^[:space:]]'
    local -r soughtRx="^$s*#$s*$keyEscaped$s*[=:]$s*\\(.*$S\\)$s*\$"

    sed -n "s/$soughtRx/\\1/ip"
}

cat "$0" | extractCommentParams "include" | head -n 1
  • This works regardless of the SIGPIPE handling in the shell because the SIGPIPE signal is seen only by sed, not the shell.
  • See "Is it possible to escape regex metacharacters reliably with sed" for an explanation of keyEscaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$key").
  • For maximum portability I stuck to POSIX sed features where I could (so no -E option and no \s or \S patterns). The one exception is the i (case-insensitive) flag carried over from the question, which is an extension in GNU and BSD sed rather than part of POSIX.1-2017. I used variables s and S in an attempt to make the regular expression more readable. I suspect that some people wouldn't actually find it more readable.
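To see what the escaping step does, here is a small sketch. The wrapper name escapeForSed is made up for illustration; the sed command inside it is the one from the code above. Every character except ^ is wrapped in a bracket expression, and ^ gets a backslash, so the key is matched literally.

```shell
#!/bin/bash

# Hypothetical wrapper around the escaping command used above.
escapeForSed() {
    sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$1"
}

escapeForSed 'a.b'    # prints [a][.][b]
escapeForSed 'x^2'    # prints [x]\^[2]

# With the escaped key, the dot in "a.b" no longer matches the "x" in "axb".
keyEscaped=$(escapeForSed 'a.b')
matches=$(printf '# a.b = yes\n# axb = no\n' |
    sed -n "s/^# $keyEscaped = \\(.*\\)\$/\\1/p")
printf '%s\n' "$matches"    # prints yes
```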

In most cases, it's better to use sed, or other specialist tools, instead of shell loops to process text. See "Why is using a shell loop to process text considered bad practice?" for more information on this topic. If, for reasons not obvious in the question, you are sure that you really need to use a shell loop for what you are doing, then you can make it much faster by using the built-in regular expression support in Bash rather than running sed on each line. Here is one way to do it:

function extractCommentParams
{
    local -r key=$1

    local -r preKeyRx="^[[:space:]]*#[[:space:]]*"
    local -r postKeyRx="[[:space:]]*[=:][[:space:]]*(.*[^[:space:]])[[:space:]]*\$"
    local line
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ $preKeyRx"$key"$postKeyRx ]]; then
            printf '%s\n' "${BASH_REMATCH[1]}"
        fi
    done
}
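Two details of that loop are easy to miss, so here is a short sketch of each. The variable names (pre, post, unquoted, quoted, count) are only for illustration. First, inside [[ ... =~ ... ]] a quoted part of the pattern is matched literally, which is why "$key" is quoted while the surrounding regex variables are not. Second, the || [[ -n $line ]] test keeps a final line that has no trailing newline from being dropped.

```shell
#!/bin/bash

key='a.b'
pre='^'
post='$'

# Unquoted: the dot in $key acts as a regex wildcard, so "axb" matches.
if [[ 'axb' =~ $pre$key$post ]]; then unquoted=match; else unquoted=nomatch; fi

# Quoted: "$key" is matched literally, so "axb" does not match.
if [[ 'axb' =~ $pre"$key"$post ]]; then quoted=match; else quoted=nomatch; fi

# read returns non-zero on a last line without a newline, but $line
# still holds the text; the extra test lets the loop process it.
count=0
while IFS= read -r line || [[ -n $line ]]; do
    count=$((count + 1))
done < <(printf 'first\nlast without newline')

printf '%s %s %s\n' "$unquoted" "$quoted" "$count"   # prints match nomatch 2
```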

It appears from information in the comments that the SIGPIPE signal may not be disabled but is not being handled properly. I guess there may be a bug in the version of Bash being used. It also appears from the comments that SIGPIPE signals are being handled correctly in programs called from the shell (e.g. ls). If that is the case, one workaround for the problem is to ensure that all output is done by external programs. The sed-only solution above should just work. The line-by-line solution can be modified to use the printf program instead of the printf builtin. One way to do it is:

function extractCommentParams
{
    local -r key=$1

    local -r preKeyRx="^[[:space:]]*#[[:space:]]*"
    local -r postKeyRx="[[:space:]]*[=:][[:space:]]*(.*[^[:space:]])[[:space:]]*\$"
    local outputs=()
    local line
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ $preKeyRx"$key"$postKeyRx ]]; then
            outputs+=( "${BASH_REMATCH[1]}" )
        fi
    done

    if (( ${#outputs[*]} > 0 )); then
        env printf '%s\n' "${outputs[@]}"
    fi
}
  • To minimize performance problems due to running a subprocess for every line of output, collect the output lines in an array (outputs) and print them all together when the input has been fully processed.
  • Using env printf ... causes the printf command to be found on the PATH instead of using the builtin.

Finally, if you really need to use a loop and printf to produce the output, and you really need to have SIGPIPE ignored, then you will need to handle output errors explicitly.

Unconditionally "swallowing STDERR" would not be a good idea because output can fail for reasons other than a broken pipe (filesystem full, file reached a size limit, network file became inaccessible, ...). This is a (somewhat rough-and-ready) attempt at an acceptable way to do it:

function extractCommentParams
{
    local -r key=$1

    local -r preKeyRx="^[[:space:]]*#[[:space:]]*"
    local -r postKeyRx="[[:space:]]*[=:][[:space:]]*(.*[^[:space:]])[[:space:]]*\$"

    exec 3>&1
    local line
    local -x LC_ALL=C
    local printf_stderr
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ $preKeyRx"$key"$postKeyRx ]]; then
            if ! printf_stderr=$(printf '%s\n' "${BASH_REMATCH[1]}" 2>&1 1>&3); then
                [[ $printf_stderr == *'Broken pipe' ]]  \
                    || printf '%s\n' "$printf_stderr" >&2
                exec 3>&-
                return 1
            fi
        fi
    done
    exec 3>&-
}
  • exec 3>&1 associates file descriptor 3 with the standard output of the function, so it can be accessed from the command substitution below. This is dangerous because it may clash with use of file descriptor 3 elsewhere in the real code. Bash 4.1 and later can pick an unused file descriptor automatically (exec {fd}>&1), but I assume this code has to work with the standard Bash (version 3) on macOS.
  • exec 3>&- later in the code closes file descriptor 3.
  • local -x LC_ALL=C sets the locale in the function, and everything called from it, to the C/POSIX locale. This is to try to ensure that error messages will be consistent (ASCII text, English) on all systems so pattern matching against them has some chance of working.
  • printf ... 2>&1 1>&3 redirects the standard error of printf so it is captured by $(...) and redirects the standard output to the standard output of the function.
  • The following code prints the error message output by printf unless it refers to a broken pipe.
  • The return 1 is there because functions should always return non-zero values in case of error. Non-zero exit status can make a difference even within a pipeline because set -o pipefail may be in effect, in which case the pipeline status is taken from the last (rightmost) stage to exit with non-zero status. Even if pipefail is not used, the exit statuses of pipeline stages can be inspected in the PIPESTATUS array variable.
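The effect of that return value on a pipeline can be sketched as follows. failingStage is a hypothetical stand-in for a function that returns 1 after a write error; it is not the function above.

```shell
#!/bin/bash

# Hypothetical first stage that reports failure, as the function above
# does after a write error.
failingStage() {
    return 1
}

# Without pipefail, $? only reflects the last stage...
failingStage | cat
plain=$?

# ...but PIPESTATUS records every stage. Copy it immediately, because
# the next command overwrites it.
failingStage | cat
statuses=("${PIPESTATUS[@]}")

# With pipefail, a failing stage decides the pipeline status.
set -o pipefail
strict=0
failingStage | cat || strict=$?
set +o pipefail

printf 'plain=%s strict=%s per-stage=%s\n' "$plain" "$strict" "${statuses[*]}"
# prints plain=0 strict=1 per-stage=1 0
```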
pjh
  • First, huge thanks for taking such a long time to write out such a detailed answer. Love people like you on here! As for the basic points, this is just bash on macOS with all the defaults. Nothing fancy. And the code above is complete. I could never get trapping of SIGPIPE to work so I just swallowed it in my output, then checked if the result of the echo was non-zero and manually bailed. Seems to work, although I'm not crazy about it. Also saw SIGPIPE traps aren't instant and some may still flow through, but not being able to get them working at all, I can't test that theory. – Mark A. Donohoe Mar 20 '23 at 01:02
  • One other thing, I'm pretty sure it's not the shell doing something unexpected because if you just comment out the last line, replacing it with `ls -l` and piping that through `head` in the same shell, it works fine. Only thing I can think of is when doing that, it's spawning a new process whereas I'm using a function defined immediately above it that *doesn't* spawn a new shell. Perhaps I should move my function to its own script, then shell out to it like any other command. Just a thought (although my swallowing `echo` above does seem to work just fine.) – Mark A. Donohoe Mar 20 '23 at 01:08
  • Aaah! Just noticed your subshell comments. Didn't know you could do that! I'll try that. :) – Mark A. Donohoe Mar 20 '23 at 01:11
  • @MarkA.Donohoe, what you are seeing with `ls -l | head -1` doesn't rule out the shell doing something wrong. With `ls -l | head -1` the `SIGPIPE` signal is delivered to the `ls` process. With `func ... | head -1`, where `func` is a shell function that uses `echo` or `printf` to generate output, the signal is delivered to a shell process because `echo` and `printf` are shell builtins. If subprocesses are not affected by `SIGPIPE` handling problems then you could potentially solve your problem by running the external `echo` or `printf`. E.g. use `env echo ...` instead of `echo ...`. – pjh Mar 20 '23 at 02:20
  • Alas, the subshell didn't work on macOS. As you have it, it won't even run throwing syntax errors, but even adjusting it to `function extractCommentParams() ( ... )` still gives the pipe error. Tried adding all manners of trap statements too, but to no avail. Just went back to my swallowing it like I have in the update. Not happy about that, but it works. – Mark A. Donohoe Mar 20 '23 at 02:21
  • It would be interesting to know the unmodified status of `SIGPIPE` signal handling in the failing code. If you put `trap -p SIGPIPE` immediately after the shebang line, what (if anything) does it output? – pjh Mar 20 '23 at 02:25
  • Apologies for the syntax error. I neglected to test the function definition with Bash 3 or 4. It turns out that it only works with Bash 5. Thinking about it, I'm surprised that it works even there. Your fixed version should work on all versions. So should `function extractCommentParams { (...); }`. – pjh Mar 20 '23 at 02:35