How to create a function for split that uses only awk and accepts any string as input and as delimiter?

Question

The goal is a function that splits strings using only awk and accepts any string as the delimiter and any string as the input.

There are many proposals (see this example) for splitting strings with bash commands, but all of them work only in specific cases and do not meet this requirement.

We present our own code as an example. It works for our cases, but there are several points (marked PROBLEM in the comments) that we think can be improved or corrected.

Example function (f_split)

F_PRESERVE_BLANK_LINES_R=""
f_preserve_blank_lines() {
    : 'Remove "single quotes" used to prevent blank lines being erroneously removed.

    The "single quotes" are used at the beginning and end of the strings to prevent
    blank lines with no other characters in the sequence being erroneously removed.
    We do not know the reason for this side effect. This problem occurs, for example,
    in commands that involve "awk".

    Args:
        STR_TO_TREAT_P (str): String to be treated.

    Returns:
        F_PRESERVE_BLANK_LINES_R (str): String treated.
    '

    F_PRESERVE_BLANK_LINES_R=""
    STR_TO_TREAT_P=$1
    STR_TO_TREAT_P=${STR_TO_TREAT_P%?}
    F_PRESERVE_BLANK_LINES_R=${STR_TO_TREAT_P#?}
}

F_SPLIT_R=()
f_split() {
    : 'It does a "split" into a given string and returns an array.

    Args:
        TARGET_P (str): Target string to "split".
        DELIMITER_P (Optional[str]): Delimiter used to "split". If not informed the
    split will be done by spaces.

    Returns:
        F_SPLIT_R (array): Array with the provided string separated by the informed
    delimiter.
    '

    F_SPLIT_R=()
    TARGET_P=$1
    DELIMITER_P=$2
    if [ -z "$DELIMITER_P" ] ; then
        DELIMITER_P=" "
    fi

    REMOVE_N=1
    if [ "$DELIMITER_P" == "\n" ] ; then
        REMOVE_N=0
    fi

    # PROBLEM: This was the only parameter that has been a problem so far... There are
    # probably others. Maybe a scheme using "sed" would solve the problem...
    if [ "$DELIMITER_P" == "./" ] ; then
        DELIMITER_P="[.]/"
    fi

    if [ ${REMOVE_N} -eq 1 ] ; then

        # PROBLEM: Due to certain limitations we have some problems getting the output
        # of a split by awk inside an array and so we need to use "line break" (\n)
        # to succeed. Seen this, we remove the line breaks momentarily afterwards
        # we reintegrate them. The problem is that if there is a line break in the
        # "string" informed, this line break will be lost, that is, it is erroneously
        # removed in the output...
        TARGET_P=$(awk 'BEGIN {RS="dn"} {gsub("\n", "3F2C417D448C46918289218B7337FCAF"); printf $0}' <<< "${TARGET_P}")

    fi

    # PROBLEM: The replace of "\n" by "3F2C417D448C46918289218B7337FCAF" results in
    # more occurrences of "3F2C417D448C46918289218B7337FCAF" than the amount of "\n"
    # that there was originally in the string (one more occurrence at the end of
    # the string). We can not explain the reason for this side effect. The line below
    # corrects this problem...
    TARGET_P=${TARGET_P%????????????????????????????????}

    SPLIT_NOW=$(awk -F "$DELIMITER_P" '{for(i=1; i<=NF; i++){printf "%s\n", $i}}' <<< "${TARGET_P}")

    while IFS= read -r LINE_NOW ; do
        if [ ${REMOVE_N} -eq 1 ] ; then
            LN_NOW_WITH_N=$(awk 'BEGIN {RS="dn"} {gsub("3F2C417D448C46918289218B7337FCAF", "\n"); printf $0}' <<< "'${LINE_NOW}'")

            # PROBLEM: It would be perfect if we didn't need to use the function below...
            f_preserve_blank_lines "$LN_NOW_WITH_N"

            LN_NOW_WITH_N="$F_PRESERVE_BLANK_LINES_R"
            F_SPLIT_R+=("$LN_NOW_WITH_N")
        else
            F_SPLIT_R+=("$LINE_NOW")
        fi
    done <<< "$SPLIT_NOW"
}

Usage

read -r -d '' FILE_CONTENT << 'HEREDOC'
BEGIN
15

It may also be helpful to note (though understandably you had no room to do so) that the -d option to readarray first appears in Bash 4.4. – 
fbicknel
 Aug 18, 2017 at 15:57
4

Great answer (+1). If you change your awk to awk '{ gsub(/,[ ]+|$/,"\0"); print }' ./  and eliminate that concatenation of the final ", " then you don't have to go through the gymnastics on eliminating the final record. So: readarray -td '' a < <(awk '{ gsub(/,[ ]+/,"\0"); print; }' <<<"$string") on Bash that supports readarray. Note your method is Bash 4.4+ I think because of the -d in readarray – 
dawg
 Nov 26, 2017 at 22:28 
10

Wow, what a brilliant answer! Hee hee, my response: ditched the bash script and fired up python! – 
artfulrobot
 May 14, 2018 at 11:32
11

I'd move your right answers up to the top, I had to scroll through a lot of rubbish to find out how to do it properly :-) – 
paxdiablo
 Jan 9, 2020 at 12:31
44

This is exactly the kind of thing that will convince you to never code in bash. An astoundingly simple task that has 8 incorrect solutions. Btw, this is without a design constraint of, "Make it as obscure and finicky as possible"
END
HEREDOC
FILE_CONTENT="${FILE_CONTENT:6:-3}"
DELIMITER_P="int }' ./  and eliminate"
f_split "$FILE_CONTENT" "$DELIMITER_P"
LENGTH=${#F_SPLIT_R[*]}
for ((i=0;i<=$(($LENGTH-1));i++)); do
    echo ">>>>>>>>>>"
    echo "${F_SPLIT_R[$i]}"
    echo "<<<<<<<<<<"
done

`awk` is a programming language, and `sed` is also one if the criterion is being Turing complete https://catonmat.net/proof-that-sed-is-turing-complete . The `perl` and `python` commands are available out of the box in numerous modern Linux distros. You should be explicit about which tools are allowed and which are not, that is, furnish either a **complete** list of tools which are not verboten or a **complete** list of tools which are verboten — Daweo, Aug 02 '22 at 06:51
I believe that a function with the characteristics pointed out would help a lot of people who use BASH. This split process with basic BASH features is always a drama... — Eduardo Lucio, Aug 02 '22 at 17:01
Anyway, here's my contribution. The above code can and should be used unrestrictedly. — Eduardo Lucio, Aug 02 '22 at 17:04
@Daweo Thanks so much for the info! I didn't know! If you get a solution using sed and awk, for example, it will be perfect, **feel free, any contribution is valid**. We have scenarios with very limited containers, so these are components that we know will always exist. — Eduardo Lucio, Aug 02 '22 at 17:12
@tripleee License has been changed to CC BY-SA 4.0 . https://creativecommons.org/licenses/by-sa/4.0/ . — Eduardo Lucio, Aug 02 '22 at 17:25
`WE ARE LOOKING FOR TWO SOLUTIONS (ANSWERS): Solutions to the identified problems;` What is the problem? `Or any other BASH implementation that meets the requirements.` Third-party code recommendations are off-topic for Stack Overflow. I do not understand why you are writing here. — KamilCuk, Aug 02 '22 at 17:33
`PROBLEM: This was the only parameter that has been a problem so far.` ? What is the actual "problem" here? In the spirit of this forum, please ask a question, like "how to ... do something"? I do not understand - how exactly do you want to split the input? What are the rules for splitting? `universal" function for splitting strings, that is, it accepts any string as a delimiter and any string as an input ("target"` Could you post some _simple_ input&output examples? `we don't want to use any programming language` Both sed and awk are _external_ programming languages. Just use python. — KamilCuk, Aug 02 '22 at 17:35
@KamilCuk The items pointed out cause performance problems in the code and end up being workarounds. So it's not a request for a recommendation, but for a fix to problems for which we couldn't find a solution. Thanks! — Eduardo Lucio, Aug 02 '22 at 17:41
Just glancing at your code; you cannot set `RS` to a multi-byte string if you want to have code that is "universal". (See https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html#tag_20_06_13_03, note "If RS contains more than one character, the results are unspecified.") — William Pursell, Aug 02 '22 at 17:41
`generate performance problems` The performance problem is that you are forking `awk` in the loop. Just write it mostly in awk if you intend to use it. — KamilCuk, Aug 02 '22 at 17:41
@KamilCuk So how could I get awk to return this to a bash array without the problems we're dealing with? That could be an answer. — Eduardo Lucio, Aug 02 '22 at 17:44
You output a zero separated stream from awk and use `mapfile` to read the stream into bash array. — KamilCuk, Aug 02 '22 at 17:45

KamilCuk · Answer 1 · 2022-08-02T18:26:02.590

"universal" function for splitting strings, that is, it accepts any string as a delimiter and any string as an input ("target")

A solution that uses a zero byte as the separator and should work most of the time:

mysplit() {
   mapfile -t -d '' "$2" < <(sed "s/$1/\x00/g")
}
mysplit SEPARATOR result <<<"string SEPARATOR anotherstring SEPARATORstring"
declare -p result

Outputs:

declare -a result=([0]="string " [1]=" anotherstring " [2]=$'string\n')
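The `\n` at the end of the last element is not produced by the split itself: a bash here-string (`<<<`) appends a newline to whatever it feeds to the command. A minimal sketch that makes this visible by counting bytes (the variable names are only illustrative):

```shell
#!/usr/bin/env bash
s="abc"

# A here-string sends the string plus one extra trailing newline.
here_len=$(wc -c <<< "$s")

# printf '%s' sends exactly the bytes of the string.
plain_len=$(printf '%s' "$s" | wc -c)

# $(( )) strips the padding some wc implementations add around the number.
echo "here-string bytes: $((here_len))"
echo "printf bytes:      $((plain_len))"
```

Feeding the input with `printf '%s'` instead of `<<<` is one way to avoid the extra newline when it matters.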

Users have to be aware that the separator is passed to sed. This is convenient, because it allows regular expressions and sed escapes; to have the separator treated literally, you can add the code from Escape a string for a sed replace pattern.
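If the delimiter must always be taken literally, escaping can be avoided altogether by letting awk search with index(), which does a plain substring search rather than a regex match. The sketch below is one possible shape under stated assumptions, not a drop-in replacement for the code in this post: split_literal and SPLIT_LITERAL_R are hypothetical names, the delimiter is passed through the environment (ENVIRON) so awk does not interpret backslash escapes in it, and each piece is sent back to bash on its own line with backslashes doubled and newlines written as \n, then decoded with printf '%b'. It assumes the input holds no NUL bytes, which bash strings cannot hold anyway.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split TARGET on a literal DELIMITER.
# The pieces are left in the global array SPLIT_LITERAL_R.
SPLIT_LITERAL_R=()
split_literal() {
    local target=$1 delim=$2 line piece
    [ -n "$delim" ] || delim=" "      # same default as f_split
    SPLIT_LITERAL_R=()
    while IFS= read -r line; do
        # Undo the encoding done by awk: "\\" -> "\", "\n" -> newline.
        printf -v piece '%b' "$line"
        SPLIT_LITERAL_R+=("$piece")
    done < <(
        DELIM=$delim awk '
            # Print one piece per line, encoded so it never holds a raw newline.
            function emit(x,    n, i, parts, out) {
                gsub(/\\/, "&&", x)            # double every backslash
                n = split(x, parts, "\n")      # cut at real newlines ...
                out = parts[1]
                for (i = 2; i <= n; i++)
                    out = out "\\n" parts[i]   # ... and rejoin with backslash-n
                print out
            }
            BEGIN {
                d = ENVIRON["DELIM"]; dl = length(d)
                # Read all of stdin; the newline added by the here-string is dropped.
                s = ""; sep = ""
                while ((getline line) > 0) { s = s sep line; sep = "\n" }
                # index() is a literal search, so regex characters need no escaping.
                while ((p = index(s, d)) > 0) {
                    emit(substr(s, 1, p - 1))
                    s = substr(s, p + dl)
                }
                emit(s)
            }
        ' <<< "$target"
    )
}

# Usage: the delimiter "./" contains a regex metacharacter and is matched as-is.
split_literal $'one./two\n./three' "./"
declare -p SPLIT_LITERAL_R
```

A single awk process handles the whole input, so nothing is forked inside the loop; only the bash builtins read and printf run per piece. The delimiter travels through the environment, so very long delimiters may hit the system's limit on environment size.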

The following code compares mysplit, using a very crude escaping of the pattern for sed, against the presented f_split on the presented input:

F_PRESERVE_BLANK_LINES_R=""
f_preserve_blank_lines() {
    : 'Remove "single quotes" used to prevent blank lines being erroneously removed.

    The "single quotes" are used at the beginning and end of the strings to prevent
    blank lines with no other characters in the sequence being erroneously removed.
    We do not know the reason for this side effect. This problem occurs, for example,
    in commands that involve "awk".

    Args:
        STR_TO_TREAT_P (str): String to be treated.

    Returns:
        F_PRESERVE_BLANK_LINES_R (str): String treated.
    '

    F_PRESERVE_BLANK_LINES_R=""
    STR_TO_TREAT_P=$1
    STR_TO_TREAT_P=${STR_TO_TREAT_P%?}
    F_PRESERVE_BLANK_LINES_R=${STR_TO_TREAT_P#?}
}

F_SPLIT_R=()
f_split() {
    : 'It does a "split" into a given string and returns an array.

    Args:
        TARGET_P (str): Target string to "split".
        DELIMITER_P (Optional[str]): Delimiter used to "split". If not informed the
    split will be done by spaces.

    Returns:
        F_SPLIT_R (array): Array with the provided string separated by the informed
    delimiter.
    '

    F_SPLIT_R=()
    TARGET_P=$1
    DELIMITER_P=$2
    if [ -z "$DELIMITER_P" ] ; then
        DELIMITER_P=" "
    fi

    REMOVE_N=1
    if [ "$DELIMITER_P" == "\n" ] ; then
        REMOVE_N=0
    fi

    # PROBLEM: This was the only parameter that has been a problem so far... There are
    # probably others. Maybe a scheme using "sed" would solve the problem...
    if [ "$DELIMITER_P" == "./" ] ; then
        DELIMITER_P="[.]/"
    fi

    if [ ${REMOVE_N} -eq 1 ] ; then

        # PROBLEM: Due to certain limitations we have some problems getting the output
        # of a split by awk inside an array and so we need to use "line break" (\n)
        # to succeed. Seen this, we remove the line breaks momentarily afterwards
        # we reintegrate them. The problem is that if there is a line break in the
        # "string" informed, this line break will be lost, that is, it is erroneously
        # removed in the output...
        TARGET_P=$(awk 'BEGIN {RS="dn"} {gsub("\n", "3F2C417D448C46918289218B7337FCAF"); printf $0}' <<< "${TARGET_P}")

    fi

    # PROBLEM: The replace of "\n" by "3F2C417D448C46918289218B7337FCAF" results in
    # more occurrences of "3F2C417D448C46918289218B7337FCAF" than the amount of "\n"
    # that there was originally in the string (one more occurrence at the end of
    # the string). We can not explain the reason for this side effect. The line below
    # corrects this problem...
    TARGET_P=${TARGET_P%????????????????????????????????}

    SPLIT_NOW=$(awk -F "$DELIMITER_P" '{for(i=1; i<=NF; i++){printf "%s\n", $i}}' <<< "${TARGET_P}")

    while IFS= read -r LINE_NOW ; do
        if [ ${REMOVE_N} -eq 1 ] ; then
            LN_NOW_WITH_N=$(awk 'BEGIN {RS="dn"} {gsub("3F2C417D448C46918289218B7337FCAF", "\n"); printf $0}' <<< "'${LINE_NOW}'")

            # PROBLEM: It would be perfect if we didn't need to use the function below...
            f_preserve_blank_lines "$LN_NOW_WITH_N"

            LN_NOW_WITH_N="$F_PRESERVE_BLANK_LINES_R"
            F_SPLIT_R+=("$LN_NOW_WITH_N")
        else
            F_SPLIT_R+=("$LINE_NOW")
        fi
    done <<< "$SPLIT_NOW"
}

read -r -d '' FILE_CONTENT << 'HEREDOC'
BEGIN
15

It may also be helpful to note (though understandably you had no room to do so) that the -d option to readarray first appears in Bash 4.4. – 
fbicknel
 Aug 18, 2017 at 15:57
4

Great answer (+1). If you change your awk to awk '{ gsub(/,[ ]+|$/,"\0"); print }' ./  and eliminate that concatenation of the final ", " then you don't have to go through the gymnastics on eliminating the final record. So: readarray -td '' a < <(awk '{ gsub(/,[ ]+/,"\0"); print; }' <<<"$string") on Bash that supports readarray. Note your method is Bash 4.4+ I think because of the -d in readarray – 
dawg
 Nov 26, 2017 at 22:28 
10

Wow, what a brilliant answer! Hee hee, my response: ditched the bash script and fired up python! – 
artfulrobot
 May 14, 2018 at 11:32
11

I'd move your right answers up to the top, I had to scroll through a lot of rubbish to find out how to do it properly :-) – 
paxdiablo
 Jan 9, 2020 at 12:31
44

This is exactly the kind of thing that will convince you to never code in bash. An astoundingly simple task that has 8 incorrect solutions. Btw, this is without a design constraint of, "Make it as obscure and finicky as possible"
END
HEREDOC
FILE_CONTENT="${FILE_CONTENT:6:-3}"
DELIMITER_P="int }' ./  and eliminate"
f_split "$FILE_CONTENT" "$DELIMITER_P"
echo "f_split result:"
declare -p F_SPLIT_R


mysplit() {
   mapfile -t -d '' "$2" < <(sed "s/$1/\x00/g")
}

mysplit "$(printf "%s" "$DELIMITER_P" | sed 's~[\./]~\\&~g')" F_SPLIT_R2 < <(printf "%s" "$FILE_CONTENT")
echo "mysplit result:"
declare -p F_SPLIT_R2

echo "difference:"
diff <(printf "%s\n" "${F_SPLIT_R[@]}") <(printf "%s\n" "${F_SPLIT_R2[@]}") && echo none

And outputs:

f_split result:
declare -a F_SPLIT_R=([0]=$'15\n\nIt may also be helpful to note (though understandably you had no room to do so) that the -d option to readarray first appears in Bash 4.4. – \nfbicknel\n Aug 18, 2017 at 15:57\n4\n\nGreat answer (+1). If you change your awk to awk \'{ gsub(/,[ ]+|$/,"\\0"); pr' [1]=$' that concatenation of the final ", " then you don\'t have to go through the gymnastics on eliminating the final record. So: readarray -td \'\' a < <(awk \'{ gsub(/,[ ]+/,"\\0"); print; }\' <<<"$string") on Bash that supports readarray. Note your method is Bash 4.4+ I think because of the -d in readarray – \ndawg\n Nov 26, 2017 at 22:28 \n10\n\nWow, what a brilliant answer! Hee hee, my response: ditched the bash script and fired up python! – \nartfulrobot\n May 14, 2018 at 11:32\n11\n\nI\'d move your right answers up to the top, I had to scroll through a lot of rubbish to find out how to do it properly :-) – \npaxdiablo\n Jan 9, 2020 at 12:31\n44\n\nThis is exactly the kind of thing that will convince you to never code in bash. An astoundingly simple task that has 8 incorrect solutions. Btw, this is without a design constraint of, "Make it as obscure and finicky as possible"\n')
mysplit result:
declare -a F_SPLIT_R2=([0]=$'15\n\nIt may also be helpful to note (though understandably you had no room to do so) that the -d option to readarray first appears in Bash 4.4. – \nfbicknel\n Aug 18, 2017 at 15:57\n4\n\nGreat answer (+1). If you change your awk to awk \'{ gsub(/,[ ]+|$/,"\\0"); pr' [1]=$' that concatenation of the final ", " then you don\'t have to go through the gymnastics on eliminating the final record. So: readarray -td \'\' a < <(awk \'{ gsub(/,[ ]+/,"\\0"); print; }\' <<<"$string") on Bash that supports readarray. Note your method is Bash 4.4+ I think because of the -d in readarray – \ndawg\n Nov 26, 2017 at 22:28 \n10\n\nWow, what a brilliant answer! Hee hee, my response: ditched the bash script and fired up python! – \nartfulrobot\n May 14, 2018 at 11:32\n11\n\nI\'d move your right answers up to the top, I had to scroll through a lot of rubbish to find out how to do it properly :-) – \npaxdiablo\n Jan 9, 2020 at 12:31\n44\n\nThis is exactly the kind of thing that will convince you to never code in bash. An astoundingly simple task that has 8 incorrect solutions. Btw, this is without a design constraint of, "Make it as obscure and finicky as possible"\n')
difference:
none

I did some tests here and saw that the proposed function does not show the expected results. Look at the "USAGE" section and replace `f_split "$FILE_CONTENT" "$DELIMITER_P"` with your code. Note that the output of *f_split* gives different results. Also note "it accepts any string as a delimiter and any string as an input". — Eduardo Lucio, Aug 02 '22 at 18:10
I do not understand. The results are exactly the same, minus `BEGIN` and `END`. The only difference I see is the trailing newline, but it comes from the `<<<` input. `Look at the "USAGE" section` I think shuffling parameters is simple enough for others to do. `Also note` But your code is incapable of handling zero bytes. To what are you referring? — KamilCuk, Aug 02 '22 at 18:16
You need to escape the parameter in *DELIMITER P* (sed) in the call `mysplit "$DELIMITER_P" F_SPLIT_R <<< "$FILE_CONTENT"` and we are having two trailing newlines. — Eduardo Lucio, Aug 02 '22 at 18:25
The solution for the trailing newline can be obtained with this line `F_SPLIT_R2[-1]=${F_SPLIT_R2[-1]%?}`. — Eduardo Lucio, Aug 02 '22 at 19:41
I'm trying to adapt your function so that we can call it like this `mysplit "$DELIMITER_P" "$FILE_CONTENT"`. How could we do this? — Eduardo Lucio, Aug 02 '22 at 19:43
Associative arrays are a Bash 4 feature, so code which uses this feature will not be "universal" enough to run e.g. on MacOS out of the box (which still ships with Bash 3 for licensing and probably other less sensible reasons). — tripleee, Aug 03 '22 at 06:49
