Here is a bash3 associative array hack which satisfies:

  • pure bash (no fork/exec)
  • no subshell
  • constant time fetch (in terms of number of keys)
  • any metachars allowed in keys

The basic idea is to encode keys by substituting non-identifier chars with their hex value, and then use the sanitized key (with a name prefix) as a bash local var, leveraging the constant time bash name lookup.

enc-key()
{
    local key="${1}"  varname="${2:-_rval}"  prefix="${3:-_ENCKEY_}"
    local i  converted
    local -a enc_parts  convert
    local re='^([[:alnum:]]*)([^[:alnum:]]+)(.*)'
    local nonalnum
    enc_parts+=( "${prefix}" )
    while [[ $key =~ $re ]]; do
        enc_parts+=( "${BASH_REMATCH[1]}" )
        nonalnum="${BASH_REMATCH[2]}"
        key="${BASH_REMATCH[3]}"
        convert=()
        for (( i = 0; i < ${#nonalnum}; i++ )); do
            # leading ' in string signals printf to convert char to its numeric code
            convert+=( "'${nonalnum:$i:1}" )
        done
        printf -v converted "_%x" "${convert[@]}"
        enc_parts+=( "${converted}" )
    done
    enc_parts+=( "${key}" )
    printf -v "$varname" "%s" "${enc_parts[@]}"
    echo "DEBUG: final key: ${!varname}"
    return 0
}

To store:

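    local ekey
    enc-key 'my-fine-key!' ekey
    local "${ekey}=value"

Fetch (before exiting function scope of the store):

    enc-key 'some other key' ekey
    fetched="${!ekey}"

Note that the second arg of enc-key is the name of the var into which enc-key will store the sanitized key. That name must differ from enc-key's own locals (key, varname, prefix, i, ...): passing `key` would make printf -v assign to enc-key's local copy and leave the caller's variable empty, which is why the examples use `ekey`.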
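The store and fetch steps above can be sketched end to end as follows. This is a minimal illustration, not the asker's code: `demo_enc` is a hypothetical, simplified stand-in for enc-key, and `demo_store_fetch` and `slot` are names invented here. The result variable is called `slot` so that it does not clash with the encoder's own local `key`.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for enc-key: encodes every non-alphanumeric
# character as _xx and writes the result into the variable named by $2.
demo_enc()
{
    local key="$1" varname="$2" out="_ENCKEY_" ch i
    for (( i = 0; i < ${#key}; i++ )); do
        ch=${key:i:1}
        case "$ch" in
            [a-zA-Z0-9]) ;;
            *) printf -v ch '_%02x' "'$ch" ;;
        esac
        out+=$ch
    done
    printf -v "$varname" '%s' "$out"
}

# Store two values and fetch one back, all inside a single function scope,
# because the "array entries" are ordinary local variables.
demo_store_fetch()
{
    local slot fetched
    demo_enc 'my-fine-key!' slot
    local "${slot}=value one"
    demo_enc 'some other key' slot
    local "${slot}=value two"

    demo_enc 'my-fine-key!' slot
    fetched="${!slot}"
    printf '%s\n' "$fetched"    # prints: value one
}
demo_store_fetch
```

Once `demo_store_fetch` returns, both entries are gone, which is the "before exiting function scope" caveat in practice.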

The question: is there a way to do the encoding that does not involve character based traversal with many re matches along the way? Either some printf magic or var sub voodoo?

  • Bash is (usually, in my opinion) slow; wouldn't just using separate processes be a lot faster? Is the `pure bash (no fork/exec) no subshell` a hard external limit or an optimization to make code faster? – KamilCuk Jun 27 '20 at 21:07
  • KamilCuk - the cost of a single fork/exec is a full order of magnitude greater than that for calling the enc-key func above. – Money Luser Jun 27 '20 at 21:36
  • Is it worth it to keep identifier chars at all? The logic would be simpler if you hex encode everything – that other guy Jun 27 '20 at 23:10
  • Why are you forcing yourself to use `bash` 3? If this is for macOS, install a newer version of `bash` yourself or switch to `zsh`, which has shipped from Apple for many years. Either one supports real associative arrays. – chepner Jun 27 '20 at 23:32
  • that other guy - Interesting idea. Hex encoding everything would triple the length of a key, hence reducing the max key length to 1/3 of what it is now, and also make error messages completely incomprehensible. But it would reduce the encoding to only 1 loop, so I will at least experiment with it. – Money Luser Jun 27 '20 at 23:50
  • chepner - you are correct: I have to support macOS, Linux, and "Windows Subsystem for Linux". Unfortunately I cannot force users to install bash4 or zsh on their systems, so I am constrained to bash 3. – Money Luser Jun 27 '20 at 23:53
  • Do you have bash 3.2 ? OK to use "global replacement" ${var//Pattern/Replacement} ? – dash-o Jun 28 '20 at 02:33
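The single-loop variant floated in the comments (hex-encode every character, identifier-safe or not) could look roughly like the sketch below. The function name `hex_all` and the variable `encoded` are assumptions made for illustration. With two hex digits per character and no separator, the encoded key is twice as long as the original; putting a `_` in front of each pair, as in the `_xx` scheme, would make it three times as long.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: encode every character as two hex digits, so no
# test for "safe" characters is needed and one loop does all the work.
hex_all()
{
    local key="$1" varname="$2" out="_ENCKEY_" hex i
    for (( i = 0; i < ${#key}; i++ )); do
        # A leading ' makes printf yield the character's numeric code.
        printf -v hex '%02x' "'${key:i:1}"
        out+=$hex
    done
    printf -v "$varname" '%s' "$out"
}

hex_all 'a-b' encoded
printf '%s\n' "$encoded"    # prints: _ENCKEY_612d62
```

The trade-off mentioned in the comments is visible in the output: the key is no longer readable in error messages.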

2 Answers

On the surface, the bash3 constraint implies a legacy platform. The no-fork constraint prevents using one of the other scripting tools (awk, Perl, Python, ...) that have much better support for character-level transformations.

One simple alternative is to process all characters in a single loop. This turns out to be faster than using a regex to identify the runs of non-alphanumeric characters that need to be encoded.

On my (bash4) machine, this runs about 2X faster than the original:

enc-key()
{
    local key="${1}"  varname="${2:-_rval}"  prefix="${3:-_ENCKEY_}"
    local i  ch
    local -a enc_parts
    enc_parts+=( "${prefix}" )
    local key_len=${#key}
    for (( i = 0; i < key_len; i++ )); do
        ch=${key:i:1}
        case "$ch" in
            [a-zA-Z0-9]) ;;     # identifier-safe: keep as is
            # leading ' makes printf use the character's numeric code
            *)  printf -v ch '_%02x' "'$ch" ;;
        esac
        enc_parts+=( "$ch" )
    done
    printf -v "$varname" "%s" "${enc_parts[@]}"
    #echo "DEBUG: final key: ${!varname}"
    return 0
}
dash-o

As an alternative, here is a fast solution for both bash3 and bash4 that uses the "global replace" operator, available in bash 3.2 and above. It collects the characters that need escaping and uses a global replace to encode all instances of each one in one go.

On my machine, for 1000 calls to enc-key:

  • Original solution: 0.39 sec
  • Using patterns: 0.19 sec - 52% speed up
  • Using global replace: 0.09 sec - 78% speedup

The interesting thing about the global replace is that its running time is (almost) linear in the number of distinct characters to escape, whereas the pattern solution is linear in the length of the string and the regex solution is linear in the number of characters to escape.

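enc-key()
{
        local key="${1}" varname="${2:-_rval}" prefix="${3:-_ENCKEY_}"
        local ch ch1 unsafe

        # Encode '_' first: it is the escape character itself, so replacing it
        # later would re-encode the '_' introduced by earlier replacements.
        key=${key//_/_5f}
        # Collect the remaining characters that need escaping.
        unsafe=${key//[a-zA-Z0-9_]/}

        while [ -n "$unsafe" ]; do
                ch=${unsafe:0:1}
                # A leading ' makes printf yield the character's numeric code.
                printf -v ch1 '_%02x' "'$ch"
                key=${key//"$ch"/$ch1}
                # Drop every occurrence so each distinct character is handled once.
                unsafe=${unsafe//"$ch"/}
        done

        printf -v "$varname" '%s' "$prefix" "$key"
        #echo "DEBUG: final key: ${!varname}"
        return 0
}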

dash-o
  • awesome! This is exactly the sort of thing I was hoping for from this group :D Unfortunately in testing it is choking on keys with '$' or '%', so I am investigating why and will update. – Money Luser Jun 29 '20 at 08:35
  • @MoneyLuser Can you share the specific key that does not work. Probably quoting issue somewhere. There was only one key to test in the question ... – dash-o Jun 29 '20 at 08:40
  • @MoneyLuser Adding quoting to the global substitution will allow '#' and '%' to work. Let me know if other magic characters need escaping – dash-o Jun 29 '20 at 08:49
  • As a side note, I've cross-posted a similar solution (using '%xx' escapes instead of '_xx') to the https://stackoverflow.com/questions/296536/how-to-urlencode-data-for-curl-command/62619806#62619806 question, which I believe was the basis for your first implementation, emphasizing the performance advantage. The longer the URL, the greater the performance improvement – dash-o Jun 29 '20 at 08:52
  • dash-o - the problematic keys were '#,%,\' and adding quoting fixed the first 2 but '\' still hangs. Note that it fails with infinite loop. I am going to change the while to a for which will always terminate, and then look at your cross-posted solution. – Money Luser Jun 29 '20 at 16:55
  • dash-o - I had not seen Orwellophile's post as I just searched for "bash3 assoc array" and all the answers were n^2 or involved execs. So I wrote the func above but knew the real bash gurus would make it better. Adding quotes around $ch in the global sub fixes the problem with '\' so I will edit the answer above if I have permission and accept this answer although I also like the other one which is more robust. Thank you! – Money Luser Jun 29 '20 at 17:23
  • Added quotes as suggested – dash-o Jul 01 '20 at 04:47
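The problem keys discussed in these comments ('#', '%', '\', '$') lend themselves to a small regression check. The sketch below is an assumption-laden illustration: `enc_key_gr` is a hypothetical, condensed copy of the global-replace idea, and `check` is a helper invented here. Both substitutions quote `$ch`, so glob and backslash characters are matched literally and the loop always shrinks `unsafe`.

```bash
#!/usr/bin/env bash
# Hypothetical condensed encoder based on the global-replace approach.
enc_key_gr()
{
    local key="$1" varname="$2" prefix="${3:-_ENCKEY_}"
    local ch ch1 unsafe
    key=${key//_/_5f}                 # escape the escape character first
    unsafe=${key//[a-zA-Z0-9_]/}
    while [ -n "$unsafe" ]; do
        ch=${unsafe:0:1}
        printf -v ch1 '_%02x' "'$ch"
        key=${key//"$ch"/$ch1}
        unsafe=${unsafe//"$ch"/}      # quoted, so '\' is matched literally
    done
    printf -v "$varname" '%s' "$prefix" "$key"
}

# Hypothetical helper: encode $1 and compare against the expected name $2.
check()
{
    local got
    enc_key_gr "$1" got
    if [ "$got" = "$2" ]; then
        printf 'ok   %s\n' "$got"
    else
        printf 'FAIL %s (wanted %s)\n' "$got" "$2"
        return 1
    fi
}

check '#'   '_ENCKEY__23'
check '%'   '_ENCKEY__25'
check '\'   '_ENCKEY__5c'
check '$'   '_ENCKEY__24'
check 'a b' '_ENCKEY_a_20b'
```

Each line should report `ok`; a hang on the '\' case would point to an unquoted `$ch` in one of the substitutions.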