
I'm looking for a way to categorize data based on filenames that follow a uniform format. The filenames look like 1_dog_yorkshire.sh and 1_cat_persian.sh, and can be described by a simple regex:

[0-9]+_[a-z]+_[a-z]+\.sh

I want to build the tree-like structure shown below:

1 --- dog ---- yorkshire
|  |       \
|  |        -- golden retriever
|  |
|  -- cat ---- persian
|          \
|           -- siamese
|
2 --- spider ---- tarantula

The first solution that comes to mind is a multidimensional associative array. However, multidimensional arrays are not supported in Bash. A hash table is not a perfect solution either, since iterating over one in Bash can be problematic. Using XML/JSON is not an option unless the parser is portable and written in Bash.

In an ideal scenario, any piece of data should be iterable, for example: for each entry in '2', for each dog in '1', or for each element of the tarantula list that is in spider in '2'.

How can I build a structure which is an adequate substitute for multidimensional associative arrays in Bash, for which subtrees can be traversed and leaves can store lists?

Charles Duffy
zuberuber
  • Arguably this is duplicative of the questions about multi-dimensional arrays -- if a multidim array can be represented, after all, then an arbitrary tree surely can as well, as a multi-dimensional array can be represented as a subset of the arbitrary-tree case. Thus, if the answer to "can a 2d array be represented?" is "no", then the answer to this question is surely negative as well (as is in fact the case). – Charles Duffy Aug 31 '15 at 22:58
  • This might have been a better question, by the way, if you'd focused on your actual use cases. It's possible, for instance, to iterate over a list of variables having a given prefix, so if your goal is to walk a subtree, you could do this (however hackily) by creating environment variables with names mapped from your filenames (under a selected prefix to isolate a namespace) and asking bash to list variable names with a prefix associated with the subtree in question. – Charles Duffy Aug 31 '15 at 23:03
  • (re: mapping from keys to a list -- I can think of two ways to do that, both awful hacks: Maintain a separate associative array from key names / filenames to a printf %q-generated eval-safe string with the array's contents, or a separate associative array variable per name; bash 4.3 namerefs could be used to make some of the code involved in implementing this a touch less dangerous). – Charles Duffy Aug 31 '15 at 23:08
  • ...but really, you're better off, say, storing the content in JSON and calling jq to query it, or XML and using xmlstarlet, &c; that way you have real, native nested data structures. – Charles Duffy Aug 31 '15 at 23:08
  • @CharlesDuffy Thanks for the feedback. However, I think the hacks will work only if the filenames are unique. The XML/JSON solution is good, but I can't use it for the same reason I can't use python/perl: there is no guarantee that it will be available on all the servers the script will run on. – zuberuber Aug 31 '15 at 23:17
  • I can't speak to how duplicative names would impact the various proposed hacks without knowing what behavior you actually *want* in that case. It's not clear to me that your question states what behavior you'd have there even in your ideal scenario. – Charles Duffy Aug 31 '15 at 23:20
  • By the way -- what guarantees *do* you have about platform? If you don't know you'd have perl or python, how do you know you have a bash new enough to have associative arrays? – Charles Duffy Aug 31 '15 at 23:24
  • @CharlesDuffy The only platform guarantee is hardened RHEL6/7, on which some executables such as python may be absent or not executable. In an ideal scenario I'd like to be able to iterate over any piece of data in this tree-like structure, e.g. for each dog that is in '1' or for every entry in '2'. – zuberuber Aug 31 '15 at 23:30
  • Sounds doable to me, albeit (as aforementioned) via ugly hacks. Why don't you edit the question to make it more specific on what you're actually trying to accomplish, rather than focusing on the data structures you'd prefer to use? – Charles Duffy Aug 31 '15 at 23:34
  • BTW, you might also -- in editing -- explain why you want to do this in-memory rather than by querying the filesystem, which is certainly going to be the cleaner approach. – Charles Duffy Aug 31 '15 at 23:34
  • @CharlesDuffy I have rephrased the question more clearly. Is this sufficient for unmarking it as a duplicate? – zuberuber Aug 31 '15 at 23:57
  • I updated the title as well. Do you agree that that update is a fit for your intent? – Charles Duffy Sep 01 '15 at 00:37
  • BTW, did you answer the question about why you want to process data in memory rather than on-disk? – Charles Duffy Sep 01 '15 at 00:38
  • Also, what's the minimum version of bash this needs to work with? If it's clear to support only 4.3+, that makes things easier. – Charles Duffy Sep 01 '15 at 01:03
  • Built the insane hacky thing. Enjoy. :) – Charles Duffy Sep 01 '15 at 01:23

1 Answer


The below is a hack, but... well, that was already known. :)

Let's start with setting up a test dataset:

for f in 1_{dog_{yorkshire,"golden retriever"},cat_{persian,siamese}}.sh \
         2_spider_tarantula.sh; do
  echo "$f" >"$f"
done

We can then establish a shell variable per file, holding an array with that file's contents:

# encode name to be a valid shell variable
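translate_name() {
  local -a components
  local component val val_eqs val_eqs_count val_no_eqs retval

  IFS=_ read -r -a components <<<"$1"
  for component in "${components[@]}"; do
    # base64-encode the component; '=' padding is not valid in a variable
    # name, so strip it and record how many padding characters there were.
    # (base64 with no options reads stdin, as GNU coreutils on RHEL expects)
    val=$(printf '%s' "$component" | base64)
    val_eqs=${val//[!=]/}
    val_eqs_count=${#val_eqs}
    val_no_eqs=${val//=/}
    printf -v retval '%s%s_%s__' "$retval" "$val_no_eqs" "$val_eqs_count"
  done
  printf '%s\n' "${retval%__}"
}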

for f in *.sh; do
  varname=$(translate_name "${f%.sh}")
  mapfile -t "CONTENT_$varname" <"$f"
done
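
To make the encoding concrete, here is a minimal sketch of what happens to a single name component. The helper name encode_component is hypothetical (it is not part of the code above) and it assumes a base64 that reads stdin when given no arguments:

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only: encode one name component the
# same way translate_name does (base64 body, then the count of '=' padding).
encode_component() {
  local val eqs
  val=$(printf '%s' "$1" | base64)
  eqs=${val//[!=]/}                      # keep only the padding characters
  printf '%s_%s\n' "${val//=/}" "${#eqs}"
}

encode_component 1      # prints MQ_2    (base64 of "1" is "MQ==")
encode_component cat    # prints Y2F0_0  (base64 of "cat" is "Y2F0")
```

Joined with a double underscore and the CONTENT_ prefix, the file 1_cat_persian.sh would therefore be stored under a variable whose name starts with CONTENT_MQ_2__Y2F0_0__.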

So, then -- let's say you want to walk a subtree.

You can list the array variables associated with that subtree:

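get_subtree_vars() {
  local varname

  varname=CONTENT_$(IFS=_; translate_name "$*")
  # for the names in this dataset the encoded name contains only letters,
  # digits, and underscores, so it can be spliced into the prefix expansion
  eval 'printf "%s\n" "${!'"$varname"'@}"'
}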
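
The prefix expansion that the function relies on can be seen in isolation in the sketch below. The TREE_ variable names are made up for the demonstration and are readable stand-ins for the encoded CONTENT_ names:

```shell
#!/usr/bin/env bash
# Hypothetical variables sharing a prefix; the names are illustrative only.
TREE_1_dog=yorkshire
TREE_1_cat=persian
TREE_2_spider=tarantula

# "${!TREE_1_@}" expands to the names of all variables starting with TREE_1_
for name in "${!TREE_1_@}"; do
  printf '%s=%s\n' "$name" "${!name}"    # ${!name} is indirect expansion
done
```

Only the two TREE_1_ variables are listed; TREE_2_spider is outside the prefix and is skipped.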

...and convert them back to keys:

# given an encoded variable name, return its original name
# inverse of translate_name
get_name() {
  local varname section retval val val_eqs_count val_no_eqs i
  local -a sections
  for varname; do
    retval=
    varname=${varname#CONTENT_}
    varname=${varname//__/ }
    IFS=' ' read -r -a sections <<<"$varname"
    for section in "${sections[@]}"; do
      val_eqs_count=${section##*_}
      val_no_eqs=${section%_*}
      val=$val_no_eqs
      # restore the '=' padding stripped by translate_name
      for (( i=0; i<val_eqs_count; i++ )); do
        val+="="
      done
      # GNU base64 (as on RHEL) decodes with -d/--decode; -D is the macOS spelling
      retval+=$(base64 --decode <<<"$val")_
    done
    printf '%s\n' "${retval%_}"
  done
}

...and retrieve their values:

# given an encoded name, retrieve a NUL-delimited list of values stored
# this could be done much more safely with bash 4.3+ using namerefs
get_values() {
  local name cmd
  local -a values
  for name; do
    [[ $name = CONTENT_* ]] || name=CONTENT_$name
    printf -v cmd 'values=( "${%q[@]}" )' "$name" && eval "$cmd"
    # emit nothing for an empty or unset array, rather than a lone NUL
    (( ${#values[@]} )) && printf '%s\0' "${values[@]}"
  done
}

# given a name, call a function for each leaf value associated
call_for_each() {
  local funcname=$1; shift
  local subtree_var value
  while IFS= read -u 3 -r subtree_var; do
    while IFS= read -u 4 -r -d '' value; do
      "$funcname" "$value"
    done 4< <(get_values "$subtree_var")
  done 3< <(get_subtree_vars "$@")
}

Thus:

printfunc() { printf '%q\n' "$@"; }
call_for_each printfunc 1 cat

...will emit:

1_cat_siamese.sh
1_cat_persian.sh

Notably, these are the data, not the metadata: note the .sh extensions, which we stripped from the variable names on creation.

As another note: The eval use in the code above should be safe from escape attempts (and thus shell injection attacks via malicious filenames) on account of the use of base64-encoding to sanitize any attempted shell escapes which might be present in filenames; the printf %q use provides an additional layer. Be careful deploying the methods above in any scenario where these guarantees aren't present.
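
For comparison, here is a sketch of how the value lookup could avoid eval by using a nameref. It assumes bash 4.3 or newer, which RHEL 6 and 7 do not ship by default, so it is shown only as an illustration; the function name get_values_nameref and the CONTENT_demo array are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch only: requires bash 4.3+ for `local -n` (namerefs).
get_values_nameref() {
  local name
  for name; do
    [[ $name = CONTENT_* ]] || name=CONTENT_$name
    local -n values_ref=$name          # values_ref now aliases the named array
    if (( ${#values_ref[@]} )); then
      printf '%s\0' "${values_ref[@]}" # NUL-delimited, like get_values
    fi
    unset -n values_ref                # drop the alias before the next name
  done
}

CONTENT_demo=( "first line" "second line" )   # hypothetical stored content
get_values_nameref demo | tr '\0' '\n'
```

Because the variable name is never parsed as code, a name that is not a valid identifier makes the nameref declaration fail instead of being executed.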


All that said -- by reading content into memory, the above is making things really unnecessarily complex. Consider as an alternative to the above example the following self-contained code:

get_subtree_files() {
  local prefix
  local -a files
  prefix=$(IFS=_; printf '%s\n' "$*")
  # plain prefix match: "1 cat" would also match a file such as 1_catfish_x.sh
  files=( "$prefix"* )

  # note that the test only checks the first entry of the array
  # ...but that's good enough to detect the no-matches case.
  [[ -e ${files[0]} ]] && printf '%s\0' "${files[@]}"
}

xargs -0 cat < <(get_subtree_files 1 cat)
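
The same filesystem-based idea can also cover the "for each dog in '1'" style of iteration from the question. The sketch below is one possible way to list the distinct next-level keys under a subtree; the function name list_children is hypothetical, and it assumes the N_kind_name.sh naming scheme and bash 4.0+ for the associative array:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the distinct next-level keys below a subtree,
# reading them straight from the filenames in the current directory.
list_children() {
  local prefix f rest child
  local -A seen=()
  prefix=$(IFS=_; printf '%s' "$*")
  for f in "$prefix"_*.sh; do
    [[ -e $f ]] || continue            # the pattern matched nothing
    rest=${f#"$prefix"_}               # strip the subtree prefix
    rest=${rest%.sh}                   # strip the extension
    child=${rest%%_*}                  # keep only the next component
    [[ -n $child ]] || continue
    [[ ${seen[$child]+x} ]] && continue
    seen[$child]=1
    printf '%s\n' "$child"
  done
}

list_children 1        # with the test dataset: cat and dog, one per line
list_children 1 dog    # with the test dataset: the two dog breeds
```

Output is newline-delimited for readability, so it would need a different delimiter if keys could contain newlines.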
Charles Duffy