How to parse YAML data into a custom Bash data array/hash structure?

Question

I have the following YAML file:

site:
  title: My blog
  domain: example.com
  author1:
    name: bob
    url: /author/bob
  author2:
    name: jane
    url: /author/jane
  header_links:
    about:
      title: About
      url: about.html
    contact:
      title: Contact Us
      url: contactus.html
  js_deps:
    - cashjs
    - jets

products:
  product1:
    name: Prod One
    price: 10
  product2:
    name: Prod Two
    price: 20

And I'd like a Bash, Python or AWK function or script that can take the YAML file above as input ($1), and generate then execute the following code (or something exactly equivalent):

unset site_title 
unset site_domain
unset site_author1
unset site_author2
unset site_header_links
unset site_header_links_about
unset site_header_links_contact
unset js_deps

site_title="My blog"
site_domain="example.com"

declare -A site_author1
declare -A site_author2

site_author1=(
  [name]="bob"
  [url]="/author/bob"
)

site_author2=(
  [name]="jane"
  [url]="/author/jane"
)

declare -A site_header_links_about
declare -A site_header_links_contact

site_header_links_about=(
  [name]="About"
  [url]="about.html"
)

site_header_links_contact=(
  [name]="Contact Us"
  [url]="contactus.html"
)

site_header_links=(site_header_links_about  site_header_links_contact)

js_deps=(cashjs jets)

unset products
unset product1
unset product2

declare -A product1
declare -A product2

product1=(
  [name]="Prod One"
  [price]=10
)

product2=(
  [name]="Prod Two"
  [price]=20
)

products=(product1 product2)

So, the logic is:

Go through the YAML and create underscore-concatenated variable names with string values, except at the last (bottom) level, where the data should be created as an associative array or indexed array wherever possible. Also, any associative arrays created should be listed by name in an indexed array.

So, in other words:

wherever the last level of data can be turned into an associative array, then it should be (foo.bar.hash => ${foo_bar_hash[@]})
wherever the last level of data can be turned into an indexed array, then it should be (foo.bar.list => ${foo_bar_list[@]})
every assoc array should be listed by name in an indexed array which is named after its parent in the YAML data (see products in the example)
else, just make an underscore-concatenated var name and save the value as a string (foo.bar.string => ${foo_bar_string})
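
As a rough illustration only, the four rules above could be sketched in Python over data that has already been parsed into plain dicts and lists. The function names (is_scalar, is_leaf_dict, emit, to_bash) are hypothetical, and the sketch makes assumptions the question does not state: every generated name uses the full underscore-joined path (so products_product1 rather than product1), keys are simple identifiers, and values are quoted with shlex.quote rather than always wrapped in double quotes.

```python
import shlex


def is_scalar(value):
    """True for strings, numbers, etc. (anything that is not a dict or list)."""
    return not isinstance(value, (dict, list))


def is_leaf_dict(value):
    """True for a mapping whose values are all scalars (a 'bottom level' hash)."""
    return isinstance(value, dict) and all(is_scalar(v) for v in value.values())


def emit(name, node, out):
    """Append to `out` the Bash statements that store `node` under `name`."""
    if is_scalar(node):
        # rule 4: plain variable, e.g. foo_bar_string=value
        out.append(f"{name}={shlex.quote(str(node))}")
    elif isinstance(node, list) and all(is_scalar(v) for v in node):
        # rule 2: bottom-level list becomes an indexed array
        items = " ".join(shlex.quote(str(v)) for v in node)
        out.append(f"{name}=({items})")
    elif is_leaf_dict(node):
        # rule 1: bottom-level mapping becomes an associative array
        pairs = " ".join(f"[{k}]={shlex.quote(str(v))}" for k, v in node.items())
        out.append(f"declare -A {name}=({pairs})")
    elif isinstance(node, dict):
        # deeper mapping: recurse, then (rule 3) list child assoc arrays by name
        children = []
        for key, value in node.items():
            child = f"{name}_{key}"
            emit(child, value, out)
            if is_leaf_dict(value):
                children.append(child)
        if children:
            out.append(f"{name}=({' '.join(children)})")
    else:
        raise NotImplementedError(f"unsupported node under {name}: {node!r}")


def to_bash(data):
    """Return Bash source text for a parsed YAML document (a dict)."""
    out = []
    for key, value in data.items():
        emit(key, value, out)
    return "\n".join(out)
```

Note that `declare -A` inside a Bash function creates a local variable, so the generated text would need to be evaluated at the top level of the script (or use `declare -g -A`) for the arrays to stay visible.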

...The reason I need this specific Bash data structure is that I'm using a Bash-based templating system which requires it.

Once I have the function I need, I will be able to use the YAML data easily in my templates, like so:

{{site_title}}

...

{{#foreach link in site_header_links}}
  <a href="{{link.url}}">{{link.name}}</a>
{{/foreach}}

...

{{#js_deps}}
  {{.}}
{{/js_deps}}

...

{{#foreach item in products}}
  {{item.name}}
  {{item.price}}
{{/foreach}}

What I tried:

This is totally related to a previous question I asked:

How to convert a subset of YAML into an indexed array of associative arrays?

This is so close, but I need an associative array of site_header_links to be generated OK as well .. it fails because site_header_links is nested too deep.

I would still love to use https://github.com/azohra/yaml.sh in the solution, as it would provide an easy handlebars-style lookup rip-off for the templating system too :)

EDIT:

To be super clear: The solution cannot use pip, virtualenv, or any other external deps that need installing separately - it must be a self-contained script/func (like https://github.com/azohra/yaml.sh is) which can live inside the CMS project dir... or I wouldn't need to be here..

...

Hopefully, a nicely commented answer might help me avoid coming back here ;)

It's quite clear what I tried and what does not work - I state in the OP that my previous post doesn't work for me because "I need an associative array of `site_header_links` to be generated OK as well .. it fails because `site_header_links` is nested too deep". I tried lots of things, but none of them worked - just hacks at the previous solution that got nowhere. I don't see it as broad - I simply want what 90% of shell-based YAML parsers do: to create `_` concatenated vars - _except_ I want indexed/assoc arrays on the last level (and the assoc arrays listed by name in indexed arrays).. — sc0ttj, Aug 11 '19 at 12:12
I have summarised what I tried already by linking to another question, which is part of my journey here, and contains links to _all_ the libraries I tried to hack at, as well as what the closest solution is, and why it still doesn't quite do what I need... I don't think it's appropriate to spam my posts with all the _many, many_ failures I had, when the libraries I linked to are closer than I got anyway.. :/ — sc0ttj, Aug 11 '19 at 12:24
Why is there a variable `product1` in the output and not `products_product1`? Why is there no `products_product1_name` variable? How do you decide which level gets an associative array, and which is named with underscores? Why is there `site_header_links_contact`, but no `site=([title]="My blog")` array? Why a `site_title` variable and not a `site` array? — KamilCuk, Aug 11 '19 at 13:38
I do not need `products_product1_name`, as I will have `product1[name]`, accessible (in my templating system) as simply `{{foreach product in products}}` .. that is why .. I don't need vars containing the stuff already in arrays .. where arrays _can_ be created at the bottom level, they should be (instead of the concatenated vars you'd otherwise have)... No `site` array because, as we all know, Bash doesn't do _multi-dimensional objects_ and it's not at the _bottom level_..... — sc0ttj, Aug 11 '19 at 20:58
So in other words, I want arrays from the bottom level, cos Bash can do that, but it can't make multi-dimensional arrays of the other levels, so listing arrays by name is a hack ... Once the data structure is in place, my templating system can iterate over things _as if they were 2 levels (or more) deep_ .. — sc0ttj, Aug 11 '19 at 21:03

sc0ttj · Accepted Answer · 2019-08-21T10:10:23.233

I have decided to use a combination of the following:

a hacked version of Yay:
- with added support for simple lists
- fixes for multiple indentation levels
a hacked version of this yaml parser:
- with prefix stuff borrowed from Yay, for consistency

function yaml_to_vars {
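   # find input file (try the name as given, then with .yay / .yml appended)
   local input="" f
   for f in "$1" "$1.yay" "$1.yml"
   do
     [[ -f "$f" ]] && input="$f" && break
   done
   # return (not exit), so a direct call never kills the calling shell
   [[ -z "$input" ]] && return 1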

   # use given dataset prefix or imply from file name
   [[ -n "$2" ]] && local prefix="$2" || {
     local prefix=$(basename "$input"); prefix=${prefix%.*}; prefix="${prefix//-/_}_";
   }

   local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
   sed -ne "s|,$s\]$s\$|]|" \
        -e ":1;s|^\($s\)\($w\)$s:$s\[$s\(.*\)$s,$s\(.*\)$s\]|\1\2: [\3]\n\1  - \4|;t1" \
        -e "s|^\($s\)\($w\)$s:$s\[$s\(.*\)$s\]|\1\2:\n\1  - \3|;p" "$input" | \
   sed -ne "s|,$s}$s\$|}|" \
        -e ":1;s|^\($s\)-$s{$s\(.*\)$s,$s\($w\)$s:$s\(.*\)$s}|\1- {\2}\n\1  \3: \4|;t1" \
        -e    "s|^\($s\)-$s{$s\(.*\)$s}|\1-\n\1  \2|;p" | \
   sed -ne "s|^\($s\):|\1|" \
        -e "s|^\($s\)-$s[\"']\(.*\)[\"']$s\$|\1$fs$fs\2|p" \
        -e "s|^\($s\)-$s\(.*\)$s\$|\1$fs$fs\2|p" \
        -e "s|^\($s\)\($w\)$s:$s[\"']\(.*\)[\"']$s\$|\1$fs\2$fs\3|p" \
        -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" | \
   awk -F$fs '{
      indent = length($1)/2;
      vname[indent] = $2;
      for (i in vname) {if (i > indent) {delete vname[i]; idx[i]=0}}
      if(length($2)== 0){  vname[indent]= ++idx[indent] };
      if (length($3) > 0) {
         vn=""; for (i=0; i<indent; i++) { vn=(vn)(vname[i])("_")}
         printf("%s%s%s=\"%s\"\n", "'$prefix'",vn, vname[indent], $3);
      }
   }'
}

yay_parse() {

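   # find input file (try the name as given, then with .yay / .yml appended)
   local input="" f
   for f in "$1" "$1.yay" "$1.yml"
   do
     [[ -f "$f" ]] && input="$f" && break
   done
   # return (not exit), so a direct call never kills the calling shell
   [[ -z "$input" ]] && return 1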

   # use given dataset prefix or imply from file name
   [[ -n "$2" ]] && local prefix="$2" || {
     local prefix=$(basename "$input"); prefix=${prefix%.*}; prefix=${prefix//-/_};
   }

   echo "unset $prefix; declare -g -a $prefix;"

   local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
   #sed -n -e "s|^\($s\)\($w\)$s:$s\"\(.*\)\"$s\$|\1$fs\2$fs\3|p" \
   #       -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" "$input" |
   sed -ne "s|,$s\]$s\$|]|" \
        -e ":1;s|^\($s\)\($w\)$s:$s\[$s\(.*\)$s,$s\(.*\)$s\]|\1\2: [\3]\n\1  - \4|;t1" \
        -e "s|^\($s\)\($w\)$s:$s\[$s\(.*\)$s\]|\1\2:\n\1  - \3|;p" "$input" | \
   sed -ne "s|,$s}$s\$|}|" \
        -e ":1;s|^\($s\)-$s{$s\(.*\)$s,$s\($w\)$s:$s\(.*\)$s}|\1- {\2}\n\1  \3: \4|;t1" \
        -e    "s|^\($s\)-$s{$s\(.*\)$s}|\1-\n\1  \2|;p" | \
   sed -ne "s|^\($s\):|\1|" \
        -e "s|^\($s\)-$s[\"']\(.*\)[\"']$s\$|\1$fs$fs\2|p" \
        -e "s|^\($s\)-$s\(.*\)$s\$|\1$fs$fs\2|p" \
        -e "s|^\($s\)\($w\)$s:$s[\"']\(.*\)[\"']$s\$|\1$fs\2$fs\3|p" \
        -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" | \
   awk -F$fs '{
      indent       = length($1)/2;
      key          = $2;
      value        = $3;

      # No prefix or parent for the top level (indent zero)
      root_prefix  = "'$prefix'_";
      if (indent == 0) {
        prefix = "";          parent_key = "'$prefix'";
      } else {
        prefix = root_prefix; parent_key = keys[indent-1];
      }

      keys[indent] = key;

      # remove keys left behind if prior row was indented more than this row
      for (i in keys) {if (i > indent) {delete keys[i]}}

      # if we have a value
      if (length(value) > 0) {

        # set values here

        # if the "key" is missing, make array indexed, not assoc..

        if (length(key) == 0) {
          # array item has no key, only a value..
          # so, if we didnt already unset the assoc array
          if (unsetArray == 0) {
            # unset the assoc array here
            printf("unset %s%s; ", prefix, parent_key);
            # switch the flag, so we only unset once, before adding values
            unsetArray = 1;
          }
          # array was unset, has no key, so add item using indexed array syntax
          printf("%s%s+=(\"%s\");\n", prefix, parent_key, value);

        } else {
          # array item has key and value, add item using assoc array syntax
          printf("%s%s[%s]=\"%s\";\n", prefix, parent_key, key, value);
        }

      } else {

        # declare arrays here

        # reset this flag for each new array we work on...
        unsetArray = 0;

        # if item has no key, declare indexed array
        if (length(key) == 0) {
          # indexed
          printf("unset %s%s; declare -g -a %s%s;\n", root_prefix, key, root_prefix, key);

        # if item has numeric key, declare indexed array
        } else if (key ~ /^[[:digit:]]/) {
          printf("unset %s%s; declare -g -a %s%s;\n", root_prefix, key, root_prefix, key);

        # else (item has a string for a key), declare associative array
        } else {
          printf("unset %s%s; declare -g -A %s%s;\n", root_prefix, key, root_prefix, key);
        }

        # set root level values here

        if (indent > 0) {
          # add to associative array
          printf("%s%s[%s]+=\"%s%s\";\n", prefix, parent_key , key, root_prefix, key);
        } else {
          # add to indexed array
          printf("%s%s+=( \"%s%s\");\n", prefix, parent_key , root_prefix, key);
        }

      }
   }'
}

# helper to load yay data file
yay() {
  # yaml_to_vars "$@"  ## uncomment to debug (prints data to stdout)
  eval "$(yaml_to_vars "$@")"

  # yay_parse "$@"  ## uncomment to debug (prints data to stdout)
  eval "$(yay_parse "$@")"
}
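
For anyone trying to follow the awk stage in yaml_to_vars, here is a rough Python sketch of the same idea: remember which key is open at each indent level, forget deeper levels when the indentation drops, and join the open keys with underscores whenever a line carries a value. It is only an illustration, not a replacement for the parser. The function name flatten is hypothetical, and the sketch assumes two-space indentation and plain "key: value" lines (no lists or flow-style collections).

```python
def flatten(lines, prefix=""):
    """Return (variable_name, value) pairs for two-space-indented 'key: value' lines."""
    path = {}      # indent level -> key currently open at that level
    result = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        indent = (len(raw) - len(raw.lstrip(" "))) // 2
        key, _, value = raw.strip().partition(":")
        value = value.strip().strip("\"'")  # drop surrounding quotes, if any
        # forget keys left behind by rows that were indented as deep or deeper
        path = {level: k for level, k in path.items() if level < indent}
        path[indent] = key.strip()
        if value:
            name = "_".join(path[level] for level in sorted(path))
            result.append((prefix + name, value))
    return result
```

Calling flatten on the lines of a small products file with the prefix "products_" would give pairs such as ("products_product1_name", "Foo"), which is the shape of the plain variables that yaml_to_vars prints.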

Using the code above, when products.yml contains:

product1:
  name: Foo
  price: 100
product2:
  name: Bar
  price: 200

the parser can be called like so:

source path/to/yml-parser.sh
yay products.yml

And it generates and then evaluates this code:

products_product1_name="Foo"
products_product1_price="100"
products_product2_name="Bar"
products_product2_price="200"
unset products;
declare -g -a products;
unset products_product1;
declare -g -A products_product1;
products+=( "products_product1");
products_product1[name]="Foo";
products_product1[price]="100";
unset products_product2;
declare -g -A products_product2;
products+=( "products_product2");
products_product2[name]="Bar";
products_product2[price]="200";

So, I get the following Bash arrays and variables:

declare -a products=([0]="products_product1" [1]="products_product2")
declare -A products_product1=([price]="100" [name]="Foo" )
declare -A products_product2=([price]="200" [name]="Bar" )

And in my templating system, I can now access this yml data like so:

{{#foreach product in products}}
  Name:  {{product.name}}
  Price: {{product.price}}
{{/foreach}}

:)

Another example:

File site.yml

meta_info:
  title: My cool blog
  domain: foo.github.io
author1:
  name: bob
  url: /author/bob
author2:
  name: jane
  url: /author/jane
header_links:
  link1:
    title: About
    url: about.html
  link2:
    title: Contact Us
    url: contactus.html
js_deps:
  cashjs: cashjs
  jets: jets
Foo:
  - one
  - two
  - three

Produces:

declare -a site=([0]="site_meta_info" [1]="site_author1" [2]="site_author2" [3]="site_header_links" [4]="site_js_deps" [5]="site_Foo")
declare -A site_meta_info=([title]="My cool blog" [domain]="foo.github.io" )
declare -A site_author1=([url]="/author/bob" [name]="bob" )
declare -A site_author2=([url]="/author/jane" [name]="jane" )
declare -A site_header_links=([link1]="site_link1" [link2]="site_link2" )
declare -A site_link1=([url]="about.html" [title]="About" )
declare -A site_link2=([url]="contactus.html" [title]="Contact Us" )
declare -A site_js_deps=([cashjs]="cashjs" [jets]="jets" )
declare -a site_Foo=([0]="one" [1]="two" [2]="three")

In my templates, I can access site_header_links like so:

{{#foreach link in site_header_links}}
  * {{link.title}} - {{link.url}}
{{/foreach}}

and site_Foo (a dash-notation, or simple list) like so:

{{#site_Foo}}
  * {{.}}
{{/site_Foo}}

Anthon · Answer 2 · 2019-08-12T05:49:08.603

It is difficult to see what the rules of a card game are by just looking at people playing one round. In a similar way, it is difficult to see exactly what the "rules" of your YAML file are.

In the following I have made assumptions about the root level as well as the first-, second- and third-level nodes and what output they generate. It would also be valid to make assumptions about a node based on the level of parents it has. That is more flexible (as you can then just add e.g. a sequence at the root level), but it would be somewhat more difficult to implement.

Keeping the declares and compound array assignments interspersed with the other code and grouped for "similar" items is kind of cumbersome. For that you would need to keep track of transitions between types of nodes (str, dict, nested dict) and group on that. So per root-level key I dump all unsets first, then all declares, then all assignments and then all compound assignments. I think that falls under "something exactly equivalent".
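
To make that ordering concrete, here is a minimal sketch of the dump step as a single helper. The name dump_sections is hypothetical and is not part of the program further down, which repeats the equivalent printing logic inside each function. The sketch assumes the same four collections: a list of names to unset, a list of names to declare, a list of ready-made assignment lines, and a dict mapping an array name to its items.

```python
import sys


def dump_sections(unsets, declares, assignments, compounds, fp=sys.stdout):
    """Write the four groups in a fixed order, each followed by a blank line."""
    sections = [
        [f"unset {name}" for name in unsets],
        [f"declare -A {name}" for name in declares],
        list(assignments),
        [f"{name}=({' '.join(items)})" for name, items in compounds.items()],
    ]
    for lines in sections:
        if not lines:
            continue  # skip empty groups entirely, including their blank line
        for line in lines:
            print(line, file=fp)
        print(file=fp)
```

Passing an io.StringIO object as fp captures the text instead of printing it, which is handy for checking the output before it gets sourced.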

Since products -> product1/product2 is handled completely differently from site -> author1/author2, even though they have the same node structure, I made a separate function to handle each root-level key.

To get this to run you should set up a virtual environment for Python (3.7/3.6) and install the YAML library in it:

$ python -m venv /opt/util/yaml2bash
$ /opt/util/yaml2bash/bin/pip install ruamel.yaml

Then store the following program e.g. in /opt/util/yaml2bash/bin/yaml2bash and make it executable (chmod +x /opt/util/yaml2bash/bin/yaml2bash)

#! /opt/util/yaml2bash/bin/python

import sys
from pathlib import Path
import ruamel.yaml

if len(sys.argv) > 1:  # argv[0] is the script itself, so > 0 would always be true
    input = Path(sys.argv[1])
else:
    input = sys.stdin


def bash_site(k0, v0, fp):
    """this function takes a root-level key and its value (v0 a dict), constructs the 
    list of unsets and outputs based on the keys, values and type of values of v0,
    then dumps these to fp
    """
    unsets = []
    declares = []
    assignments = []
    compounds = {}
    for k1, v1 in v0.items():
        if isinstance(v1, str):
            k = k0 + '_' + k1
            unsets.append(k)
            assignments.append(f'{k}="{v1}"')
        elif isinstance(v1, dict):
            first_val = list(v1.values())[0]
            if isinstance(first_val, str):
                k = k0 + '_' + k1
                unsets.append(k)
                declares.append(k)
                assignments.append(f'{k}=(')
                for k2, v2 in v1.items():
                    q = '"' if isinstance(v2, str) else ''
                    assignments.append(f'  [{k2}]={q}{v2}{q}')
                assignments.append(')')
            elif isinstance(first_val, dict):
                for k2, v2 in v1.items(): # assume all the same type
                    k = k0 + '_' + k1 + '_' + k2   
                    unsets.append(k)
                    declares.append(k)
                    assignments.append(f'{k}=(')
                    for k3, v3 in v2.items():
                        q = '"' if isinstance(v3, str) else ''
                        assignments.append(f'  [{k3}]={q}{v3}{q}')  # index by the inner key (k3), not k2
                    assignments.append(')')
                    compounds.setdefault(k0 + '_' + k1, []).append(k)
            else:
                raise NotImplementedError("unknown val: " + repr(first_val))
        elif isinstance(v1, list):
            unsets.append(k1)
            compounds[k1] = v1
        else:
            raise NotImplementedError("unknown val: " + repr(v1))


    if unsets:
        for item in unsets:
            print('unset', item, file=fp)
        print(file=fp)
    if declares:
        for item in declares:
            print('declare -A', item, file=fp)
        print(file=fp)
    if assignments:
        for item in assignments:
            print(item, file=fp)
        print(file=fp)
    if compounds:
        for k in compounds:
            v = ' '.join(compounds[k])
            print(f'{k}=({v})', file=fp)
        print(file=fp)


def bash_products(k0, v0, fp):
    """this function takes a root-level key and its value (v0 a dict), constructs the 
    list of unsets and outputs based on the keys, values and type of values of v0,
    then dumps these to fp
    """
    unsets = [k0]
    declares = []
    assignments = []
    compounds = {}
    for k1, v1 in v0.items():
        if isinstance(v1, dict):
            first_val = list(v1.values())[0]
            if isinstance(first_val, str):
                unsets.append(k1)
                declares.append(k1)
                assignments.append(f'{k1}=(')
                for k2, v2 in v1.items():
                    q = '"' if isinstance(v2, str) else ''
                    assignments.append(f'  [{k2}]={q}{v2}{q}')
                assignments.append(')')
                compounds.setdefault(k0, []).append(k1)
            else:
                raise NotImplementedError("unknown val: " + repr(first_val))
        else:
            raise NotImplementedError("unknown val: " + repr(v1))


    if unsets:
        for item in unsets:
            print('unset', item, file=fp)
        print(file=fp)
    if declares:
        for item in declares:
            print('declare -A', item, file=fp)
        print(file=fp)
    if assignments:
        for item in assignments:
            print(item, file=fp)
        print(file=fp)
    if compounds:
        for k in compounds:
            v = ' '.join(compounds[k])
            print(f'{k}=({v})', file=fp)
        print(file=fp)




yaml = ruamel.yaml.YAML()
data = yaml.load(input)

output = sys.stdout  # make it easier to redirect to file if necessary at some point in the future

bash_site('site', data['site'], output)
bash_products('products', data['products'], output)

If you run this program and provide your YAML input file as an argument (/opt/util/yaml2bash/bin/yaml2bash input.yaml), it gives:

unset site_title
unset site_domain
unset site_author1
unset site_author2
unset site_header_links_about
unset site_header_links_contact
unset js_deps

declare -A site_author1
declare -A site_author2
declare -A site_header_links_about
declare -A site_header_links_contact

site_title="My blog"
site_domain="example.com"
site_author1=(
  [name]="bob"
  [url]="/author/bob"
)
site_author2=(
  [name]="jane"
  [url]="/author/jane"
)
site_header_links_about=(
  [title]="About"
  [url]="about.html"
)
site_header_links_contact=(
  [title]="Contact Us"
  [url]="contactus.html"
)

site_header_links=(site_header_links_about site_header_links_contact)
js_deps=(cashjs jets)

unset products
unset product1
unset product2

declare -A product1
declare -A product2

product1=(
  [name]="Prod One"
  [price]=10
)
product2=(
  [name]="Prod Two"
  [price]=20
)

products=(product1 product2)

You can do something like source <(/opt/util/yaml2bash/bin/yaml2bash input.yaml) to get all these values in Bash.

Please note that all the double quotes in your YAML file are superfluous.

Using Python and ruamel.yaml (disclaimer: I am the author of that package) gives you a full YAML parser, e.g. allowing you to use comments and flow-style collections:

jsdeps: [cashjs, jets]    # more compact

If you are stuck with the almost end-of-life Python 2.7 and don't have full control over your machine (in which case you should install/compile Python 3.7 for it), you can still use ruamel.yaml:

1. Decide on where your program goes, e.g. ~/bin
2. Create ~/bin/ruamel (adjust as per 1.)
3. cd ~/bin/ruamel
4. touch __init__.py
5. Download the latest tar file from PyPI
6. Unpack the tar file and rename the resulting directory from ruamel.yaml-X.Y.Z to just yaml

ruamel.yaml should work without its dependencies. On 2.7 those are ruamel.ordereddict and ruamel.yaml.clib that provide C versions of base routines for speed-up.

The above program would need a little rewriting (f-strings -> "".format(), and pathlib.Path -> the old-fashioned with open(...) as fp:).
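
As a minimal sketch of those two rewrites (not the full converted program), the helpers below avoid f-strings and pathlib so that they should run on both Python 2.7 and Python 3. The names open_input and assoc_lines are hypothetical, and the basestring fallback is an assumption made so that Python 2 unicode strings are still quoted.

```python
import sys

try:
    string_types = basestring  # Python 2: covers both str and unicode
except NameError:
    string_types = str         # Python 3


def open_input(argv):
    """Return a readable file object: the named file, or stdin if none was given."""
    if len(argv) > 1:
        return open(argv[1])
    return sys.stdin


def assoc_lines(name, mapping):
    """Build the lines of one associative-array assignment using str.format()."""
    lines = ['{}=('.format(name)]
    for key, value in mapping.items():
        # quote strings, leave numbers bare (same convention as the program above)
        q = '"' if isinstance(value, string_types) else ''
        lines.append('  [{}]={}{}{}'.format(key, q, value, q))
    lines.append(')')
    return lines
```

The file object returned by open_input can be passed to the YAML loader in place of the Path, ideally inside a with block when it is a real file so that it gets closed.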

`/opt/util` is just an example directory, and you can download and install the `ruamel.yaml` tar-file anywhere your script is able to find it. — Anthon, Aug 11 '19 at 12:49
So this `ruamel.yaml` is a single script (or scripts in a single dir) with no external deps (like other pip packages, additional python modules/libs, etc) and doesn't need anything other than a "vanilla" python 2.7 install? ...If so, I could prob use that.. — sc0ttj, Aug 11 '19 at 20:48
Python 2.7 will be end-of-life on Jan 1st next year, so I am not sure doing anything with that is a good choice (but then your whole system sounds like it is outdated). I updated my answer. — Anthon, Aug 12 '19 at 05:08
