awk dynamic document indexing

Question

I have a document in which I need to dynamically create/update the indexes. I am trying to accomplish this with awk. I have a partially working example, but now I'm stumped.

The example document is as follows.

number.txt:
    #) Title
    #) Title
    #) Title
    #.#) Subtitle
    #.#.#) Section
    #.#) Subtitle
    #) Title
    #) Title
    #.#) Subtitle
    #.#.#) Section
    #.#) Subtitle
    #.#.#) Section
    #.#.#.#) Subsection
    #) Title
    #) Title
    #.#) Subtitle
    #.#.#) Section
    #.#.#.#) Subsection
    #.#.#.#) Subsection

The desired output would be:

1) Title
2) Title
3) Title
3.1) Subtitle
3.1.1) Section
3.2) Subtitle
4) Title
5) Title
5.1) Subtitle
5.1.1) Section
5.2) Subtitle
5.2.1) Section
5.2.1.1) Subsection
6) Title
7) Title
7.1) Subtitle
7.1.1) Section
7.1.1.1) Subsection
7.1.1.2) Subsection

The awk code that I have which partially works is as follows.

numbers.sh:
    awk '{for(w=1;w<=NF;w++)if($w~/^#\)/){sub(/^#/,++i)}}1' number.txt

Any help with this would be greatly appreciated.

I think this will make a great interview question. – karakfa, Dec 07 '15 at 17:55

Facundo Victor · Answer 1 · 2015-12-07T14:17:47.543

I have implemented an AWK script for you! It also works for more than four index levels! ;)

I will try to explain it a little with inline comments:

#!/usr/bin/awk -f

# Clears the "array" starting from "from"                                       
function cleanArray(array,from){                                                
    for(w=from;w<=length(array);w++){                                           
        array[w]=0                                                              
    }                                                                           
}                                                                               

# This is executed only one time at beginning.                                  
BEGIN {                                                                         
    # The key of this array will be used to point to the "text index".
    # I.E., an array with (1 2 2) means an index "1.2.2)"           
    array[1]=0      
}                                                                               

# This block will be executed for every line.                                   
{                                                                               
    # Amount of "#" found.                                                      
    amount=0                                                                    

    # In this line will be stored the result of the line.                       
    line=""                                                                     

    # Let's save the entire line in a variable to modify it.                    
    rest_of_line=$0                                                             

    # While the line still starts with "#"...                                   
    while(rest_of_line ~ /^#/){                                                 

        # We remove the first 2 characters: the "#" and the "." or ")" after it.
        rest_of_line=substr(rest_of_line, 3, length(rest_of_line))              

        # We found one "#", let's count it!                                     
        amount++                                                                

        # The line still starts with "#"?                                       
        if(rest_of_line ~ /^#/){                                                
            # yes, it still starts.                                             

            # let's print the appropriate number and a ".".
            line=line""array[amount]                                            
            line=line"."                                                        
        }else{                                                                  
            # no, so we must add 1 to the old value of the array.       
            array[amount]++                                                     

            # And we must clean the array if it stores more values              
            # starting from amount plus 1. We don't want to keep                
            # storing garbage numbers that may harm our accounting              
            # for the next line.                                                
            cleanArray(array,amount + 1)                                        

            # let's print the appropriate number and a ")".
            line=line""array[amount]                                            
            line=line")"                                                        
        }                                                                       
    }                                                                           

    # Great! We have the line with the appropriate indexes!
    print line""rest_of_line                                                    
}

So, if you save it as script.awk, you can make it executable by adding execute permission to the file:

chmod u+x script.awk

Finally, you can execute it:

./script.awk <path_to_number.txt>

As an example, if you save script.awk in the same directory as number.txt, change to that directory and execute:

./script.awk number.txt

So, if you have this number.txt

#) Title
#) Title
#) Title
#.#) Subtitle
#.#.#) Section
#.#) Subtitle
#) Title
#) Title
#.#) Subtitle
#.#.#) Section
#.#) Subtitle
#.#.#) Section
#.#.#.#) Subsection
#) Title
#) Title
#.#) Subtitle
#.#.#) Section
#.#.#.#) Subsection
#.#.#.#.#) Subsection
#.#.#.#.#) Subsection
#.#.#.#.#) Subsection
#.#.#.#.#.#) Subsection
#.#.#.#.#) Subsection
#.#.#.#.#.#) Subsection
#.#.#.#.#.#) Subsection
#.#.#.#.#.#) Subsection
#.#.#.#.#.#) Subsection
#.#.#.#.#) Subsection
#.#.#.#) Subsection
#.#.#) Section

This will be the output (note that the solution is not limited by the number of "#"):

1) Title
2) Title
3) Title
3.1) Subtitle
3.1.1) Section
3.2) Subtitle
4) Title
5) Title
5.1) Subtitle
5.1.1) Section
5.2) Subtitle
5.2.1) Section
5.2.1.1) Subsection
6) Title
7) Title
7.1) Subtitle
7.1.1) Section
7.1.1.1) Subsection
7.1.1.1.1) Subsection
7.1.1.1.2) Subsection
7.1.1.1.3) Subsection
7.1.1.1.3.1) Subsection
7.1.1.1.4) Subsection
7.1.1.1.4.1) Subsection
7.1.1.1.4.2) Subsection
7.1.1.1.4.3) Subsection
7.1.1.1.4.4) Subsection
7.1.1.1.5) Subsection
7.1.1.2) Subsection
7.1.2) Section
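
The same counter logic can also be sketched without calling length() on an array, for an awk where `length` may not accept an array argument. This is only an illustrative variant, not the answer's script: the names `renumber_maxdepth`, `clean_from`, `count` and `maxdepth` are made up here, and it assumes, like the script above, that every placeholder is exactly "#" followed by "." or ")".

```shell
renumber_maxdepth() {
  awk '
    # Zero every counter deeper than "from"; "w" is declared as a local variable.
    function clean_from(from,    w) {
      for (w = from; w <= maxdepth; w++) count[w] = 0
    }
    {
      rest = $0; prefix = ""; depth = 0
      while (rest ~ /^#/) {
        rest = substr(rest, 3)          # drop "#" plus the "." or ")" after it
        depth++
        if (depth > maxdepth) maxdepth = depth
        if (rest ~ /^#/) {
          prefix = prefix count[depth] "."   # an ancestor level: reuse its counter
        } else {
          count[depth]++                     # the last level: increment it
          clean_from(depth + 1)              # and forget deeper counters
          prefix = prefix count[depth] ")"
        }
      }
      print prefix rest
    }
  ' "$@"
}
```

It can be run as `renumber_maxdepth number.txt`, or fed from a pipe.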

I hope it helps you!

Facundo, it's wonderful that you donate your time in this way. To keep the good will and usefulness going in all directions, may I recommend that you add some explanation to your answer, so that it functions as an education tool as well as merely a solution to the one problem described in the question? — ghoti, Dec 07 '15 at 04:18

karakfa · Answer 2 · 2015-12-07T17:53:16.720

awk to the rescue!

I'm not sure this is the optimal way of doing this, but it works...

awk    'BEGIN{d="."}
/#\.#\.#\.#/ {sub("#.#.#.#", i d a[i] d b[i d a[i]] d (++c[i d a[i] d b[i d a[i]]]))}
   /#\.#\.#/ {sub("#.#.#"  , i d a[i] d (++b[i d a[i]]))}
      /#\.#/ {sub("#.#"    , i d (++a[i]))}
         /#/ {sub("#"      , (++i))} 1'

UPDATE: The above is limited to only 4 levels. Here is a better one for an unlimited number of levels:

 awk '{d=split($1,a,"#")-1;                # find the depth
       c[d]++;                             # increase counter for current          
       for(i=pd+1;i<=d;i++) c[i]=1;        # reset when depth increases
       for(i=1;i<=d;i++) {sub(/#/,c[i])};  # replace placeholders one by one
       pd=d} 1'                            # set previous depth and print

Perhaps the reset step can be combined with the main loop, but I think it is clearer this way.
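
To see what the increment and reset steps do, here is a small tracing sketch built on the same logic. The function name `trace_depth` and the output format are invented for illustration; it prints the depth, the previous depth and the resulting counters instead of rewriting the line.

```shell
trace_depth() {
  awk '{
    d = split($1, a, "#") - 1                  # depth = number of "#" in the first field
    c[d]++                                     # bump the counter at this depth
    for (i = pd + 1; i <= d; i++) c[i] = 1     # levels deeper than the last line restart at 1
    state = c[1]
    for (i = 2; i <= d; i++) state = state "." c[i]
    printf "depth=%d prev=%d -> %s\n", d, pd, state
    pd = d                                     # remember the depth for the next line
  }' "$@"
}

printf '%s\n' '#) T' '#.#) S' '#.#.#) X' '#) T' '#.#) S' | trace_depth
# depth=1 prev=0 -> 1
# depth=2 prev=1 -> 1.1
# depth=3 prev=2 -> 1.1.1
# depth=1 prev=3 -> 2
# depth=2 prev=1 -> 2.1
```

The last line shows why the reset matters: c[2] still holds a stale value from the earlier subsection, and the loop puts it back to 1.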

UPDATE 2:

I think with this logic, the following is the shortest possible.

$ awk '{d=split($1,_,"#")-1;      # find the depth
        c[d]++;                   # increment counter for current depth
        for(i=1;i<=d;i++)         # start replacement
           {if(i>pd)c[i]=1;       # reset the counters
            sub(/#/,c[i])         # replace placeholders with counters
           }
           pd=d} 1' file          # set the previous depth

or as a one-liner

$ awk '{d=split($1,_,"#")-1;c[d]++;for(i=1;i<=d;i++){if(i>pd)c[i]=1;sub(/#/,c[i])}pd=d}1'

bian · Answer 3 · 2015-12-07T06:22:59.220

gawk

awk 'function w(){
    k=m>s?m:s
    for(i=1;i<=k;i++){
        if(i>m){
            a[i]=0
        }
        else{
            a[i]=(i==m)?++a[i]:a[i]   #ended "#" increase
            sub("#",a[i]=a[i]?a[i]:1) 
        }
    }
    s=m
}
{m=split($1,t,"#")-1;w()}1' file



1) Title
2) Title
3) Title
3.1) Subtitle
3.1.1) Section
3.2) Subtitle
4) Title
5) Title
5.1) Subtitle
5.1.1) Section
5.2) Subtitle
5.2.1) Section
5.2.1.1) Subsection
6) Title
7) Title
7.1) Subtitle
7.1.1) Section
7.1.1.1) Subsection
7.1.1.2) Subsection

peak · Answer 4 · score 2 · answered Dec 07 '15 at 05:44

Same approach as @karakfa's (short and sweet) and with the same caveat about the assumed maximum number of subheadings, but a little shorter and more efficient:

awk 'BEGIN{d="."}
  /#\.#\.#\.#/ {sub("#.#.#.#", i d a d b d (++c) )}
     /#\.#\.#/ {sub("#.#.#"  , i d a d (++b) );  c=0;}
        /#\.#/ {sub("#.#"    , i d (++a));       b=0;}
           /#/ {sub("#"      , (++i));           a=0;} 1'

smart fixing the array chaos. – karakfa Dec 07 '15 at 17:54

ghoti · Answer 5 · 2015-12-15T15:09:24.767

Here's my take on this. Tested in FreeBSD, so I'd expect it to work just about anywhere...

#!/usr/bin/awk -f

BEGIN {
  depth=1;
}

$1 ~ /^#(\.#)*\)$/ {
  thisdepth=split($1, _, ".");

  if (thisdepth < depth) {
    # end of subsection, back out to current depth by deleting array values
    for (; depth>thisdepth; depth--) {
      delete value[depth];
    }
  }
  depth=thisdepth;

  # Increment value of last member
  value[depth]++;

  # And substitute it into the current line.
  for (i=1; i<=depth; i++) {
    sub(/#/, value[i], $0);
  }
}

1

The basic idea is that we maintain an array (value[]) of our nested chapter values. After updating the array as required, we step through the values, substituting the first occurrence of the octothorpe (#) each time with the current value for that position of the array.
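
The substitution step can be tried in isolation. In this sketch, `fill_placeholders` is a hypothetical helper, and the counters are passed in by hand rather than computed; it only shows that each sub() call consumes the leftmost remaining "#".

```shell
fill_placeholders() {
  # $1: a heading with "#" placeholders; $2: space-separated counters (illustrative only)
  printf '%s\n' "$1" | awk -v counters="$2" '{
    n = split(counters, value, " ")
    for (i = 1; i <= n; i++) sub(/#/, value[i])   # each call replaces the leftmost "#"
    print
  }'
}

fill_placeholders '#.#.#) Section' '7 1 2'
# prints: 7.1.2) Section
```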

This will handle any level of nesting, and as I mentioned above, it should work both in GNU (Linux) and non-GNU (FreeBSD, OSX, etc) versions of awk.

And of course, if one-liners are your thing, this can be compacted:

awk -vd=1 '$1~/^#(\.#)*\)$/{t=split($1,_,".");if(t<d)for(;d>t;d--)delete v[d];d=t;v[d]++;for(i=1;i<=d;i++)sub(/#/,v[i],$0)}1'

which could also be expressed, for easier reading, like this:

awk -vd=1 '$1~/^#(\.#)*\)$/{              # match only the lines we care about
    t=split($1,_,".");                    # this line has 't' levels
    if (t<d) for(;d>t;d--) delete v[d];   # if levels decrease, trim the array
    d=t; v[d]++;                          # reset our depth, increment last number
    for (i=1;i<=d;i++) sub(/#/,v[i],$0)   # replace hash characters one by one
  } 1'                                    # and print.

UPDATE

And after thinking about this for a bit, I realize that this can be shrunk further. The for loop contains its own condition, so there's no need to place it inside an if:

awk '{
    t=split($1,_,".");                  # get current depth
    v[t]++;                             # increment counter for depth
    for(;d>t;d--) delete v[d];          # delete record for previous deeper counters
    d=t;                                # record current depth for next round
    for (i=1;i<=d;i++) sub(/#/,v[i],$0) # replace hashes as required.
  } 1'

Which of course minifies into a one liner like this:

awk '{t=split($1,_,".");v[t]++;for(;d>t;d--)delete v[d];d=t;for(i=1;i<=d;i++)sub(/#/,v[i],$0)}1' file

Obviously, you can add the initial match condition if you require it, so that you only process lines that look like titles.
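
For instance, the shrunk version with the match condition put back might look like the following sketch. The wrapper name `renumber_headings` is made up; the pattern is the one from the first script. Lines whose first word is not a heading placeholder pass through untouched, even if they contain a "#".

```shell
renumber_headings() {
  awk '$1 ~ /^#(\.#)*\)$/ {                       # only lines that look like headings
    t = split($1, _, ".")                         # current depth
    v[t]++                                        # increment counter for this depth
    for (; d > t; d--) delete v[d]                # drop counters of deeper levels
    d = t                                         # remember depth for the next heading
    for (i = 1; i <= d; i++) sub(/#/, v[i], $0)   # fill in the placeholders
  } 1' "$@"
}

printf '%s\n' '#) Intro' 'Body text with a # in it.' '#.#) Detail' '#) Next' | renumber_headings
# 1) Intro
# Body text with a # in it.
# 1.1) Detail
# 2) Next
```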

Despite being a few characters longer, I believe this version runs ever so slightly faster than karakfa's similar solution, probably because it avoids the extra if for each iteration of the for loop.

UPDATE #2

I include this because I found it fun and interesting. You can do this in bash alone, no need for awk. And it's not much longer in terms of code.

#!/usr/bin/env bash

while read -r word line; do
  if [[ $word =~ [#](\.#)*\) ]]; then
    IFS=. read -ra a <<<"$word"
    t=${#a[@]}
    ((v[t]++))
    for (( ; d > t ; d-- )); do unset "v[$d]"; done
    d=t
    for (( i=1 ; i <= t ; i++ )); do
      word=${word/[#]/${v[i]}}
    done
  fi
  echo "$word $line"
done < input.txt

This follows the same logic as the awk script above, but works entirely in bash using Parameter Expansion to replace # characters. One flaw it suffers from is that it does not maintain whitespace around the first word on every line, so you'd lose any indents. With a bit of work, that could be mitigated too.
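
One possible way to keep the indentation, as a rough sketch only: read each line whole, peel off the leading whitespace and the first word with parameter expansion, and glue the pieces back together afterwards. The function name `renumber_keep_indent` and its variable names are invented here, and it requires bash (arrays, `[[ =~ ]]`, here-strings).

```shell
#!/usr/bin/env bash
# Sketch only (bash-specific): renumber headings while keeping each line's indentation.
renumber_keep_indent() {
  local raw indent rest word tail t i d=0
  local re='^#(\.#)*\)$'              # heading pattern, anchored to the whole word
  local -a v=() parts=()
  while IFS= read -r raw; do
    indent=${raw%%[![:space:]]*}      # leading whitespace (may be empty)
    rest=${raw#"$indent"}
    word=${rest%%[[:space:]]*}        # first word on the line
    tail=${rest#"$word"}              # remainder, original spacing intact
    if [[ $word =~ $re ]]; then
      IFS=. read -ra parts <<<"$word"
      t=${#parts[@]}                  # depth = number of dot-separated parts
      v[t]=$(( ${v[t]:-0} + 1 ))      # increment the counter for this depth
      for (( ; d > t; d-- )); do unset "v[$d]"; done
      d=$t
      for (( i = 1; i <= t; i++ )); do
        word=${word/[#]/${v[i]}}      # replace the leftmost "#"
      done
    fi
    printf '%s%s%s\n' "$indent" "$word" "$tail"
  done
}
```

Usage would be `renumber_keep_indent < input.txt`.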

Enjoy.

ramana_k · Answer 6 · 2015-12-07T05:19:34.193

Here is another way to do it.

Explanation is provided below the code.

awk 'BEGIN {n0=1; prev=0}
   {n1=split($1, elems, ".");  # Get the number of pound signs
    dif = (n1-n0);             # Increase in topic depth from previous line
    scale = (10 ^ dif);        # 10 raised to dif
    current=(int(prev*scale)+1);  # scale the number by change in depth
    withdots=gensub(/([0-9])/, "\\1." , "g", current);  # dot after every digit
    sub(/\.$/, ")", withdots);    # turn the trailing dot into ")"
    {print withdots, $2 }
     n0=n1;
     prev=current}' number.txt


1) Title
2) Title
3) Title
3.1) Subtitle
3.1.1) Section
3.2) Subtitle
4) Title

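Consider the topic numbers as decimal numbers, one digit per level (so 3.1.1 is 311).
We get the current number from the previous one by the formula int(prev * 10 ^ dif) + 1,

where dif is the increase in the number of levels from the previous line.
Initially, dif is zero, so we get 2 from 1 and 3 from 2,
by int(1 * (10 ^ 0)) + 1 = 1 + 1 = 2
and int(2 * (10 ^ 0)) + 1 = 2 + 1 = 3

Then we get 31 from 3 by int(3 * (10 ^ 1)) + 1 = 30 + 1,
and 32 from 311 by int(311 * (10 ^ -1)) + 1 = int(31.1) + 1 = 31 + 1, and so on.

Because each level is a single decimal digit, this only works while every counter stays below 10.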
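
The same arithmetic can be sketched without gensub, assuming an awk that only offers sub and gsub. The wrapper name `renumber_decimal` is made up. Like the code above, it prints only the second field as the title and treats each level as one decimal digit.

```shell
renumber_decimal() {
  awk 'BEGIN { n0 = 1; prev = 0 }
  {
    n1 = split($1, elems, ".")                 # depth of this heading
    current = int(prev * 10 ^ (n1 - n0)) + 1   # scale by the change in depth, then add 1
    withdots = current ""                      # string copy of the number
    gsub(/[0-9]/, "&.", withdots)              # "31" becomes "3.1."
    sub(/\.$/, ")", withdots)                  # "3.1." becomes "3.1)"
    print withdots, $2
    n0 = n1; prev = current
  }' "$@"
}
```

Run as `renumber_decimal number.txt`; on the first seven lines of the sample input it should reproduce the output shown above.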

Should point out that this solution works in GNU awk (gawk) but not the awk in OSX, FreeBSD, etc. `gensub` is not portable. — ghoti, Dec 07 '15 at 06:16
