Processing a delimited line in bash

Question

I am given a single line of input with 'n' space-delimited arguments. The arguments themselves vary, and the input comes from an external file.

I want to move specific elements to variables depending on regular expressions. As such, I was thinking of declaring a pointer variable first to keep track of where on the line I am. In addition, the assignment to variable is independent of numerical order, and depending on input some variables may be skipped entirely.

My current method is to use `awk '{print $1}' file.txt`. However, not all elements are fixed, and I need to account for elements that may be absent or may have multiple entries.

UPDATE: I found another method.

file=$(cat /file.txt)
for i in ${file[@]}; do 
   echo $i >> split.txt; 
done

This way, instead of a single line with multiple arguments, we get multiple lines with a single argument each. As such, we can now use `var#=$(grep --regexp="[pattern]" split.txt)`. Now I just need to figure out how best to use regular expressions to filter this mess.
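The same split can also be done in memory, without the intermediate split.txt. A minimal sketch, assuming the input is a single whitespace-delimited line; `fields` is a hypothetical array name, and the sample line is hard-coded here instead of being read from file.txt:

```bash
#!/bin/bash
# Sample line from the question; in practice: IFS= read -r line < file.txt
line="RON KKND 1534Z AUTO 253985G 034SRT 134OVC 04/32"

# read -a splits on whitespace into an indexed array (indices start at 0)
read -r -a fields <<< "$line"

printf 'count=%d\n' "${#fields[@]}"    # number of arguments on the line
printf 'fourth=%s\n' "${fields[3]}"    # the 4th argument is index 3
```

Each argument can then be addressed by index, which is what a pointer variable needs.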

Let me take an example.

My input strings are:

RON KKND 1534Z AUTO 253985G 034SRT 134OVC 04/32

RON KKND 5256Z 143623G72K 034OVC 074OVC 134SRT 145PRT 13/00

RON KKND 2234Z CON 342523G CLS 01/M12 RMK

So the variable assignment for each of the above would be:

var1=RON var2=KKND var3=1534Z var4=TRUE var5=FALSE var6=253985G varC=2 varC1=034SRT varC2=134OVC var7=04/32

var1=RON var2=KKND var3=5256Z var4=FALSE var5=FALSE var6=143623G72K varC=4 varC1=034OVC varC2=074OVC varC3=134SRT varC4=145PRT var7=13/00

var1=RON var2=KKND var3=2234Z var4=FALSE var5=TRUE var6=342523G varC=0  var7=01/M12

So, the fourth argument might be var4, var5, or var6. The fifth argument might be var5, var6, or match another criterion. The sixth argument may or may not be var6. What lies between var6 and var7 can be determined by matching each argument against `*/*`, which marks var7.

Boiling this down even more: the positions of var1, var2 and var3 in the input are fixed, but after that I need to compare, order, and assign. In addition, the arguments themselves can vary in character length. The relative position of each section to be divided is fixed in relation to its neighbors. For example, var7 will never come before var6 in the input, and if var4 and var5 are both true, then the 4th and 5th arguments will always be 'AUTO CON'. Some segments will always be one argument, and others more than one. The relative position of each is known. As for the patterns, some have a specific character in a specific location, and others have nothing to identify them aside from their position in the sequence.

So I need awk to recognize a pointer variable, as every argument needs to be checked until a specific match is found.

#Check to see if var4 or var5 exists. if so, flag and increment pointer
pointer=4
if (awk '{print $$pointer}' file.txt) == "AUTO" ; then
   var4="TRUE"
   pointer=$pointer+1
else
   var4="FALSE"
fi
if (awk '{print $$pointer}' file.txt) == "CON" ; then
   var5="TRUE"
   pointer=$pointer+1
else
   var5="FALSE"
fi

#position of var6 is fixed once var4 and var5 are determined
var6=$(awk '{print $$pointer}' file.txt)
pointer=$pointer+1

#Count the arguments between var6 and var7 (there may be up to ten)
#and separate each to decode later. varC[0-9] is always three upcase 
# letters followed by three numbers. Use this counter later when decoding.
varC=0

until (awk '{print $$pointer}' file.txt) == "*/*" ; do

   varC($varC+1)=(awk '{print $$pointer}' file.txt)
   varC=$varC+1
   pointer=$pointer+1
done
#position of var7 is fixed after all arguments of varC are handled
var7=$(awk '{print $$pointer}' file.txt)
pointer=$pointer+1

I know the above syntax is incorrect. The question is how do I fix it.

var7 is not always at the end of the input line. Arguments after var7 however do not need to be processed.

I haven't gotten to actually interpreting the patterns yet. I intend to handle that using case statements that compare the variables against regular expressions. I don't want to use awk to interpret the patterns directly, as that would get very messy. I have contemplated using `for n in $string`, but that would mean comparing every argument to every possible combination directly (and there are multiple segments, each with multiple patterns), and as such is impractical. I'm trying to make this a two-step process.
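The pointer walk described above can be sketched in plain bash, without awk. This is a sketch under assumptions, not a definitive implementation: `parse_line` and the `varC` array are illustrative names, the variables are set as globals, and the varC pattern follows the sample data (three digits followed by three uppercase letters) instead of the wording in the code comment.

```bash
#!/bin/bash
# Sketch only: parse_line and the varC array are hypothetical names.
parse_line() {
    local -a f
    read -r -a f <<< "$1"           # split the line on whitespace (0-based array)
    local p=3                       # "pointer": index of the 4th argument

    var1=${f[0]:-} var2=${f[1]:-} var3=${f[2]:-}
    var4=FALSE var5=FALSE
    # optional flags: consume each one only if it is present
    if [[ ${f[p]:-} == AUTO ]]; then var4=TRUE; p=$((p + 1)); fi
    if [[ ${f[p]:-} == CON ]]; then var5=TRUE; p=$((p + 1)); fi

    var6=${f[p]:-}; p=$((p + 1))

    # collect varC entries until the first argument containing a slash (var7);
    # arguments that do not look like a varC entry are skipped
    varC=()
    while (( p < ${#f[@]} )) && [[ ${f[p]} != */* ]]; do
        if [[ ${f[p]} =~ ^[0-9]{3}[A-Z]{3}$ ]]; then
            varC+=("${f[p]}")
        fi
        p=$((p + 1))
    done
    var7=${f[p]:-}                  # empty if no */* argument was found
}

parse_line "RON KKND 2234Z CON 342523G CLS 01/M12 RMK"
printf 'var4=%s var5=%s var6=%s varC=%d var7=%s\n' \
    "$var4" "$var5" "$var6" "${#varC[@]}" "$var7"
# prints: var4=FALSE var5=TRUE var6=342523G varC=0 var7=01/M12
```

Keeping the whole line in an array means the pointer is an ordinary integer index, and anything after var7 is simply never visited.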

Well tried, but sorry, it is not that clear. Please add sample input and the expected output more clearly, along with the logic for how you need the output. — RavinderSingh13, Sep 20 '19 at 02:22
To my understanding, a list of patterns should be applied to every column of every row, and matching priority matters. The sequence of input rows and the content of the columns may change a lot. Please update with sample patterns to match the given input. **Sidenote: it is doable with awk, but quite heavy work.** — James Li, Sep 20 '19 at 03:19
Read this on how to use variable in `awk` https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script This `awk '{print $$pointer}' file.txt` is wrong and will not work. — Jotne, Sep 20 '19 at 04:33
I attempted the above, but each time the commands treat the entire input as the full line, not break it apart like I am attempting to do. I know ```awk '{print $$pointer}' file.txt``` is incorrect. That's the entire question here, how do I fix it. — Richard Engler, Sep 20 '19 at 11:24
Related but not answering your question: is JSON acceptable as an alternative solution? Plain text data is not programming friendly. It is messy to write a parser now, and it can be a disaster to write another parser after a few months or a few years. — James Li, Sep 20 '19 at 15:32
I don't have the option to use another language. I only have BASH to work with. — Richard Engler, Sep 20 '19 at 17:57
I feel like this might be a bit of an [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Can you perhaps explain what you're trying to achieve here? Bigger picture? I'd love to find a simpler way to do what you need without the messy pattern matching. — ghoti, Sep 21 '19 at 05:51
Use `mapfile -t array file.txt` to read `file.txt` into `array`. It is better to avoid the use of `file` as a variable name as there is a Linux utility named `file`. — David C. Rankin, Sep 21 '19 at 06:50
Not quite an XY problem. The question is still there. Given a single line delimited with arguments of variable length, how to filter it out. I just posted things I have tried so far. From what I have seen so far... there is no simple way to accomplish this. — Richard Engler, Sep 22 '19 at 16:05
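On the point raised in the comments about `awk '{print $$pointer}'`: inside single quotes the shell never expands `$pointer`, so awk has to be handed the value explicitly, for example with `-v`. A minimal sketch, assuming a hypothetical helper `field_at` and a hard-coded sample line instead of file.txt:

```bash
#!/bin/bash
# field_at is a hypothetical helper: print field number $2 of the line in $1.
field_at() {
    # -v copies the shell value into the awk variable p; $p is then "field p"
    awk -v p="$2" '{ print $p }' <<< "$1"
}

line="RON KKND 1534Z AUTO 253985G 034SRT 134OVC 04/32"
pointer=4

if [[ $(field_at "$line" "$pointer") == AUTO ]]; then
    var4=TRUE
    pointer=$((pointer + 1))        # arithmetic needs $(( )), not $pointer+1
else
    var4=FALSE
fi

printf 'var4=%s next=%s\n' "$var4" "$(field_at "$line" "$pointer")"
# prints: var4=TRUE next=253985G
```

This works, but it starts one awk process per lookup, which is one reason to prefer splitting the line into a bash array once.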

James Li · Answer 1 · 2019-09-22T14:00:25.427

Updated: this code shows how to decide each variable's value by pattern matching, applied multiple times. There is one code block in pure bash and another in gawk.

The bash block uses indexed arrays and `declare -g`, so it needs bash 4.2 or later, which very early versions do not provide. grep is also required for the pattern matching. It was tested with GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu) and grep (GNU grep) 2.20. I stick to printf rather than echo after learning why printf is better than echo. When using bash I consider it good practice to be more defensive.

#!/bin/bash
declare -ga outVars
declare -ga lineBuf
declare -g NF
#force valid index starts from 1
#consistent with var* name pattern
outVars=(unused var1 var2 var3 var4 var5 var6 varC var7)
((numVars=${#outVars[@]} - 1))
declare -gr numVars
declare -r outVars

function e_unused {
    return
}
function e_var1 {
    printf "%s"  "${lineBuf[1]}"
}
function e_var2 {
    printf "%s"  "${lineBuf[2]}"
}
function e_var3 {
    printf "%s"  "${lineBuf[3]}"
}

function e_var4 {
    if [ "${lineBuf[4]}" == "AUTO" ] ;
    then
        printf "TRUE"
    else
        printf "FALSE"
    fi
}
function e_var5 {
    if [ "${lineBuf[4]}" == "CON" ] || [ "${lineBuf[4]} ${lineBuf[5]}" == "AUTO CON" ] ;
    then
        printf "TRUE"
    else
        printf "FALSE"
    fi
}
function e_varC {
    local var6_idx=4
    # skip the optional AUTO and CON flags to reach var6
    [ "${lineBuf[$var6_idx]}" == "AUTO" ] && ((var6_idx++))
    [ "${lineBuf[$var6_idx]}" == "CON" ] && ((var6_idx++))

    local var7_idx=$NF
    local i
    local count=0
    for ((i=NF;i>=1;i--));
    do
        if [ "$(grep -cE '^.*/.*$' <<<"${lineBuf[$i]}")" -eq 1 ];
            then
            var7_idx=$i
            break
        fi
    done
    ((varC = var7_idx - var6_idx - 1))
    if [ $varC -eq 0 ];
        then
        printf 0
        return;
    fi
    local cFamily=""
    local append
    for ((i=var6_idx;i<=var7_idx;i++));
    do
        if [ "$(grep -cE '^[0-9]{3}[A-Z]{3}$' <<<"${lineBuf[$i]}")" -eq 1 ];
            then
            ((count++))
            cFamily="$cFamily varC$count=${lineBuf[$i]}"
        fi
    done
    printf "%s %s"  $count "$cFamily"
}

function e_var6 {
    # var6 follows the optional AUTO and CON flags, which start at field 4
    local idx=4
    [ "${lineBuf[$idx]}" == "AUTO" ] && ((idx++))
    [ "${lineBuf[$idx]}" == "CON" ] && ((idx++))
    printf "%s"  "${lineBuf[$idx]}"
}
function e_var7 {
    local i
    for ((i=NF;i>=1;i--));
    do
        if [ "$(grep -cE '^.*/.*$' <<<"${lineBuf[$i]}")" -eq 1 ];
            then
            printf "%s"  "${lineBuf[$i]}"
            return
        fi
    done
}

while read -r -a lineBuf ;
    do
    NF=${#lineBuf[@]}
    # prepend a dummy element so that field numbers start at 1
    lineBuf=(unused "${lineBuf[@]}")
    for ((i=1; i<=numVars; i++));
        do
        printf "%s="  "${outVars[$i]}"
        (e_${outVars[$i]})
        printf " "
    done
    printf "\n"

done <file.txt

The gawk-specific extension for indirect function calls is used in the awk code below. The code assigns a function name to every desired output variable, so a different pattern or other transformation can be applied inside each variable's own function. This avoids long if-else chains and is easier to read and extend.

For the special varC family, the function pick_varC plays a trick: once varC is determined, its value consists of multiple output fields. If varC=2, the value of varC is returned as `2 varC1=034SRT varC2=134OVC`, that is, the actual count followed by all of its members.

gawk '
    BEGIN {
        keys["var1"] = "pick_var1";
        keys["var2"] = "pick_var2";
        keys["var3"] = "pick_var3";
        keys["var4"] = "pick_var4";
        keys["var5"] = "pick_var5";
        keys["var6"] = "pick_var6";
        keys["varC"] = "pick_varC";
        keys["var7"] = "pick_var7";
    }

    function pick_var1 () {
        return $1;
    }
    function pick_var2 () {
        return $2;
    }
    function pick_var3 () {
        return $3;
    }

    function pick_var4 () {
        for (i=1;i<=NF;i++) {
            if ($i == "AUTO") {
                return "TRUE";
            }
        }
        return "FALSE";
    }

    function pick_var5 () {
        for (i=1;i<=NF;i++) {
            if ($i == "CON") {
                return "TRUE";
            }
        }
        return "FALSE";
    }

    function pick_varC () {
        # var6 follows the optional AUTO and CON flags, which start at field 4
        var6_idx = 4;
        if ($var6_idx == "AUTO") var6_idx++;
        if ($var6_idx == "CON") var6_idx++;
        var7_idx = NF;
        for (i=1;i<=NF;i++) {
            if ($i~/.*\/.*/) {
                var7_idx = i;
            }
        }
        varC = var7_idx - var6_idx - 1;
        if ( varC == 0) {
            return varC;
        }
        count = 0;
        cFamily = "";
        for (i = 1; i<=varC;i++) {
            if ($(var6_idx+i)~/^[0-9]{3}[A-Z]{3}$/) {
                # number the members by how many matched, not by field offset
                count++;
                cFamily = sprintf("%s varC%d=%s",cFamily,count,$(var6_idx+i));
            }
        }
        varC = sprintf("%d %s",count,cFamily);
        return varC;
    }

    function pick_var6 (    idx) {
        # var6 follows the optional AUTO and CON flags, which start at field 4
        idx = 4;
        if ($idx == "AUTO") idx++;
        if ($idx == "CON") idx++;
        return $idx;
    }

    function pick_var7 () {
        for (i=1;i<=NF;i++) {
            if ($i~/.*\/.*/) {
                return $i;
            }
        }
    }

    {
        for (k in keys) {
            pickFunc = keys[k];
            printf("%s=%s ",k,@pickFunc());
        }
        printf("\n");
    }
    ' file.txt

test input

RON KKND 1534Z AUTO 253985G 034SRT 134OVC 04/32
RON KKND 5256Z 143623G72K 034OVC 074OVC 134SRT 145PRT 13/00
RON KKND 2234Z CON 342523G CLS 01/M12 RMK

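script output (field order as printed by the gawk version; `for (k in keys)` does not guarantee any particular order)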

var1=RON var2=KKND var3=1534Z var4=TRUE var5=FALSE varC=2  varC1=034SRT varC2=134OVC var6=253985G var7=04/32
var1=RON var2=KKND var3=5256Z var4=FALSE var5=FALSE varC=4  varC1=034OVC varC2=074OVC varC3=134SRT varC4=145PRT var6=143623G72K var7=13/00
var1=RON var2=KKND var3=2234Z var4=FALSE var5=TRUE varC=0  var6=342523G var7=01/M12

I didn't put the patterns(rules) as there are dozens of them that need to be checked, and each different depending on the variables. Instead of checking everyone against dozens of variants I'm trying to break it all apart so I can check each one only against the relevant variants to that variable. The above won't handle well particular patterns that should be skipped entirely, and others that may have multiple entries to be checked with the same rules (varC). The above is also simplified and there are even more variables than the above I need to deal with in the actual work. — Richard Engler, Sep 20 '19 at 11:32
Sorry, I did not expect you to have nested variables inside one line. I will reconsider the code structure to handle the varC family, but I did not get the case of ```won't handle well particular patterns that should be skipped entirely```. — James Li, Sep 20 '19 at 13:27
Handling all of the input as nested variables seems like the straightest method with the least overhead. Sometimes at specific points of the code there will be patterns which are to be skipped, and those I can ferret out using a method similar to the one for uncovering var7. That is another reason I'm trying to use a pointer for operations. I also updated the above to include a section where varC is unused. We can also use ```input=$(cat file.txt)``` if managing an internal variable is easier than using awk. — Richard Engler, Sep 20 '19 at 13:39
```three letters followed by three numbers. ``` is not fully consistent with the test data. I assumed the test data is sufficient and that the regex is looking for ```three UPCASE letters followed by three numbers```. — James Li, Sep 20 '19 at 15:24
All letters are upcase. The only problem I see is that gawk isn't enabled, and it would be better to manage somehow without having the user need to install it. — Richard Engler, Sep 20 '19 at 17:55
Please state this **vital requirement** about the interpreters that the user has or is allowed to use. On many popular Linux installations awk is installed as an alias to gawk, so I stated that gawk is used and did not expect it to be unavailable. — James Li, Sep 21 '19 at 00:17
sh/bash is a hard fit for such heavy nested data processing. It is doable, but it will easily fall into spaghetti code, a disaster for maintainers. — James Li, Sep 21 '19 at 00:25
Note that while `grep -P` may be available in Linux, it is not in BSD or macOS/darwin, where this functionality has been removed. Though, of course, `'^[0-9]{3}[A-Z]{3}$'` is perfectly fine as ERE (`egrep`). — ghoti, Sep 21 '19 at 05:55
@ghoti thanks, I know little about BSD and macOS. Switched to the -E option. — James Li, Sep 21 '19 at 06:32
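The per-argument checks can also be done with bash's built-in pattern matching, so no grep process is needed for each argument. A minimal sketch using a `case` statement, as the question intends for the decoding step; `classify_token` and its category names are hypothetical, and the patterns mirror only the sample data:

```bash
#!/bin/bash
# classify_token is a hypothetical helper; the categories mirror the sample data.
classify_token() {
    case $1 in
        AUTO|CON)
            printf 'flag' ;;
        # three digits followed by three uppercase letters, whole token
        [0-9][0-9][0-9][[:upper:]][[:upper:]][[:upper:]])
            printf 'varC' ;;
        */*)
            printf 'var7' ;;
        *)
            printf 'other' ;;
    esac
}

for tok in AUTO 034SRT 04/32 CLS; do
    printf '%s -> %s\n' "$tok" "$(classify_token "$tok")"
done
```

`case` patterns are shell globs and must match the whole argument, so no `^` or `$` anchors are involved.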

score 0 · Answer 2 · answered Sep 20 '19 at 05:51

Please try the following:

#!/bin/bash

# template for variable names
declare -a namelist1=( "var1" "var2" "var3" "var4" "var5" "var6" "varC" )
declare -a ary

# read each line and assign ary to the elements
while read -r -a ary; do
    # keep every remaining element, so any number of varC entries is allowed
    if [[ ${ary[3]} = AUTO ]]; then
        ary=( "${ary[@]:0:3}" "TRUE" "FALSE" "${ary[4]}" "" "${ary[@]:5}" )
    elif [[ ${ary[3]} = CON ]]; then
        ary=( "${ary[@]:0:3}" "FALSE" "TRUE" "${ary[4]}" "" "${ary[@]:5}" )
    else
        ary=( "${ary[@]:0:3}" "FALSE" "FALSE" "${ary[3]}" "" "${ary[@]:4}" )
    fi
    # locate the index of the */* entry in ary, store the varC count,
    # and adjust the variable names
    for (( i=0; i<${#ary[@]}; i++ )); do
        if [[ ${ary[$i]} == */* ]]; then
            ary[6]=$(( i - 7 ))        # number of entries between var6 and var7
            declare -a namelist=( "${namelist1[@]}" )
            for (( j=1; j<=i-7; j++ )); do
                namelist+=( "$(printf "varC%d" "$j")" )
            done
            namelist+=( "var7" )
            break
        fi
    done

    # assign variables to array elements
    for (( i=0; i<${#namelist[@]}; i++ )); do      # stop at var7; later elements have no name
#       echo -n "${namelist[$i]}=${ary[$i]} "       # for debugging
        declare -n p="${namelist[$i]}"
        p="${ary[$i]}"
    done
#   echo "var1=$var1 var2=$var2 var3=$var3 ..."     # for debugging
done < file.txt

Note that the script above just assigns bash variables and does not print anything unless you explicitly echo or printf the variables.
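The `declare -n` nameref used above needs bash 4.3 or later. Where that is not available, `printf -v` can assign to a variable whose name is held in another variable. A minimal sketch, with hypothetical `names` and `values` arrays standing in for `namelist` and `ary`:

```bash
#!/bin/bash
# names and values are hypothetical stand-ins for namelist and ary.
names=(var1 var2 var3)
values=(RON KKND 1534Z)

for i in "${!names[@]}"; do
    # printf -v NAME stores the formatted result in the variable called NAME
    printf -v "${names[$i]}" '%s' "${values[$i]}"
done

printf 'var1=%s var2=%s var3=%s\n' "$var1" "$var2" "$var3"
# prints: var1=RON var2=KKND var3=1534Z
```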
