AWK - where to search within AWK program?

Question

This question is very specific to the structure of an AWK program (with function) provided in response to Print columns using string variable in AWK.

Those responding to my initial question have helped me partially understand the solution provided. I attempted to write the provided program as a single line as I thought it would help me understand more, but ended up in a complete mess. I have now taken the solution and added a new function in my bash script.

create_selected() {
    echo ............creating selected ..............
    awk -F "," -v cols=$Columns '
        BEGIN {
            n=split(cols,Fields,/,/)
        }
        function _get_cols(i,s){
            for(i=1;i<=n;i++) s = length(s) ? s OFS $(Fields[i]) : $(Fields[i])
            return s
        }
        {
            print _get_cols()
        }' myfile.csv
}

The above works well, but only partly achieves my objective. I need to return only the columns (as specified by $Columns) from those lines in myfile.csv that contain a given string. The string is captured elsewhere in the script as $Searchfor.

I believe I must 'pass' this string to awk, and then /Search/ each line. My attempts have failed. This may be due to my lack of understanding of the awk 'BEGIN{} {BODY} END{}' blocks, or perhaps my understanding of how the above solution works. Perhaps it simply doesn't allow me to search as well as use a string to define the columns (?)

I have tried several variations, even moving the awk function before BEGIN, which I have seen in examples on the web. My initial attempt is below; I thought it was the simplest, but it was my first failure. Can I use a search in this solution?

create_selected ()
{
echo ............creating selected ..............
awk -F "," -v searched=$Searchfor -v cols=$Columns 'BEGIN{
       n=split(cols,Fields,/,/)
}
function _get_cols(i,s){
       for(i=1;i<=n;i++) s = length(s) ? s OFS $(Fields[i]) : $(Fields[i])
       return s
}
{
    /searched/ print _get_cols()
}' myfile.csv
}

Result

............creating selected ..............
awk: cmd. line:9:     /searched/ print _get_cols() 
awk: cmd. line:9:                 ^ syntax error

Inputs

echo $Columns
1 3 6
echo $Searchfor
dir1

cat myfile.csv
/data/Files/dir1/record_2023-01-11-15-20-00.csv.gz:2023-01-11 15:18:07.634,2023-01-11 15:17:03.683,2023-01-11 15:17:03.763,40,0,5253763,10.106.144.2,34334,157.240.221.34,443,6,281,1,59,1,0,0,0,0
/data/Files/dir2/record_2023-01-11-15-20-00.csv.gz:2023-01-11 15:18:07.634,2023-01-11 15:17:03.683,2023-01-11 15:17:03.763,40,0,5253763,10.106.144.2,34334,157.240.221.34,443,6,281,1,59,1,0,0,0,0
/data/Files/dir3/record_2023-01-11-15-20-00.csv.gz:2023-01-11 15:18:07.634,2023-01-11 15:17:03.683,2023-01-11 15:17:03.763,40,0,5253763,10.106.144.2,34334,157.240.221.34,443,6,281,1,59,1,0,0,0,0

Required Output

/data/Files/dir1/record_2023-01-11-15-20-00.csv.gz:2023-01-11 15:18:07.634 2023-01-11 15:17:03.763 5253763

I posted an answer to your specific question here about the syntax error, but if you'd like help coming up with the best solution for whatever it is you're trying to do then post a new question with a [mcve] that contains concise, testable sample input and expected output. There's only so much we can do to help you without input/output that demonstrates your needs and that we can copy/paste to test with. Not my downvote btw. – Ed Morton, Jan 11 '23 at 15:49
Appreciate what you are asking, however the files being searched are huge log files with over 200 columns, with content from no characters to many (long URLs). TBH I am also editing variable names to be generic for illustration, and I keep failing to align my edits. I will try to add some generic information, as I believe it is the structure more than the input/output. NB one of my attempts was to add more curly braces. I will revisit what you have highlighted! – Chrizk, Jan 11 '23 at 15:52
None of that matters, and all of it is common. You just need to come up with a [mcve] for us to best help you. Make it 4-5 lines of 4-5 columns of truly representative values - we don't need to see huge files of huge numbers of columns to understand a problem. Wrt `NB one of my attempts was to add more curly braces` - that and moving the function definition around sounds like you're thrashing, just trying things without understanding the structure of an awk program. Get the book Effective AWK Programming, 5th Edition, by Arnold Robbins and read the first few pages to learn the fundamentals. – Ed Morton, Jan 11 '23 at 15:59
I really can't argue with you. I have struggled to understand the only solution I found that uses $Columns to define the columns needed (the initial/linked question). I have attempted to understand it, but have had to accept it, which has led me to 'work with it' regardless of full understanding. I think this is all to do with the function returning the required fields to print, and more so with where the print is located in the awk program structure. – Chrizk, Jan 11 '23 at 16:24

Ed Morton · Accepted Answer · 2023-01-11T17:15:48.353


The current syntax error is because you wrote `{ /searched/ print _get_cols() }` instead of `/searched/ { print _get_cols() }` or `{ if (/searched/) print _get_cols() }`. What I think you meant to do instead, though, was `$0 ~ searched { print _get_cols() }` or `index($0,searched) { print _get_cols() }` or similar.
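
As a minimal sketch of the working forms listed above: the sample data, the function names, and printing only `$1` (instead of calling `_get_cols()`) are all made up here for illustration. Each function should print the first field of the one line that contains `dir1`.

```shell
#!/bin/sh
# Hypothetical sample data: two comma-separated fields per line.
sample='a,/data/dir1/x
b,/data/dir2/x
c,/data/dir3/x'
searched='dir1'

# 1) pattern { action }: the condition sits OUTSIDE the braces.
#    $0 ~ searched uses the contents of the variable as a dynamic regexp.
form_pattern() {
    printf '%s\n' "$sample" |
    awk -F',' -v searched="$searched" '$0 ~ searched { print $1 }'
}

# 2) The same test written as an if statement INSIDE an action block.
form_if() {
    printf '%s\n' "$sample" |
    awk -F',' -v searched="$searched" '{ if ($0 ~ searched) print $1 }'
}

# 3) index() looks for a literal substring, so regexp metacharacters
#    in the search string are not interpreted.
form_index() {
    printf '%s\n' "$sample" |
    awk -F',' -v searched="$searched" 'index($0, searched) { print $1 }'
}

form_pattern   # prints: a
form_if        # prints: a
form_index     # prints: a
```

What does not work is putting a bare pattern inside the braces and following it with a statement, as in the failing attempt; inside an action block a condition has to be part of a statement such as `if`.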

Given your newly posted sample input/output, here's how I'd do it, using any awk:

$ cat tst.sh
#!/usr/bin/env bash

create_selected() {
    local inFldNrs="$1" tgtDir="$2" file="$3"

    echo '............creating selected ..............' >&2

    awk -v inFldNrs="$inFldNrs" -v tgtDir="$tgtDir" '
        BEGIN {
            numOutFlds = split(inFldNrs,out2in)
            FS = ","
        }

        function get_vals(      outFldNr,inFldNr,vals) {
            for ( outFldNr=1; outFldNr<=numOutFlds; outFldNr++ ) {
                inFldNr = out2in[outFldNr]
                vals = (outFldNr == 1 ? "" : vals OFS) $inFldNr
            }
            return vals
        }

        {
            n = split($1,path,"/")
            curDir = path[n-1]
        }
        curDir == tgtDir {
            print get_vals()
        }
    ' "$file"
}

Columns='1 3 6'
Searchfor='dir1'
Infile='myfile.csv'

create_selected "$Columns" "$Searchfor" "$Infile"

$ ./tst.sh
............creating selected ..............
/data/Files/dir1/record_2023-01-11-15-20-00.csv.gz:2023-01-11 15:18:07.634 2023-01-11 15:17:03.763 5253763

edited Jan 11 '23 at 17:15

answered Jan 11 '23 at 15:46

Ed Morton


Not sure why Comments are not meant to be used for thanks, but wow! Really appreciate all the effort. Your examples are really clear (with helpful variable names). I get the general idea of what you have done, but will take some time to work through the detail. What I thought was a simple exercise turned into an incredible journey. – Chrizk Jan 12 '23 at 08:11
Given the question, if I have understood your solution, you end up comparing $Searchfor with the position in the path (split by "/"). You are obviously very knowledgeable, and have chosen to do this with good reason. This seems to be part of my lack of understanding... why wouldn't you simply search for an occurrence of tgtDir? Is it that you have provided an absolutely correct solution for the stated requirement, or that awk must have a logical operation instead of a search (as I badly attempted)? NB this is only for my understanding of AWK, I would never question your knowledge. – Chrizk Jan 12 '23 at 08:35
@Chrizk I wouldn't search for `tgtDir` just anywhere in the line as it might occur in some location on the line where you don't expect/want it to match and/or it might undesirably occur as a substring of some other string. I'm also using a string rather than regexp comparison so it won't falsely match or fail to match if the `tgtDir` name contains regexp metachars. It's usually easy to match what/where you expect to match but much harder to not match similar strings that you don't want to match. – Ed Morton Jan 12 '23 at 13:44
Regarding `Is it that you have provided an absolute correct solution for the stated requirement, or that awk must have a logical operation instead of a search` - it's the former. There is no general "search" though, in awk you just write whatever comparison you want - for example `/foo/` is a regexp comparison of `foo` against `$0`, it's just shorthand for `$0 ~ /foo/` (nothing special), while `$0 == "foo"` is a string comparison of `foo` against `$0`. See https://stackoverflow.com/q/65621325/1745001 for examples of different comparisons. – Ed Morton Jan 12 '23 at 13:48
Feel free to ask questions if you have any after looking at man pages and thinking about it a bit. – Ed Morton Jan 12 '23 at 13:53
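
To illustrate the two pitfalls described in the comments above (an unwanted substring match, and regexp metacharacters in the target), here is a small sketch. The three sample lines, the directory names `dir10` and `dirX1`, and the function names are hypothetical, chosen only to trigger the false matches.

```shell
#!/bin/sh
# Hypothetical sample lines: field 1 is a path, field 2 is a line id.
lines() {
    printf '%s\n' \
        '/data/Files/dir1/rec.csv.gz:x,1' \
        '/data/Files/dir10/rec.csv.gz:x,2' \
        '/data/Files/dirX1/rec.csv.gz:x,3'
}

# Regexp search anywhere in the line: the variable is used as a dynamic regexp.
match_regexp() {
    lines | awk -F',' -v tgt="$1" '$0 ~ tgt { print $2 }'
}

# String comparison against the directory that contains the file:
# split the path on "/" and compare the second-to-last element exactly.
match_dir() {
    lines | awk -F',' -v tgt="$1" '
        { n = split($1, path, "/") }
        path[n-1] == tgt { print $2 }
    '
}

match_regexp 'dir1'    # prints 1 and 2: "dir1" is also a substring of "dir10"
match_dir    'dir1'    # prints 1 only
match_regexp 'dir.1'   # prints 3: "." matches any character, so "dirX1" matches
match_dir    'dir.1'   # prints nothing: no directory is literally named "dir.1"
```

The string comparison at a known position is stricter in both cases, which is the trade-off the comment describes: matching what you want is easy, not matching similar strings takes more care.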

Your explanations are very clear and helpful. Out of curiosity, I replaced `curDir == tgtDir` with `/tgtDir/`, but it didn't work as a comparison (no result, no error). However, replacing with `$0 ~ tgtDir` did (being what you surmised I was trying to do before working through my whole solution!). Anyway, I have taken up much of your time, and very much appreciate all your advice, experience and wisdom. – Chrizk Jan 12 '23 at 14:37
`/tgtDir/` is just shorthand for `$0 ~ /tgtDir/` and means `compare $0 to the literal regexp tgtDir` while `$0 ~ tgtDir` means `compare $0 to the regexp contained in the variable named tgtDir` which is a very different thing. See https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script – Ed Morton Jan 12 '23 at 14:44
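
The difference between the literal regexp `/tgtDir/` and the dynamic regexp `$0 ~ tgtDir` can be seen in a short sketch. The two input lines and the function names are invented for illustration; each function prints the number (`NR`) of the matching line.

```shell
#!/bin/sh
# Hypothetical input: line 1 contains the VALUE of the variable (dir1),
# line 2 contains the NAME of the variable as plain text.
input() {
    printf '%s\n' 'path/dir1/file' 'the word tgtDir appears here'
}

# /tgtDir/ is a regexp constant: it looks for the six letters "tgtDir"
# and never consults the awk variable of the same name.
literal_regexp() {
    input | awk -v tgtDir='dir1' '/tgtDir/ { print NR }'
}

# $0 ~ tgtDir uses the CONTENTS of the variable (dir1) as the regexp.
dynamic_regexp() {
    input | awk -v tgtDir='dir1' '$0 ~ tgtDir { print NR }'
}

literal_regexp   # prints: 2
dynamic_regexp   # prints: 1
```

This is consistent with the "no result, no error" behaviour reported above: `/tgtDir/` is a valid program, it simply finds no line containing the text `tgtDir` in the real data.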
