Trying to read from specific fields of a CSV file

Question

The code provided reads a CSV file and prints the count of every string found, in descending order. However, I would like to know how to specify which fields are read and counted. For example, ./example-awk.awk 1,2 file.csv would read strings from fields 1 and 2 and print the counts.

#!/bin/awk -f

BEGIN {
    FIELDS = ARGV[1];
    delete ARGV[1];
    FS = ", *"
}

{
    for(i = 1; i <= NF; i++)
        if(FNR != 1)
            data[++data_index] = $i
}

END {
    produce_numbers(data)

    PROCINFO["sorted_in"] = "@val_num_desc"

    for(i in freq)
        printf "%s\t%d\n", i, freq[i]
}

function produce_numbers(sortedarray)
{
    n = asort(sortedarray)

    for(i = 1 ; i <= n; i++)
    {
        freq[sortedarray[i]]++
    }
    return
}

This is the code I am currently working with; ARGV[1] will, of course, hold the specified fields. I am unsure how to store this value and then use it.

For example, ./example-awk.awk 1,2 simple.csv, with simple.csv containing

A,B,C,A
B,D,C,A
C,D,A,B
D,C,A,A

Should result in

Because it only counts strings in fields 1 and 2

Can you not use the -v flag and so ./example-awk.awk -v arg1=1 -v arg2=2 simple.csv. Then use the variables arg1 and arg2 in the actual script? — Raman Sailopal, Oct 21 '20 at 15:10
Unfortunately no, this is for an assignment where the format is specified to be this... If it were not specified in such a way, I would probably be having an easier time, to say the least. I am not sure how I would read in the command line argument and split it into usable values even in another language. Also, simple.csv has its contents towards the end of the question @RavinderSingh13 — Just Another Coder, Oct 21 '20 at 15:13
Never use a shebang to call awk - see https://stackoverflow.com/a/61002754/1745001 and https://unix.stackexchange.com/a/563456/133219 for some reasons why. — Ed Morton, Oct 21 '20 at 17:17

RavinderSingh13 · Accepted Answer · 2020-10-21T17:13:31.413

4

EDIT (as per OP's request): The OP needs a solution that uses ARGV, so I am adding one here. (NOTE: cat script.awk is shown only to display the contents of the awk script.)

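cat script.awk
BEGIN{
  FS=","
  OFS="\t"
  for(i=1;i<(ARGC-1);i++){   # every argument except the last one (the input file)
     arr[ARGV[i]]            # remember the field number
     delete ARGV[i]          # remove it so awk does not treat it as a file name
  }
}
{
  for(i in arr){ value[$i]++ }   # count the value found in each requested field
}
END{
  PROCINFO["sorted_in"] = "@ind_str_desc"   # gawk only: traverse indices in descending string order
  for(j in value){
     print j,value[j]
  }
}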

Now when we run it as follows:

awk -f script.awk 1 2 Input_file
D       3
C       2
B       2
A       1
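The question passes the fields as a single comma-separated argument (1,2) rather than as separate arguments. The following is only a minimal sketch of that variant, assuming the field list is always the first argument and every later argument is an input file. The names count_fields, list, want and count and the temporary file are made up for illustration. It blanks ARGV[1] instead of deleting it, because POSIX specifies that awk skips an empty ARGV element; in gawk, delete ARGV[1] as used above has the same effect.

```shell
# Hypothetical helper: count_fields FIELDS FILE...
count_fields() {
    awk '
    BEGIN {
        FS = ","
        OFS = "\t"
        n = split(ARGV[1], list, ",")   # "1,2" -> list[1]="1", list[2]="2"
        for (i = 1; i <= n; i++)
            want[list[i]]               # remember each requested field number
        ARGV[1] = ""                    # blank it so awk does not open "1,2" as a file
    }
    {
        for (f in want)
            count[$f]++                 # tally the value found in each requested field
    }
    END {
        for (v in count)
            print v, count[v]
    }' "$@" | sort -r
}

# Sample data from the question, written to a temporary file.
csv=$(mktemp)
printf '%s\n' 'A,B,C,A' 'B,D,C,A' 'C,D,A,B' 'D,C,A,A' > "$csv"

count_fields 1,2 "$csv"
```

With the sample data this prints the same four lines as the run above.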

My original solution: Please try the following, written and tested with the shown samples. It is a generic solution in which the awk program takes a variable named fields, where you list the field numbers you want to process, separated by commas.

awk -v fields="1,2" '
BEGIN{
  FS=","
  OFS="\t"
  num=split(fields,arr,",")
  for(i=1;i<=num;i++){
    key[arr[i]]
  }
}
{
  for(i in key){
    value[$i]++
  }
}
END{
  for(i in value){
    print i,value[i]
  }
}' Input_file | sort -rk1

Output will be as follows.

edited Oct 21 '20 at 17:13

answered Oct 21 '20 at 15:20

RavinderSingh13


As I said in the comment of the post, I unfortunately can not format it like this because it is specified to use ARGV[1] in order to catch the fields with no -v option needed. However, the use of split could be very useful so I will attempt a version with no -v but using the split functionality – Just Another Coder Oct 21 '20 at 15:25
@JustAnotherCoder, IMHO there is no need to use ARGV etc. when we have the `-v` option available. Any specific reason for not using it? (I will try to add an ARGV version too if possible.) – RavinderSingh13 Oct 21 '20 at 15:27
@JustAnotherCoder, ok please check my EDIT solution and let me know then. – RavinderSingh13 Oct 21 '20 at 15:34
Piecing a couple of parts of the provided code together, I updated my solution, which now works as intended. Thank you for your time, I truly appreciate it @RavinderSingh13 – Just Another Coder Oct 21 '20 at 15:52

score 4 · Answer 2 · answered Oct 21 '20 at 15:38

Don't use a shebang to invoke awk in a shell script as that robs you of the ability to use the shell and awk separately for what they both do best. Use the shebang to invoke your shell and then call awk within the script. You also don't need to use gawk-only sorting functions for this:

$ cat tst.sh
#!/usr/bin/env bash

(( $# == 2 )) || { echo "bad args: $0 $*" >&2; exit 1; }

cols=$1
shift

awk -v cols="$cols" '
BEGIN {
    FS = ","
    OFS = "\t"
    split(cols,tmp)
    for (i in tmp) {
        fldNrs[tmp[i]]
    }
}
{
    for (fldNr in fldNrs) {
        val = $fldNr
        cnt[val]++
    }
}
END {
    for (val in cnt) {
        print val, cnt[val]
    }
}
' "${@:--}" |
sort -r

$ ./tst.sh 1,2 file
D       3
C       2
B       2
A       1
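The question asks for counts in descending order, whereas sort -r orders the lines by their text, so the value decides the order. If ordering by the count itself is wanted, one possible sketch sorts numerically on the second tab-separated field and breaks ties by value, using only POSIX sort options. The function name count_by_freq and the temporary file are hypothetical, and the awk part is a condensed copy of the script above.

```shell
# Hypothetical helper: count_by_freq COLS FILE...
count_by_freq() {
    cols=$1
    shift
    awk -v cols="$cols" '
    BEGIN {
        FS = ","
        OFS = "\t"
        n = split(cols, tmp, ",")       # "1,2" -> tmp[1]="1", tmp[2]="2"
        for (i = 1; i <= n; i++)
            fldNrs[tmp[i]]
    }
    {
        for (fldNr in fldNrs)
            cnt[$fldNr]++               # tally the value in each requested field
    }
    END {
        for (val in cnt)
            print val, cnt[val]
    }' "$@" |
    # Key 1: field 2 (the count), numeric, descending. Key 2: field 1, ascending.
    sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Sample data from the question, written to a temporary file.
csv=$(mktemp)
printf '%s\n' 'A,B,C,A' 'B,D,C,A' 'C,D,A,B' 'D,C,A,A' > "$csv"

count_by_freq 1,2 "$csv"
```

For the sample data this lists D (3) first, then B and C (2 each, in alphabetical order), then A (1).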

I appreciate your advice; however, the task I am completing specifies the format to use. This information is great for future AWK use, though, so once again I appreciate it. — Just Another Coder, Oct 21 '20 at 16:10
If someone is requiring you to do this in a way other than I show in my answer then you might want to question why they're doing so and what else they're asking you to do :-). — Ed Morton, Oct 21 '20 at 17:05

James Brown · Answer 3 · 2020-10-21T16:06:15.400

I decided to give it a go in the spirit of OP's attempt, as kids don't learn if kids don't play. Along the way I tried ARGIND manipulation (it doesn't work), delete ARGV[], and some others that also didn't work:

$ gawk '
BEGIN {
    FS=","
    OFS="\t"
    
    split(ARGV[1],t,/,/)                     # field list picked from ARGV
    for(i in t)                              # from vals to index
        h[t[i]]
    delete ARGV[1]                           # ARGIND manipulation doesn't work
}
{
    for(i in h)                              # subset of fields processed
        a[$i]++                              # count hits
}
END {
    PROCINFO["sorted_in"]="@val_num_desc"    # ordering from OP's attempt
    for(i in a)
        print i,a[i]
}' 1,2 file

Output

You could as well drop the ARGV[] manipulation and replace the BEGIN block with:

$ gawk -v var=1,2 '
BEGIN {
    FS=","
    OFS="\t"
    
    split(var,t,/,/)                         # field list picked from a var
    for(i in t)                              # from vals to index
        h[t[i]]
} ...

As I have commented to others, I appreciate your time; however, the task pretty much specifies not to use any sort of -v argument. This info will be useful in future AWK programming, though, so thank you! — Just Another Coder, Oct 21 '20 at 16:13
That's why the first part of the answer uses `ARGV[]` manipulation and not `-v var`; the second part, which does use it, is there for someone else who might benefit from it. — James Brown, Oct 21 '20 at 16:13
