0

I am trying to turn a awk BEGIN code into a loop. The original code was used to edit data based the value of the "Batch" column and output a file.

This is the original code (that works great):

awk '
BEGIN{
  FS=OFS=","
}
FNR==1{
  for(i=1;i<=NF;i++){
    if($i=="YBr"){
       field=i
    }
    if($i=="NationalCowID"){
       value=i
    }
  }
}
$field==1{
  for(i=value+1;i<=NF;i++){
       $i="*"
  }
}
1
' obvs.csv > obvs1.csv

The above code takes the file obvs.csv that looks like this:

NationalCowID,TestDate,Batch,LN,DIM,YBr,year,CH4,PLS,qtl
206004574,20141208,6,2,92,1,2014,424.4410055,NA,1
206004573,20141209,6,2,93,2,2014,436.4504712,NA,4
206004575,20141207,6,2,91,1,2014,380.94688,NA,6
206004576,20141208,6,2,92,2,2014,424.4410055,NA,7
206004579,20141209,6,2,93,2,2014,436.4504712,NA,8
206004571,20141207,6,2,91,1,2014,380.94688,NA,9

and edits the data to look like this (obvs1.csv):

NationalCowID,TestDate,Batch,LN,DIM,YBr,year,CH4,PLS,qtl
206004574,*,*,*,*,*,*,*,*,*,1
206004573,20141209,6,2,93,2,2014,436.4504712,NA,4   
206004575,*,*,*,*,*,*,*,*,*,6
206004576,20141208,6,2,92,2,2014,424.4410055,NA,7
206004579,20141209,6,2,93,2,2014,436.4504712,NA,8
206004571,*,*,*,*,*,*,*,*,*,9

I would like to take this code and turn it into a loop so that a new file is created with the edits begin applied to each value of the "Batch" column(1-6).I've read some examples and the command explanation, but I don't fully understand what each part of the code does. For example, how do I code $j compared to the already coded $i? this is the loop I have tried to create:

for j in {1..6}
do
awk '
BEGIN{
  FS=OFS=","
}
FNR==$j{
  for(i=1;i<=NF;i++){
    if($i=="Batch"){
       field=i
    }
    if($i=="NationalCowID"){
       value=i
    }
  }
}
$field==1{
  for(i=value+1;i<=NF;i++){
       $i="*"
  }
}
$j
' obvs.csv > obvs$j.csv
done

In the end I am hoping to have 6 files as follows:

obvs1.csv -> only lines with batch = 1 are edited
obvs2.csv -> only lines with batch = 2 are edited
obvs3.csv-> only lines with batch = 3 are edited
obvs4.csv-> only lines with batch = 4 are edited
obvs5.csv-> only lines with batch = 5 are edited
obvs6.csv-> only lines with batch = 6 are edited

So the file name corresponds to the "Batch" being used as an indicator for which line to edit. i.e. for obvs2.csv, for data lines where Batch equals 2, all columns except for the first and last would be edited to * . So far, I end up with 6 files that are named correctly but the edits within the file are not correct. Any direction/ code explanation is greatly appreciated!

Cae.rich
  • 171
  • 7
  • Please post samples of input and expected output, as it is NOT clear here. – RavinderSingh13 Nov 15 '18 at 01:53
  • @RavinderSingh13 I have made edits to the questions. Hopefully what I'm try to achieve is more clear now – Cae.rich Nov 15 '18 at 02:20
  • @Cae.rich, still not clear,what is the logic lines are having `*` in them and what is the condition of getting output into new files? Please be clear in it. – RavinderSingh13 Nov 15 '18 at 02:52
  • In the obvs.csv file, I would like to find every line where $3 == 1, then change $2 - $9 to * , then print entire file into a new file named obvs1.csv. I would like to create a loop that does this for $3 == 2, then $3 == 3, then $3 == 4 …. until $3 == 6. so in the end I will have a total of 6 file that are created when I run the loop. Dose that help at all to make things more clear? Sorry Im not the best at using proper coding terminology yet @RavinderSingh13 – Cae.rich Nov 15 '18 at 03:03
  • @Cae.rich, ok added solution now, try and let me know then? – RavinderSingh13 Nov 15 '18 at 03:35
  • The problem is that you are mixing bash variables and awk variables. Hence the suggested duplicate. The `awk` code you provide can also not perform the substitutions you mention. Mainly this because your for loop states `for(i=value+1;i<=NF;i++) $i="*"` this means that also the last field will become a `*` and your output does not give that. – kvantour Nov 15 '18 at 12:13
  • also you the location where you would want to write `$j` is incorrect. This should be in the other equality. Not `FNR` as `FNR` means `FILE NUMBER RECORD` and has nothing to do with the field value. – kvantour Nov 15 '18 at 12:24

2 Answers2

1

Could you please try following once.

awk '
BEGIN{
  FS=OFS=","
}
FNR==1{
  head=$0
  next
}
{
  count[$6]
  a[++val]=$0
}
END{
  for(i in count){
    for(j=1;j<=val;j++){
      num=split(a[j],array,",")
        if(!header["output_file"i]++){
          print (head) > "output_file"i
        }
      if(array[6]==i){
        for(k=2;k<=(num-1);k++){
          value=value?value OFS "*":"*"
        }
      }
      if(value){
        print (array[1],value,array[num]) > "output_file"i
      }
      else {print (a[j]) > "output_file"i}
      value=""
    }
  }
}'  Input_file

Above code will create 2 output files named output_file1 and output_file2 you could change output file names as per your wish too.

Output files values will be as follows.

cat output_file1
NationalCowID,TestDate,Batch,LN,DIM,YBr,year,CH4,PLS,qtl
206004574,*,*,*,*,*,*,*,*,1
206004573,20141209,6,2,93,2,2014,436.4504712,NA,4
206004575,*,*,*,*,*,*,*,*,6
206004576,20141208,6,2,92,2,2014,424.4410055,NA,7
206004579,20141209,6,2,93,2,2014,436.4504712,NA,8
206004571,*,*,*,*,*,*,*,*,9

cat output_file2
NationalCowID,TestDate,Batch,LN,DIM,YBr,year,CH4,PLS,qtl
206004574,20141208,6,2,92,1,2014,424.4410055,NA,1
206004573,*,*,*,*,*,*,*,*,4
206004575,20141207,6,2,91,1,2014,380.94688,NA,6
206004576,*,*,*,*,*,*,*,*,7
206004579,*,*,*,*,*,*,*,*,8
206004571,20141207,6,2,91,1,2014,380.94688,NA,9

Will add explanation too shortly.

EDIT: Adding explanation too above code now.

awk '                                                                  ##awk code starts from here.
BEGIN{                                                                 ##Starting BEGIN section for awk code here.
  FS=OFS=","                                                           ##Setting FS and OFS values here as comma for all lines.
}                                                                      ##Closing BEGIN block here.
FNR==1{                                                                ##Checking condition when first line is being read for Input_file.
  head=$0                                                              ##Creating a variable named head whose value is current line value.
  next                                                                 ##next will skip all further statements from here.
}                                                                      ##Closing FNR==1 condition block here.
{                                                                      ##Starting main block for awk code here.
  count[$6]                                                            ##Creating an array named count whose index is $6(6th field of current line).
  a[++val]=$0                                                          ##Creating an array named a whose index is variable val value increment with 1 and value is $0.
}                                                                      ##Closing main section of awk code here.
END{                                                                   ##Mentioning END block of this awk code here.
  for(i in count){                                                     ##Starting a for loop to traverse through count array.
    for(j=1;j<=val;j++){                                               ##Starting for loop which runs from j value from 1 to value of val.
      num=split(a[j],array,",")                                        ##Spliting value of array a into array named array and its length stores in num variable.
        if(!header["output_file"i]++){                                 ##Checking if string output_file with i value is NOT in header array then do following.
          print (head) > "output_file"i                                ##Printing header(taken from 1st line of Input_file) to output_file i(integer value).
        }                                                              ##Closing for block for j<=val one here.
      if(array[6]==i){                                                 ##Checking condition if array[6] value is equal to i then do following.
        for(k=2;k<=(num-1);k++){                                       ##Starting a for loop from k=2 to till value of (num-1) here.
          value=value?value OFS "*":"*"                                ##Creating variable value which is having * in it and its value is concatenating.
        }                                                              ##Closing for loop block here.
      }                                                                ##Closing if condition block here.
      if(value){                                                       ##Checking condition if variable value is NOT NULL then do following.
        print (array[1],value,array[num]) > "output_file"i             ##printing array[1],value,array[num] values to output_file i(interger) value output file.
      }                                                                ##Closing block for if condition here.
      else {print (a[j]) > "output_file"i}                             ##Going to else part of if condition printing value of a[j] to output_file i here.
      value=""                                                         ##Nullifying variable value here.
    }                                                                  ##Closing block for for loop here.
  }
}' Input_file                                                     ##Mentioning Input_file name here.
RavinderSingh13
  • 130,504
  • 14
  • 57
  • 93
  • @Cae.rich, try to encourage people by up-voting helpful and good answers and always try to select an answer as correct one out of many to close the thread completely. – RavinderSingh13 Nov 16 '18 at 18:52
  • 1
    Thanks for your explanation. Really helped to understand the code! – Cae.rich Nov 26 '18 at 00:00
0

The problem is that you are mixing bash variables and awk variables. Hence the suggested duplicate. The awk code you provide can also not perform the substitutions you mention.

This is how your original code should look like:

BEGIN{FS=OFS=","}
(FNR==1) { 
   for (i=1;i<=NF;++i) {
      if ($i == "YBr") bfield=i;
      else if ($i == "NationalCowID") cfield=i
   }
}
($bfield == 1) { for(i=cfield+1;i<NF;++i) $i="*" }
1

You can now adapt to include the value that $bfield should have:

awk -v bvalue="$j" '
   BEGIN{FS=OFS=","}
   (FNR==1) { 
      for (i=1;i<=NF;++i) {
         if ($i == "Batch") bfield=i;
         else if ($i == "NationalCowID") cfield=i
      }
   }
   ($bfield == bvalue) { for(i=cfield+1;i<NF;++i) $i="*" }
   1
' file.csv 
kvantour
  • 25,269
  • 4
  • 47
  • 72