
I have a csv in the format name,id,logindate, where logindates appear as "July 15, YYYY HH:mm:ss", i.e. abc,123,"July 15, YYYY HH:mm:ss". Please note that there are headers and other information that should be skipped in the first 5 lines. So a sample csv file may look like:

AuditReport  
asdf  
qwerty  
asdf  
name, id, logindate   
experiment,182002, "July 31, 2022 20:00:00"  
unit 1998,183065, "July 3, 2022 21:00:00"  
asdf, 202065, "May 25, 2022 20:00:00" 

For my output, I would like to get the following (headers are removed):

experiment,182002, "July 31 2022 20:00:00"  
unit 1998,183065, "July 3 2022 21:00:00" 
asdf, 202065, "May 25 2022 20:00:00"

My main task is to parse the commas properly even when one appears inside a quoted string.

After much Google searching and going through several SO questions, I came to the conclusion that a csv parser, some other language, or even GNU awk (using FPAT) would be a better tool for this, but I'm told that the production server at the company my dad works for uses awk, and it's assumed not to be gawk. (I'm doing random small odd tasks to prepare myself for finding a job.)

I'm trying to work around this by removing the "" characters, splitting on FS=",", and then concatenating the last two columns back together. However, my output keeps giving me 4 columns (I'm unable to concatenate the last two columns into one).

My code is:

/usr/bin/env awk 'BEGIN{FS=","} NR>5 {print}' sample.csv | awk '{ gsub("\"", "") } { $1=$1 } 1' | awk '{ print $1, $2, $3" "$4 }' > test.csv

I also tried the following:
https://stackoverflow.com/a/48386788/16034206
awk '{$2=$2"-"$3;$3=""} 1' Input_file
In my case:
/usr/bin/env awk 'BEGIN{FS=","} NR>5 {print}' sample.csv | awk '{ gsub("\"", "") } { $1=$1 } 1' | awk '{ $3=$3" "$4; $4="" } 1' > test.csv

3 Answers


Whenever you find yourself piping the output of one call to awk into the input of another call to awk, you're almost always going down the wrong path.

Using any awk:

$ awk 'BEGIN{FS=OFS="\""} NR>5{for (i=2; i<=NF; i+=2) gsub(/ *, */," ",$i); print}' sample.csv
experiment,182002, "July 31 2022 20:00:00"
unit 1998,183065, "July 3 2022 21:00:00"
asdf, 202065, "May 25 2022 20:00:00"

For a general way to handle more complicated CSVs using any awk, see What's the most robust way to efficiently parse CSV using awk?.
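As a quick illustration of why the even-numbered fields are the quoted parts: with FS set to the double-quote character, the record alternates unquoted/quoted text, so odd fields lie outside the quotes and even fields inside them. A minimal sketch on a single hypothetical line (not from the question's sample file):

```shell
# With FS='"', odd-numbered fields are outside quotes and even-numbered
# fields are inside them, so the comma edit touches only quoted text.
printf 'a,1,"July 31, 2022 20:00:00"\n' |
awk 'BEGIN{FS=OFS="\""} {for (i=2; i<=NF; i+=2) gsub(/ *, */," ",$i); print}'
# prints: a,1,"July 31 2022 20:00:00"
```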

Ed Morton

With your shown samples, please try the following awk code. Written and tested in GNU awk; it should work in any POSIX awk version. It uses awk's match function with the regex "[a-zA-Z]+ [0-9]{1,2}, [0-9]{4} [0-9]{2}(:[0-9]{2}){2}[^"]*" to locate the quoted date, then uses substr (to get sub-strings) while printing the values.

awk '
match($0,/"[a-zA-Z]+ [0-9]{1,2}, [0-9]{4} [0-9]{2}(:[0-9]{2}){2}[^"]*"/){
  val=substr($0,RSTART+1,RLENGTH-2)
  gsub(/,/,"",val)
  print substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
' Input_file

OR, in case you want to keep the " before and after the needed values, a slight change to the above code will do it:

awk '
match($0,/"[a-zA-Z]+ [0-9]{1,2}, [0-9]{4} [0-9]{2}(:[0-9]{2}){2}[^"]*"/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/,/,"",val)
  print substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
' Input_file

Here is the Online demo of the regex used in the above code, for understanding purposes.

RavinderSingh13

You don't have to worry about comma parsing once you've detected your header row, by setting

  • FS = "^$"
echo "${_input_data_}" |
mawk '_+=(!__<NF)*NR { FS="^$" } _<NR' FS=','
experiment,182002, "July 31, 2022 20:00:00"  
unit 1998,183065, "July 3, 2022 21:00:00"  
asdf, 202065, "May 25, 2022 20:00:00"
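The same idea can be sketched in plain POSIX awk with named variables (a hypothetical rewrite for readability, not the author's one-liner): remember the header's line number the first time NF > 1 is seen with FS=",", switch FS so later rows are never split, and print everything after the header verbatim.

```shell
printf 'AuditReport\nasdf\nname,id,date\na,1,"July 3, 2022"\nb,2,"May 5, 2022"\n' |
awk 'BEGIN{FS=","}
     !hdr && NF>1 { hdr=NR; FS="^$"; next }  # header found: stop splitting
     hdr && NR>hdr'                          # print unsplit data rows as-is
# prints:
# a,1,"July 3, 2022"
# b,2,"May 5, 2022"
```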
RARE Kpop Manifesto
  • Oh, thanks for your answer. Could you explain how it works? – experiment unit 1998X Aug 01 '22 at 05:19
  • @experimentunit1998X : `NF` (number of fields) can only be greater than one initially on the header row, which sets the maximum row # to ignore - it also changes the column delimiter (`FS`) to a regex that can never match inside a non-empty row, so everything after that point gets printed without spending any resources splitting unnecessary fields (the `NF` value for all rows after that is either 0, for an entirely empty line - not even just spaces and tabs - or 1 otherwise). – RARE Kpop Manifesto Aug 01 '22 at 07:29
  • But in that case, if it returns the whole row without splitting, doesn't that mean that I won't be able to work on each individual column with any other operations etc.? – experiment unit 1998X Aug 01 '22 at 09:57
  • @experimentunit1998X : you can manually `split()` the row, or slightly modify the existing logic to account for that without changing `FS` - that's one of the great things about `awk`: other than arrays and a very limited, constraining set of rules for built-in vars (e.g. `NF >= 0`), you can have variables switch data types as many times as you like, even within the same function - you can even have it swap the values of one numeric var containing pi and one string var comprising all of Shakespeare's plays with each other, without using a temp var, without XOR, without arrays, and without any risk of data corruption either. – RARE Kpop Manifesto Aug 01 '22 at 23:57
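For the record, the manual `split()` mentioned in the comment above could look something like this (a hypothetical sketch; the rejoined indices f[3] and f[4] assume exactly one embedded comma, inside the quoted date):

```shell
printf 'a,1,"July 3, 2022 21:00:00"\n' |
awk '{
  n = split($0, f, /,/)          # naive comma split yields 4 pieces here
  print f[1], f[2], f[3] f[4]    # glue the two halves of the quoted date
}'
# prints: a 1 "July 3 2022 21:00:00"
```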