2

I have a big CSV field, and I use awk with the field separator set to a comma. However, some fields are quoted and contain a comma, and I'm facing this issue:

Original file:

Downloads $  cat testfile.csv
"aaa","bbb","ccc","dddd"
"aaa","bbb","ccc","d,dd,d"
"aaa","bbb","ccc","dd,d,d"

I am trying this way:

Downloads $  cat testfile.csv | awk -F "," '{ print $2","$3","$4 }'
"bbb","ccc","dddd"
"bbb","ccc","d
"bbb","ccc","dd

Expecting result:

"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
Benjamin W.
  • 46,058
  • 19
  • 106
  • 116
Palani
  • 21
  • 1
  • 2
    See likely duplicate [What's the most robust way to efficiently parse CSV using awk?](https://stackoverflow.com/q/45420535/3422102) – David C. Rankin Mar 20 '22 at 04:30

3 Answers3

2

I would use a tool that is able to properly parse CSV, such as xsv. With it, the command would look like

$ xsv select 2-4 testfile.csv 
bbb,ccc,dddd
bbb,ccc,"d,dd,d"
bbb,ccc,"dd,d,d"

or, if you really want every value quoted, with a second step:

$ xsv select 2-4 testfile.csv | xsv fmt --quote-always
"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
Benjamin W.
  • 46,058
  • 19
  • 106
  • 116
1

Include (escaped) quotes in your field separator flag, and add them to your output print fields:

testfile.csv | awk -F "\",\"" '{print "\""$2"\",\""$3"\",\""$4}'

output:

"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"
Dave Pritlove
  • 2,601
  • 3
  • 15
  • 14
  • Not getting the expected result, you are taking 2nd field quotes and comma ``` Downloads $ cat testfile.csv | awk -F "\",\"" '{print $2"\",\""$3"\",\""$4}' bbb","ccc","dddd" >> starting double quote missing bbb","ccc","d,dd,d" bbb","ccc","dd,d,d" ``` – Palani Mar 20 '22 at 01:49
  • fixed the missing quote – Dave Pritlove Mar 20 '22 at 01:56
1

If gawk or GNU awk is available, you can make use of FPAT, which matches the fields, instead of splitting on field separators.

awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=, '{print $2, $3, $4}' testfile.csv

Result:

"bbb","ccc","dddd"
"bbb","ccc","d,dd,d"
"bbb","ccc","dd,d,d"

The string ([^,]+)|(\"[^\"]+\") is a regex pattern which matches either of:

  • ([^,]+) ... matches a sequence of any characters other than a comma.
  • (\"[^\"]+\") ... matches a string enclosed by double quotes (which may include commas in between).

The parentheses around the patterns are put for visual clarity purpose and the regex will work without them such as FPAT='[^,]+|\"[^\"]+\"' because the alternative | has lower precedence.

tshiono
  • 21,248
  • 2
  • 14
  • 22
  • [GNU AWK manual](https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html) suggest to use `FPAT = "([^,]*)|(\"[^\"]+\")"` – Daweo Mar 20 '22 at 08:15