1

I have a CSV with multiple rows some of which look like this

"ABC","Unfortunately, system has failed"," - Error in system"
"DEF","Check the button labelled "WARNING"","Warning in system"
"XYZ","Everything is okay","No errors"

I need to split these lines and extract the columns such as

enter image description here

I run a loop for each of the rows and extract the 2nd column as

awk -F , '{print $2}' $line 

where $line represents each row. However, I end up getting incorrect values. For example, while trying to fetch 1st row 2nd column, using the above command gives me "Unfortunately and not "Unfortunately, system has failed"

I understand that my strings have both commas and quotes in them which makes it harder to split based on a delimiter. Is there anything else I can try?

Inian
  • 80,270
  • 14
  • 142
  • 161
  • 1
    Is this valid in any CSV dialect, I wonder? Properly the embedded double quotes should be escaped somehow. Usually this is done by doubling them, like `"Check the button labelled ""WARNING"""`. (Yeah, it's ugly, but then it would be; it's from Microsoft.) – tripleee Oct 07 '20 at 06:10
  • 1
    Running Awk on each line separately is inefficient and ugly. Just run it on the whole file; that's what it was built for (and the shell does a much worse job at splitting input into lines). Also, [quote your variables.](https://stackoverflow.com/questions/10067266/when-to-wrap-quotes-around-a-shell-variable) – tripleee Oct 07 '20 at 06:12
  • @tripleee Yes, I would expect the input to have escaped double quotes. Unfortunately, the input is sent from a different application, and we don't have a way to control the format they send – CluelessProgrammer Oct 07 '20 at 06:59
  • If you don't have GNU Awk, you can use `sed` to decorate each line (i.e. replace `","` with some other separator character) and then use `awk` to parse the 2nd field and restore the enclosing quotes with concatenation, e.g. `sed 's/","/~/g' file.csv | awk -F~ '{print "\"" $2 "\""}'` -- using `'~'` as the decoration. – David C. Rankin Oct 07 '20 at 06:59
  • @DavidC.Rankin - How would you know which comma to replace? It could either be the comma separating the columns, or a comma included within the string – CluelessProgrammer Oct 07 '20 at 07:15
  • 1
    By replacing the entire pattern `"\",\""` with `'~'` -- you know which to replace `:)` – David C. Rankin Oct 07 '20 at 07:44
  • Duh oh! You're right. Thanks :) – CluelessProgrammer Oct 08 '20 at 05:56

2 Answers2

1

Using GNU awk and FPAT:

$ gawk '
BEGIN {
    FPAT="([^,]*)|(\"[^\"]+\")"
}
{
    print $2
}' file

Output:

"Unfortunately, system has failed"
"Check the button labelled "WARNING""
"Everything is okay"

It's not complete CSV parser, for example newlines inside quotes are not handled, you need to deal with them yourself (check NF and combine records). More about FPAT:

https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html

If you want to get rid of those quotes:

$ gawk '
BEGIN {
    FPAT="([^,]+)|(\"[^\"]+\")"
}
{
    for(i=1;i<=NF;i++)            # loop all fields
        gsub(/^"|"$/,"",$i)       # remove quotes surrounding fields
    print $2
}' file

Output sample:

...
Check the button labelled "WARNING"
...
James Brown
  • 36,089
  • 7
  • 43
  • 59
1

If you want to put your input data in a 3x3 table you can do it with awk:

awk -v FS=',[^ ]' -v OFS="|" '{print $1, $2, $3}' file
ABC"|Unfortunately, system has failed"| - Error in system"
"DEF"|Check the button labelled "WARNING""|Warning in system"
"XYZ"|Everything is okay"|No errors"

the code:

  • why setting FS=',[^ ]'? Just that comma in , system is not a separator.
Carlos Pascual
  • 1,106
  • 1
  • 5
  • 8