Using awk to parse a delimiter with variable text followed by delimiter

Question

Attempting to manipulate a messy 4GB txt file with a mixture of csv formatting. Data has around 38 'columns' as defined by the delimiter '"'. (Example to follow) The data was exported using commas as a field separator, but also has commas inline with data which makes it difficult to import into most platforms. Ultimately I believe using awk/sed/cat I can fix the data. Each column of data can be defined by using the quotations marks. I just can't figure out how.

I want my end file to be columns identified by what is within the two sets of quotations with all commas replace by a period or similar. The section containing the commas is in the middle of my columns, and is not the last field within the data set. I was attempting to rip out the section with commas with awk, use sed to replace them, and then use cat to paste it back into the file.

Actual data is sensitive and cannot be shared, the below example is akin though.

Data Sample:

"identifier","Status","Name","City","Application","Job","Details","column 39"
"red","paid","Dave","Philadelphia","55823","Cashier","No commas in this comment","spare1"
"rojo","past due","Steve","San Francisco","78434","trainer","Does not like sushi, beer, or ham","spare2"
"verde","pending","Duncan","Columbus","65478","CEO","Late for work, on the fifth","spare3"

Desired Result is focused on changing the comma, and adding data back inline or at the end after "column 39"

"identifier","Status","Name","City","Application","Job","Details","column 39"
"red","paid","Dave","Philadelphia","55823","Cashier","No commas in this comment","spare1"
"rojo","past due","Steve","San Francisco","78434","trainer","Does not like sushi. beer. or ham","spare2"
"verde","pending","Duncan","Columbus","65478","CEO","Late for work. on the fifth","spare3"

Any suggestions are much appreciated!

Please, post a couple of lines of troublesome data too with the expected output. We don't like to make the test cases ourselves. — James Brown, Sep 26 '17 at 21:51
@JamesBrown apologies for the lack of content, I've added a few lines of examples. — brock.bell, Sep 26 '17 at 22:47

score 0 · Answer 1 · answered Sep 26 '17 at 22:54

You can either use sed to remove the inner commas like

$ f1=$'"column 1","Column 2","Name","Address","Application","Job","Comments, about, items that also have, commas, inline","column 39"'

$ echo "$f1" |sed -r 's/([^"]),([^"])/\1\2/g'
"column 1","Column 2","Name","Address","Application","Job","Comments about items that also have commas inline","column 39"

Or you can replace inner commas with something else and later can be reverted back to inner commas:

$ f2=$(echo "$f1" |sed -r 's/([^"]),([^"])/\1-x2c-\2/g');echo "$f2"     "column 1","Column 2","Name","Address","Application","Job","Comments-x2c- about-x2c- items that also have-x2c- commas-x2c- inline","column 39"
#or use sed -r 's/([^"]),([^"])/\1.\2/g' to replace inner commas with dots

$ echo "$f2" |sed 's/-x2c-/,/g'
"column 1","Column 2","Name","Address","Application","Job","Comments, about, items that also have, commas, inline","column 39"

Or you can use a kind of awk to parse fields based on "," and not just on commas:

$ echo "$f1" |awk -vFPAT='[^,]*|"[^"]*"' '{print $1}'
"column 1"

$ echo "$f1" |awk -vFPAT='[^,]*|"[^"]*"' '{print $7}'
"Comments, about, items that also have, commas, inline"

$ echo "$f1" |awk -vFPAT='[^,]*|"[^"]*"' -vOFS="," '{print $1,$7}'
"column 1","Comments, about, items that also have, commas, inline"

Thank you George, The second instance worked perfectly and I was able to make sense of it. I did need to switch over to Linux to run, neglected to mention I was running on OSX to start with. — brock.bell, Oct 02 '17 at 23:37

Using awk to parse a delimiter with variable text followed by delimiter

1 Answers1