Count number of columns of a file in Unix when separator is part of a column value

Question

I have the below line in a file:

~Test1~, ~Test2~,,,, ~Test3, Test4~, ~Test5~

This should be interpreted as 7 columns, because the comma between ~Test3 and Test4~ is data, not a delimiter.

I want a dynamic Unix script that counts the columns (7 here) based on the field delimiter, in this case ',', while ignoring any comma that appears inside a column's text. The separator can be replaced during the process.

I think one solution in sed would be to change the separator from a comma to a semicolon ';', which would make the output: ~Test1~; ~Test2~;;;;~Test3, Test4~; ~Test5~

Please clarify what is a 'column' or 'row' in your case. Provide example input, desired output, and what you have tried. — dawg, Oct 09 '18 at 18:14
@dawg My interpretation: It's CSV with `~` as the quoting character (for some reason). OP wants to validate the number of columns using awk (for some reason). — melpomene, Oct 09 '18 at 18:18
@melpomene: I agree mostly, but the example has 7 rows, not 7 columns. Enough ambiguity to make answering a waste of time — dawg, Oct 09 '18 at 18:30
@dawg Where do you see rows? It's 7 columns, all in one line. — melpomene, Oct 09 '18 at 18:32
@dawg That's just for illustration, showing explicitly what the 7 columns are. — melpomene, Oct 09 '18 at 18:35
This is a good candidate for using a proper CSV parser, but you need to have *only* comma as the field separator, not "comma space" — glenn jackman, Oct 09 '18 at 18:50

dawg · Answer 1 · 2021-05-21T14:54:34.483

If you had consistent CSV, with no space after the commas, you could use Ed Morton's FPAT approach with GNU awk:

$ echo '~Test1~,~Test2~,,,,~Test3, Test4~,~Test5~' | 
        gawk -v FPAT='[^,]*|~[^~]+~' '{for (i=1; i<=NF;i++) print i, "<" $i ">"}'
1 <~Test1~>
2 <~Test2~>
3 <>
4 <>
5 <>
6 <~Test3, Test4~>
7 <~Test5~>

For your example, you can modify that regex to handle the inconsistent spacing by capturing the leading space and the trailing comma as part of each field and then stripping them:

$ echo "~Test1~, ~Test2~,,,, ~Test3, Test4~, ~Test5~" | 
    gawk -v FPAT="([ ]?~[^~]+~,?)|([^,]*,)" '{for (i=1; i<=NF;i++) {sub(/,$/,"", $i); sub(/^ /,"",$i); print i, "<" $i ">"}}'
1 <~Test1~>
2 <~Test2~>
3 <>
4 <>
5 <>
6 <~Test3, Test4~>
7 <~Test5~>
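
Both commands above list the fields rather than print the count, and FPAT is specific to GNU awk. If all you need is the number of columns, a minimal sketch in POSIX awk is to walk each line character by character and count only the commas that fall outside a ~...~ pair. The function name count_columns is made up for illustration, and the sketch assumes that ~ never appears as data inside a field and that the quotes are balanced on each line:

```shell
# count_columns (hypothetical helper): for each input line, print the
# number of comma-separated fields, treating commas inside ~...~ as data.
# Assumes balanced ~ quotes and no literal ~ inside a field.
count_columns() {
    awk '{
        n = 1        # a line with no delimiter still has one field
        inq = 0      # 1 while we are between an opening and closing ~
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c == "~")
                inq = !inq          # toggle quoted state
            else if (c == "," && !inq)
                n++                 # delimiter comma: one more field
        }
        print n
    }'
}

printf '%s\n' '~Test1~, ~Test2~,,,, ~Test3, Test4~, ~Test5~' | count_columns
# prints 7
```

Because only the commas are inspected, it does not matter whether a separator is written as ',' or ', '. Note that an empty line is reported as 1 column.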

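Since your example does have inconsistent spacing between commas, you could also use Ruby's CSV parser, provided the input is first normalized so that every separator is a comma followed by a space, as in the input used here: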

$ ruby -e 'require "csv"
         options={:col_sep=>", ", :quote_char=>"~"}
         CSV.parse($<, **options){ |r| p r}' <<<    '~Test1~, ~Test2~, , , , ~Test3, Test4~, ~Test5~'
["Test1", "Test2", nil, nil, nil, "Test3, Test4", "Test5"]
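
The question also suggests swapping the delimiter commas for semicolons. A sketch of that idea in POSIX awk could look like the following. The name to_semicolons is hypothetical, and the sketch assumes balanced ~ quotes, no literal ~ inside a field, and no ';' anywhere in the data (otherwise the converted line is ambiguous):

```shell
# to_semicolons (hypothetical helper): rewrite each input line so that
# commas outside ~...~ become ';' while commas inside ~...~ are kept.
# Assumes balanced ~ quotes and no ';' in the data.
to_semicolons() {
    awk '{
        out = ""
        inq = 0      # 1 while we are between an opening and closing ~
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (c == "~")
                inq = !inq          # toggle quoted state
            if (c == "," && !inq)
                c = ";"             # delimiter comma: replace it
            out = out c
        }
        print out
    }'
}

line='~Test1~, ~Test2~,,,, ~Test3, Test4~, ~Test5~'

printf '%s\n' "$line" | to_semicolons
# prints: ~Test1~; ~Test2~;;;; ~Test3, Test4~; ~Test5~

# Once the delimiter is unambiguous, plain field splitting gives the count.
printf '%s\n' "$line" | to_semicolons | awk -F';' '{ print NF }'
# prints 7
```

The conversion leaves the spaces after the delimiters in place, so strip them separately if the downstream tool is sensitive to leading whitespace.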
