
I have the below line in a file:

~Test1~, ~Test2~,,,, ~Test3, Test4~, ~Test5~

This should be interpreted as 7 columns, because the comma between ~Test3 and Test4~ is data, not a delimiter.

I want a dynamic script in Unix that counts the columns (7 here) based on the field delimiter, in this case ',', while ignoring any comma that appears inside a quoted column. The separator can be replaced during the process.

I think a solution in sed would be to change the separator from a comma to a semicolon ';', which would make the output: ~Test1~; ~Test2~;;;; ~Test3, Test4~; ~Test5~

TomDunning
Mathew Linton
  • Please clarify what is a 'column' or 'row' in your case. Provide example input, desired output, and what you have tried. – dawg Oct 09 '18 at 18:14
  • @dawg My interpretation: It's CSV with `~` as the quoting character (for some reason). OP wants to validate the number of columns using awk (for some reason). – melpomene Oct 09 '18 at 18:18
  • @melpomene: I agree mostly, but the example has 7 rows, not 7 columns. Enough ambiguity to make answering a waste of time – dawg Oct 09 '18 at 18:30
  • @dawg Where do you see rows? It's 7 columns, all in one line. – melpomene Oct 09 '18 at 18:32
  • In the second block. – dawg Oct 09 '18 at 18:33
  • @dawg That's just for illustration, showing explicitly what the 7 columns are. – melpomene Oct 09 '18 at 18:35
  • This is a good candidate for using a proper CSV parser, but you need to have *only* comma as the field separator, not "comma space" – glenn jackman Oct 09 '18 at 18:50

1 Answer


If you had consistent CSV, with no space after the commas, you could use Ed Morton's FPAT approach with GNU awk:

$ echo '~Test1~,~Test2~,,,,~Test3, Test4~,~Test5~' | 
        gawk -v FPAT='[^,]*|~[^~]+~' '{for (i=1; i<=NF;i++) print i, "<" $i ">"}'
1 <~Test1~>
2 <~Test2~>
3 <>
4 <>
5 <>
6 <~Test3, Test4~>
7 <~Test5~>

For your example, you can modify that regex to handle the inconsistent spacing by capturing the leading space and the trailing comma as part of each field, then removing them:

$ echo "~Test1~, ~Test2~,,,, ~Test3, Test4~, ~Test5~" | 
    gawk -v FPAT="([ ]?~[^~]+~,?)|([^,]*,)" '{for (i=1; i<=NF;i++) {sub(/,$/,"", $i); sub(/^ /,"",$i); print i, "<" $i ">"}}'
1 <~Test1~>
2 <~Test2~>
3 <>
4 <>
5 <>
6 <~Test3, Test4~>
7 <~Test5~>

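Since your example has inconsistent spacing after the commas, another option is Ruby's CSV parser, provided the input is first normalized so that every separator is exactly a comma followed by a space (note the modified input here):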

$ ruby -e 'require "csv"
         options={:col_sep=>", ", :quote_char=>"~"}
         CSV.parse($<, **options){ |r| p r}' <<<    '~Test1~, ~Test2~, , , , ~Test3, Test4~, ~Test5~'
["Test1", "Test2", nil, nil, nil, "Test3, Test4", "Test5"]
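If GNU awk or Ruby is not available, the semicolon idea from the question can be sketched in plain POSIX awk by scanning each line one character at a time. This is only a minimal sketch under some assumptions: `~` is always the quote character, quotes are balanced and never escaped, and a field never spans lines. The function name `tilde_csv` and its `count`/`convert` modes are made up for illustration.

```shell
# tilde_csv MODE  (hypothetical helper; MODE is "count" or "convert")
# Reads lines on stdin. A comma is treated as a delimiter only when it
# is outside a ~...~ pair.
#   count   -> print the number of columns on each line
#   convert -> print the line with delimiter commas replaced by ';'
tilde_csv() {
    awk -v mode="$1" '{
        out = ""; inq = 0; cols = 1
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "~") {
                inq = !inq                 # entering or leaving a quoted field
            } else if (c == "," && !inq) {
                c = ";"                    # unquoted comma: a real delimiter
                cols++
            }
            out = out c
        }
        if (mode == "count") print cols
        else                 print out
    }'
}
```

With the line from the question in a variable `line`, `printf '%s\n' "$line" | tilde_csv count` prints 7, and `tilde_csv convert` prints `~Test1~; ~Test2~;;;; ~Test3, Test4~; ~Test5~`. Counting is done during the scan rather than by splitting the converted line on ';', so data that already contains a semicolon does not skew the count. Note that an empty line is reported as 1 column.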
dawg