0

I have almost no experience using awk. I would like to output all records in a csv that have a certain number of columns, but I need to define a column not by delimiter, but by pattern. The command:

"c:\Program Files\Git\usr\bin\awk" -F, 'NF==13' TestInput2.csv

Works to return all records in csv with exactly 13 columns. This, however, only works if there are no column values containing a comma. It does not work for quoted columns where "a,b","c" is two columns, not three.

I know there is FPAT like FPAT='([^,]+)|("[^"]+")'

I need to combine the FPAT and NF to work like the F and NF above, but I cannot figure out how to write a single line awk that works. It would be something like:

"c:\Program Files\Git\usr\bin\awk" FPAT='([^,]+)|("[^"]+")' 'NF==13' TestInput2.csv

However, that does not work.

Per Request in Comment:

Input

"aa","bb","cc,dd","ee","ff,gg""
"aa","bb","cc","dd","ee"
"aa","bb","cc","dd"

Desired output:

"aa","bb","cc,dd","ee","ff,gg""
"aa","bb","cc","dd","ee"

The output is the first two rows because they both have 5 columns The third row is not returned because it only has 4 columns

If I had rows with 1,2,3,4,6,7,... columns they would also not be returned. Notice that the commas in the quotes are not counted as rows which is why row 1 is considered to have five columns.

I hope that helps describe the problem more accurately.

tripleee
  • 175,061
  • 34
  • 275
  • 318
docjosh
  • 85
  • 1
  • 1
  • 8
  • 3
    Looks like you're using Windows. My condolences. If you'd like to make your life easier, install cygwin and then run awk and other UNIX tools under that environment. If you can't do that, save your awk script in a file and run it as `awk -f script foo.csv` to avoid Windows nightmarish quoting rules. See also https://stackoverflow.com/a/45420607/1745001 for how to use FPAT. – Ed Morton Jan 15 '18 at 01:31
  • 1
    maybe you can re-export your data with a unique FS char, `|` is seldom found in user data, and it is nice, because it is visible, and unambiguous, compared to a tab (possibly space) char. Good luck. – shellter Jan 15 '18 at 01:31
  • Please add sample input and your desired output for that sample input to your question. – Cyrus Jan 15 '18 at 03:23

1 Answers1

0

Well, not too bad really. Read a small awk tutorial and then fiddled with regex.

For awk at windows command line, best to use a .awk file. Something like:

"c:\Program Files\Git\usr\bin\awk" -f simawk3.awk TestInput3.csv

That way you don't need to escape things on behalf of the cmd shell.

As far as the code in the .awk file, its very simple:

/regex-here/ { print $0 }

Replace regex-here with your regex that matches a row. Every time a row matches, print $0 will print the row.

Finally, for the regex, I came up with a variety of one-liners, here's one:

/^(((([^",]|\,|\"))|"(([^"\]|\"|\\))"),){12}(((([^",]|\,|\"))|"(([^"\]|\"|\\))"))$/

That one finds records with 13 columns with or without quotes around the column value - if quoted then the value can contain escaped quotes, escaped escapes or commas - if not quoted then the value can contain escaped quotes or escaped commas or non-escaped escapes such as the two, 13-column, records below

"c_big,in\"t","c_bit","c_\\"ch,ar_10","c_datetime2_7","c_decimal_18_2","c_float","c_int","c_money","c_nchar\,_10",c_nvar\schar_max,"c_uniq\",ueidentifier","c_varchar_50",c_v\\a\"r\,\,\,ch

c_big\,in\"t,"c_bit","c_\\"ch,ar_10","c_datetime2_7","c_decimal_18_2","c_float","c_int","c_money","c_nchar\,_10",c_nvar\schar_max,"c_uniq\",ueidentifier","c_varchar_50",c_v\\a\"r\,\,\,ch

In the regex above, I didn't worry about accurately capturing each field value, since I just need to verify the count of fields and output those records that have the correct number of fields, but I am sure it can be done.

There are likely many other ways to do it. I hope this helps anyone who had a similar need.

docjosh
  • 85
  • 1
  • 1
  • 8
  • FPAT works perfectly with NF, the example you give, what you're trying to get working, is exactly what's in the awk docs for FPAT. Your problem is what Ed Morton's comment says it is: the Windows shell's [nightmarish quoting rules](https://blogs.msdn.microsoft.com/twistylittlepassagesallalike/2011/04/23/everyone-quotes-command-line-arguments-the-wrong-way/). – jthill Jan 15 '18 at 21:09