
This is a less common but still valid CSV file with 6 records (the 5th record is empty):

Name(s),Year,CreateDate
Peter,1960,2017-09-26
"Smith, John",,㏹㋈2017
"Kevin ""Kev"" McRae",,,fourthColumn

"Pam,
Sandra
and Kate","
",26.9.2017

Is it possible to recognize its columns and records properly using awk/gawk so that, for example,

  • in the 4th record, $4 = fourthColumn
  • in the 5th record, $1 is a zero-length string
  • in the 6th record, $1 = Pam,↵Sandra↵and Kate

My question is: how can I correctly obtain the values into $1..$n for every record?

I was able to properly parse this file by writing a finite-state machine in a general-purpose language (I used .NET). But is there a way to parse it properly using the strengths of awk?

Alternative: Should the newline inside the value Pam,↵Sandra↵and Kate be the largest obstacle, perhaps you can propose a solution for the above sample where it is replaced by the string {newline}, i.e. Pam,↵Sandra↵and Kate will become Pam,{newline}Sandra{newline}and Kate. I often do this as preprocessing, so it is acceptable.
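That preprocessing step can itself be sketched in a few lines of POSIX awk (an illustration only, assuming quotes occur solely as field delimiters or as doubled `""` escapes): a physical record is complete once the running count of double quotes is even.

```shell
# Join continuation lines of multi-line quoted fields, replacing each
# embedded newline with the literal string {newline}. A record is
# complete once the running count of double quotes is even.
awk '
{
    buf = (buf == "" ? $0 : buf "{newline}" $0)
    quotes += gsub(/"/, "\"")     # gsub used only to count the quotes
    if (quotes % 2 == 0) { print buf; buf = ""; quotes = 0 }
}
' input.csv
```

The empty 5th record passes through unchanged, since an empty line contains zero (an even number of) quotes.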


Edit: As requested in a comment, this is an example of processing the properly recognized fields and records, where:

  • the field separator , was replaced with ; (preferably using awk's OFS)

  • the last column of every record was duplicated at the beginning of the record

Output:

CreateDate;Name(s);Year;CreateDate
2017-09-26;Peter;1960;2017-09-26
㏹㋈2017;"Smith, John";;㏹㋈2017
fourthColumn;"Kevin ""Kev"" McRae";;;fourthColumn
;
26.9.2017;"Pam,
Sandra
and Kate";"
";26.9.2017
miroxlav
  • post the final expected result – RomanPerekhrest Sep 26 '17 at 09:02
  • this line `fourthColumn;"Kevin ""Kev"" McRae";;;fourthColumn` is unclear. It has 5 columns while the other lines - 4 columns – RomanPerekhrest Sep 26 '17 at 09:19
  • it's not about that field at all. This `"Kevin ""Kev"" McRae"` is a single field, that's obvious. I'm talking about the NUMBER of fields, because this line `fourthColumn;"Kevin ""Kev"" McRae";;;fourthColumn` has 5 fields according to the number of delimiters – RomanPerekhrest Sep 26 '17 at 09:27
  • @RomanPerekhrest – OK, I did not catch your edit of the comment. So the field counts over sample 6 records are 3, 3, 3, 4, 1, 3 and result therefore has 4, 4, 4, 5, 2, 4 because in example I added copy of last field of each record at the start of the record. – miroxlav Sep 26 '17 at 09:30
  • and the last question: those last 4 lines - would they have a fixed format or could there be other sophisticated variations (which should be mentioned in the question)? – RomanPerekhrest Sep 26 '17 at 09:51
  • @RomanPerekhrest – in my reality, I meet CSV files where quotes around fields are not mandatory, but once the quote started the field, its value can employ `""` or `CR`/`LF`/`CR+LF` until another `"` appears. There can be variable number of fields in the record. Any other tricks like `abc"def"` are not generally supported by CSV specs and can be evaluated as invalid (processing can stop or return garbage). Any field can sometimes contain UTF-8 chars but I think this can be ignored as long as it is transparently copied to output. So if the above example will work, every aspect should be safe. – miroxlav Sep 26 '17 at 10:02
  • I can suggest working Python solution. Let me know if you are ready for it. – RomanPerekhrest Sep 26 '17 at 11:05
  • @RomanPerekhrest – thank you very much. While I'd definitely appreciate that for learning Python, the main goal of this question was to check whether awk with its efficient text-processing features can make parsing in this case somewhat more light-weight than .NET or Java or Python. Newly added answer which says informed "no" is what I needed to hear. :) If you are also thinking the same, then I see I learned something more about the practical scope of the awk. – miroxlav Sep 26 '17 at 11:13
  • with awk I can easily process the first 5 records of your input, but the last 4 are pretty arbitrary - and there's no short, quick and ideal awk solution for such a case – RomanPerekhrest Sep 26 '17 at 11:19
  • @RomanPerekhrest – Roman, if I lift those criteria, would you be able to post that awk code? In the original input data, just replace `Pam,↵Sandra↵and Kate` with `Pam,{newline}Sandra{newline}and Kate` (this is what I often do during preprocessing) and then I am interested in your solution. I'll consider accepting it as the answer. – miroxlav Sep 26 '17 at 12:04
  • add that variation into your question as an additional note. Also, you may add `python` tag, then, I would post both solutions – RomanPerekhrest Sep 26 '17 at 12:05
  • @RomanPerekhrest – added – miroxlav Sep 26 '17 at 12:10
  • 1
    Any `Newly added answer which says informed "no"` is simply wrong. Of course you can do this fairly easily (i.e. with a function of about 20 lines) with awk. See https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk – Ed Morton Sep 26 '17 at 14:51
  • @miroxlav, I've been away for a few hours and now the question is closed. Can't post anything, unfortunately ... Though, it may be reposted with a single `python` tag if needed – RomanPerekhrest Sep 26 '17 at 15:07
  • @RomanPerekhrest – thanks for your interest Roman. As we can see, there is now a canonical Q/A linked to this one which basically performs most of the stuff using awk. But sure, should I go more into python, I'll post with its tag. – miroxlav Sep 26 '17 at 15:43
  • s/basically performs most of the stuff/does everything required/ :-). – Ed Morton Sep 26 '17 at 16:47

2 Answers


I was able to properly parse this file by writing a finite-state machine in a general-purpose language (I used .NET). But is there a way to parse it properly using the strengths of awk?

Since awk is a Turing-complete programming language, you can certainly write your parser in awk. But there is no simple one-line awk trick to parse this input.

I'd recommend sticking with your .NET implementation. But also look at the available CSV parsing libraries (for whatever programming language, for example Python). As long as your input is valid CSV, they should be able to handle it.
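To illustrate that such a parser is possible, just not a one-liner: a character-scanning splitter for a single-line record (no embedded newlines) fits in roughly twenty lines of POSIX awk. This is a sketch under the CSV rules stated in the question, not a validating parser; `file.csv` is a placeholder name.

```shell
# Sketch: split one CSV record (no embedded newlines) into fld[1..n],
# honoring quoted fields and doubled "" escapes. Not a validating parser.
awk '
function splitcsv(line, fld,    n, i, c, q) {
    split("", fld)                 # clear leftovers from the previous record
    n = 1; fld[1] = ""; q = 0
    for (i = 1; i <= length(line); i++) {
        c = substr(line, i, 1)
        if (c == "\"") {
            # inside quotes, "" is an escaped quote; otherwise toggle state
            if (q && substr(line, i + 1, 1) == "\"") { fld[n] = fld[n] "\""; i++ } else { q = !q }
        } else if (c == "," && !q) {
            fld[++n] = ""          # unquoted comma starts a new field
        } else {
            fld[n] = fld[n] c
        }
    }
    return n
}
{
    nf = splitcsv($0, fld)
    for (i = 1; i <= nf; i++) printf "%d=[%s]\n", i, fld[i]
}
' file.csv
```

On the 4th sample record this yields four fields, with the doubled quotes in `"Kevin ""Kev"" McRae"` decoded to single ones.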

hek2mgl
  • Much appreciated, thank you, this has great value for learning awk's scope. I lifted the criteria slightly (see the edited question) so the newline character isn't necessarily contained inside the field value. (I'm often able to achieve that using awk preprocessing if the first/last column contains a value reliably recognizable by a regex.) Maybe this change could help to unleash the power of awk. – miroxlav Sep 26 '17 at 12:13
  • Yeah, the newlines were concerning me. When you can use `gawk` there is the `FPAT` variable which can be used to describe what a field looks like (with the ability to specify alternatives). Let me check if I can find a nice way to process the example with the lifted criteria. – hek2mgl Sep 26 '17 at 12:28
  • @miroxlav To be honest, even the lifted criteria are hard to parse. To answer your question: No, there is no simple way to parse this input with awk. Of course, it is possible, but not in the simple way you might expect. Btw, are you sure that the input is really valid CSV? – hek2mgl Sep 26 '17 at 12:58
  • Thanks for the reply. Regarding valid CSV: I know that a variable number of columns could be objected to, as well as the empty line. From a syntactic viewpoint, they shouldn't matter too much. Thank you for trying with awk anyway. – miroxlav Sep 26 '17 at 13:28
  • Sure, but what about `"a""b""c"`, is this 3 fields or one? Also you seem to mix many dialects in a single file (this btw doesn't seem like a real-world use case to me) – hek2mgl Sep 26 '17 at 13:43
  • `"a""b""c"` means `a"b"c` after decoding, see [Basic rules](https://en.wikipedia.org/wiki/Comma-separated_values#Basic_rules). Double quotes, UTF-8 and newline characters in fields are common in my files. Multiple date formats are just a random example. – miroxlav Sep 26 '17 at 15:47

As @hek2mgl says, the proper way to handle CSV files is with a finite-state machine. I wrote one several years ago, in C, and still use it to this day. See https://groups.google.com/d/msg/comp.lang.awk/1aMF1Z3TG7k/-f9wtRQVloYJ.
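For the question's relaxed variant (embedded newlines already replaced by {newline} during preprocessing), the same state-machine idea can be sketched directly in POSIX awk, here also producing the requested output: `;` as OFS and the last field duplicated at the start. The re-quoting rule (quote a field only if it contains a comma or a quote) is an assumption, and `preprocessed.csv` is a placeholder name.

```shell
# Sketch for the {newline}-preprocessed sample: split each record with a
# small scanner, then print the last field again at the front, joined
# with ";" and re-quoted wherever a field contains a comma or a quote.
awk '
function splitcsv(line, fld,    n, i, c, q) {
    split("", fld)
    n = 1; fld[1] = ""; q = 0
    for (i = 1; i <= length(line); i++) {
        c = substr(line, i, 1)
        if (c == "\"") {
            if (q && substr(line, i + 1, 1) == "\"") { fld[n] = fld[n] "\""; i++ } else { q = !q }
        } else if (c == "," && !q) {
            fld[++n] = ""
        } else {
            fld[n] = fld[n] c
        }
    }
    return n
}
function enc(v) {                  # re-quote on output only where needed
    if (v ~ /[,"]/) { gsub(/"/, "\"\"", v); return "\"" v "\"" }
    return v
}
BEGIN { OFS = ";" }
{
    nf = splitcsv($0, fld)
    out = enc(fld[nf])
    for (i = 1; i <= nf; i++) out = out OFS enc(fld[i])
    print out
}
' preprocessed.csv
```

On the header line this prints `CreateDate;Name(s);Year;CreateDate`, matching the expected output in the question.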

user448810