This is a less common but valid CSV file with 6 records (the 5th record is empty):
Name(s),Year,CreateDate
Peter,1960,2017-09-26
"Smith, John",,㏹㋈2017
"Kevin ""Kev"" McRae",,,fourthColumn

"Pam,
Sandra
and Kate","
",26.9.2017
Is it possible to recognize its columns and records properly using awk/gawk, so that, for example:
- in the 4th record, $4 = fourthColumn
- in the 5th record, $1 is a zero-length string
- in the 6th record, $1 = Pam,↵Sandra↵and Kate
My question is: how do I correctly obtain the values in $1..$n for every record?
I was able to parse this file properly by writing a finite-state machine in a general-purpose language (I used .NET). But is there a way to parse it properly using the strengths of awk?
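For reference, the kind of finite-state machine I mean can be sketched roughly as follows. This is a minimal illustration in Python, not my actual .NET code; the names parse_csv and SAMPLE are made up for this example, and it assumes "\n" line endings and an empty line for the 5th record.

```python
def parse_csv(text):
    """Split CSV text into records (lists of field strings) with a two-state machine."""
    records, fields, buf = [], [], []
    in_quotes = False   # state: inside a quoted field or not
    pending = False     # True once the current record has consumed any character
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        pending = True
        if in_quotes:
            if ch == '"':
                if i + 1 < n and text[i + 1] == '"':
                    buf.append('"')      # doubled quote is a literal quote
                    i += 1
                else:
                    in_quotes = False    # closing quote
            else:
                buf.append(ch)           # commas and newlines are plain data here
        elif ch == '"':
            in_quotes = True             # opening quote
        elif ch == ',':
            fields.append(''.join(buf))  # field boundary
            buf = []
        elif ch == '\n':
            fields.append(''.join(buf))  # record boundary
            records.append(fields)
            fields, buf = [], []
            pending = False
        else:
            buf.append(ch)
        i += 1
    if pending:                          # last record without a trailing newline
        fields.append(''.join(buf))
        records.append(fields)
    return records


# The sample from this question (assumed layout, see the note above).
SAMPLE = (
    'Name(s),Year,CreateDate\n'
    'Peter,1960,2017-09-26\n'
    '"Smith, John",,㏹㋈2017\n'
    '"Kevin ""Kev"" McRae",,,fourthColumn\n'
    '\n'
    '"Pam,\nSandra\nand Kate","\n",26.9.2017\n'
)

if __name__ == "__main__":
    for number, fields in enumerate(parse_csv(SAMPLE), start=1):
        print(number, len(fields))       # record number and field count
```

With this sketch, parse_csv(SAMPLE)[3][3] is fourthColumn, the 5th record is a single zero-length field, and the first field of the 6th record keeps both embedded newlines. That is the behaviour I would like to reproduce in awk.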
Alternative: If the newline inside the value Pam,↵Sandra↵and Kate is the largest obstacle, maybe you can propose a solution for the sample above where ↵ is replaced by the string {newline}, i.e. Pam,↵Sandra↵and Kate becomes Pam,{newline}Sandra{newline}and Kate. I often do this as preprocessing, so it is acceptable.
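The preprocessing step described above could look roughly like this. It is only a sketch in Python; the function name escape_quoted_newlines and its placeholder parameter are hypothetical, and it assumes well-formed quoting and "\n" line endings.

```python
def escape_quoted_newlines(text, placeholder="{newline}"):
    """Replace newlines that occur inside quoted CSV fields with a placeholder."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # Every quote toggles the state; a doubled quote ("") toggles
            # twice, so the state is unchanged after the pair.
            in_quotes = not in_quotes
            out.append(ch)
        elif ch == '\n' and in_quotes:
            out.append(placeholder)      # newline that is part of a value
        else:
            out.append(ch)               # everything else passes through
    return ''.join(out)


if __name__ == "__main__":
    line = '"Pam,\nSandra\nand Kate","\n",26.9.2017\n'
    print(escape_quoted_newlines(line), end='')
```

After this step every record sits on one physical line, so a line-oriented tool reads one record at a time. Commas inside quoted values still need quote-aware field splitting.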
Edit: As requested in a comment, here is an example of processing the properly recognized fields and records, where:
- the field separator , was replaced with ; (preferably using awk's OFS)
- the last column of every record was duplicated at the beginning of the record
Output:
CreateDate;Name(s);Year;CreateDate
2017-09-26;Peter;1960;2017-09-26
㏹㋈2017;"Smith, John";;㏹㋈2017
fourthColumn;"Kevin ""Kev"" McRae";;;fourthColumn
;
26.9.2017;"Pam,
Sandra
and Kate";"
";26.9.2017