Replace character or fixing incorrect new line character in csv using UNIX command

Question

I have a Pipe-delimited CSV from a company report production. But there is a 'comment' field that the data inputted randomly by the employee and causes failed to load the data into database. How do i solve it using UNIX command or shell script?

The data sample is look like this :

Employee ID|Time Type|Start Date|End Date|Number Of Days|Comment|Approved
90006731|Leave|04/02/2019|04/02/2019|1|annual leaves|Y
90005267|Leave|04/02/2019|04/02/2019|1||Y
90007366|Leave|04/02/2019|04/02/2019|1|* Take care of vehicle taxes 
* Vehicle Repair 
* Community service
* Swimming|Y
90005052|Leave|04/02/2019|04/02/2019|1|Son's field trip|Y
90006253|Death of Wife/Husband/Child/Parent|04/01/2019|04/02/2019|2||Y
90007595|Leave|04/01/2019|04/01/2019|1|family plan|Y
90004064|Leave|08/18/2020|08/21/2020|3|Dear Mas Rama,
Please kindly approve, Mas Okto was oke.
Thanks.|Y
90007072|Sick Leave Without Certificate|04/01/2019|04/01/2019|1|Sick leave due to eye swelling|Y
90004371|Sick Leave|04/01/2019|04/05/2019|4||Y
90007431|Sick Leave|04/01/2019|04/01/2019|1||Y

Required Output :

Employee ID|Time Type|Start Date|End Date|Number Of Days|Comment|Approved
90006731|Leave|04/02/2019|04/02/2019|1|annual leaves|Y
90005267|Leave|04/02/2019|04/02/2019|1||Y
90007366|Leave|04/02/2019|04/02/2019|1|* Take care of vehicle taxes * Vehicle Repair * Community service * Swimming|Y
90005052|Leave|04/02/2019|04/02/2019|1|Son's field trip|Y
90006253|Death of Wife/Husband/Child/Parent|04/01/2019|04/02/2019|2||Y
90007595|Leave|04/01/2019|04/01/2019|1|family plan|Y
90004064|Leave|08/18/2020|08/21/2020|3|Dear Mas Rama, Please kindly approve, Mas Okto was oke. Thanks.|Y
90007072|Sick Leave Without Certificate|04/01/2019|04/01/2019|1|Sick leave due to eye swelling|Y
90004371|Sick Leave|04/01/2019|04/05/2019|4||Y
90007431|Sick Leave|04/01/2019|04/01/2019|1||Y

I have tried [this][1],

awk -F\| '{ while (NF < 7 || $NF == "") { brokenline=$0; getline; $0 = brokenline $0}; print }' cu_inf_20200902tst.csv > cu_inf_20200902tst1.csv

but i found the error warning :

awk: cmd. line:1: (FILENAME=cu_inf_20200902tst.csv FNR=19) fatal: grow_fields_arr: fields_arr: can't allocate 321069040 bytes of memory (Cannot allocate memory)

Any suggest to fix my problem? [1]: https://unix.stackexchange.com/questions/434979/fixing-malformed-csv-with-incorrect-new-line-chars-using-sed-or-perl-only

Kindly do add your efforts in your question in form of code and let us know, it's highly encouraged on SO, thank you. — RavinderSingh13, Sep 03 '20 at 04:48
@MuhammadAbdulRifqi : If the comment field is present sometimes, and missing sometimes, the file is not valid CSV. According to the CSV headers, every line must include a comment field. If you really want to process this file, instead of rejecting it plainly, I would preprocess it by inserting an empty comment field at the appropriate place. — user1934428, Sep 03 '20 at 11:03
Can the comment data ever contain a pipe symbol? Is that escaped? Do fields ever get quoted? You need software that can count the number of pipe symbols in a line, and read the next line if there aren't enough pipe symbols yet. That sort of stuff gets tricky. — Jonathan Leffler, Sep 03 '20 at 18:08

RavinderSingh13 · Answer 1 · 2020-09-03T06:13:09.850

Considering that your first field id will be always 8 digits and no other field will be of 8 digits like it, if this is the case then you could try following.

awk '
{
  printf("%s%s", (FNR>1 ? (/^[0-9]{8}/?ORS:OFS) : ""), $0)
}
END{
  print ""
}' Input_file

Explanation: Adding detailed explanation for above.

awk '                                                   ##Starting awk program from here.
{
  printf("%s%s",(FNR>1?(/^[0-9]{8}/?ORS:OFS) ""), $0)   ##Using printf statement from here where checking condition if line is first line then check if line starts from 8 digits then print new line else print space.
}
END{                                                    ##Starting END block of this program from here.
  print ""                                              ##Printing a new line in here.
}' Input_file                                           ##Mentioning Input_file name here.

OR(in case you want to keep header separate itself then keep its condition separate itself)

awk '
FNR==1{
  print
  next
}
{
  printf("%s%s",$0!~/^[0-9]{8}/?OFS:(FNR>2?ORS:""),$0)
}
END{
  print ""
}' Input_file

Explanation: Adding detailed explanation for above.

awk '                                                     ##Starting awk program from here.
FNR==1{                                                   ##Checking condition if this is first line then do following.
  print                                                   ##Printing current line here.
  next                                                    ##next will skip all further statements from here.
}
{
  printf("%s%s",$0!~/^[0-9]{8}/?OFS:(FNR>2?ORS:""),$0)    ##Using printf for printing where checking condition if line starts from 8 digits then print space else check if line number is more than 2 then print new line or nothing with current line.
}
END{                                                      ##Starting END block of this program from here.
  print ""                                                ##Printing a new line in here.
}' Input_file                                             ##Mentioning Input_file name here.

score 1 · Answer 2 · answered Sep 03 '20 at 05:33

Please try the following.

Sed Solution

sed ':a;N;/\n[0-9]/!s/\n/ /;ta;P;D' input_file

awk Solution

awk '{printf "%s%s", (NR>1 ? (/^[0-9]/?ORS:OFS) : ""), $0} END{print ""}' input_file

Output of the following command

Employee ID|Time Type|Start Date|End Date|Number Of Days|Comment|Approved
90006731|Leave|04/02/2019|04/02/2019|1|annual leaves|Y
90005267|Leave|04/02/2019|04/02/2019|1||Y
90007366|Leave|04/02/2019|04/02/2019|1|* Take care of vehicle taxes  *Vehicle Repair  * Community service * Swimming|Y
90005052|Leave|04/02/2019|04/02/2019|1|Son's field trip|Y
90006253|Death of Wife/Husband/Child/Parent|04/01/2019|04/02/2019|2||Y
90007595|Leave|04/01/2019|04/01/2019|1|family plan|Y
90004064|Leave|08/18/2020|08/21/2020|3|Dear Mas Rama, Please kindly approve, Mas Okto was oke. Thanks.|Y
90007072|Sick Leave Without Certificate|04/01/2019|04/01/2019|1|Sick leave due to eye swelling|Y
90004371|Sick Leave|04/01/2019|04/05/2019|4||Y
90007431|Sick Leave|04/01/2019|04/01/2019|1||Y

for detail explanatiion: How to join lines not starting with specific pattern to the previous line in UNIX?

score 1 · Accepted Answer · answered Sep 03 '20 at 06:39

1

This might work for you (GNU sed):

sed ':a;s/[^|]*/&/7;t;N;s/\n//;ta' file

Check to see if the current line contains 7 fields, if not, append the next line, remove the newline and repeat.

answered Sep 03 '20 at 06:39

potong

55,640
6
51
83

Replace character or fixing incorrect new line character in csv using UNIX command

3 Answers3