2

Sample input Data::

cat test.csv
234,aa,bb,cc,30,dd

22,cc,

ff,dd,

40,gg

pxy,aa,,cc,

40

,dd

Required output ::

cat test_new.csv
234,aa,bb,cc,30,dd

22,cc,ff,dd,40,gg

pxy,aa,,cc,40,dd

I tried various awk and sed commands, nothing seems to be working out.

eg:: awk 'NR == 1{ printf $0; next } { printf "%s%s", (/^[0-9]+,/? ORS : ""), $0 } END{ print "" }'

Above awk looks only for numeric characters with "," , but this will create extra line if numeric comes in a column with new line character.

Input file contains Control M characters, that I'm already removing using dos2unix

U. Windl
  • 3,480
  • 26
  • 54
Eja
  • 45
  • 1
  • 7
  • Hi I'm getting new line characters in between columns of a csv file. Is there a way to remove new line characters inside columns as well in unix CSV file. tried this AWK command awk 'NR == 1{ printf $0; next } { printf "%s%s", (/^[0-9]+,/? ORS : ""), $0 } END{ print "" }' , this works 99 % good, but cannot handle numbers that comes in file with new line character in between columns, and create a new line for same – Eja Aug 25 '21 at 10:15
  • So, you basically want to replace each sequence __ by just a comma? Did you consider trying `sed`? – user1934428 Aug 25 '21 at 11:30
  • its not just replacing comma and newline with just a comma , its like replacing new line characters inside two commas. and this should not impact line formatting as well. this would be reading a CSV file which is having CLOB columns inside , and in those CLOB columns we have new line characters to be handled – Eja Aug 25 '21 at 11:49
  • But isn't this equivalent to what I said? Replacing comma-newline is eqivalent to removing those newlines which follow a comma, and at least for your input, this would provide exactly the output you are requesting. – user1934428 Aug 25 '21 at 11:54
  • i provided a sample data, but in actual data , we might not have comma with the new line all the time. take this as second or third column is a clob column in excel, where we have a paragraph with a lot of new line characters without comma. when we read this file in csv format. it shows one single line splitted in 8-10 lines – Eja Aug 25 '21 at 12:00
  • I see the point, but then I don't quite understand how you can recognize, when a line for output is ready to be completed. Are you sure that your input is **valid** CSV? I thought that in valid CSV, multiline fields have to be quoted. If you are sure that this is syntactically correct CSV, you could use a programming language which comes with a CSV parser (I'm using Ruby for this), and read the file, and then you loop over each field and simply remove the newline characters, and finally you write out the updated CSV line. – user1934428 Aug 25 '21 at 12:04
  • I just made a quick experiment: When I export multi-line fields with Excel, they **are** enclosed in quotes, so I would say that your input is not valid CSV. – user1934428 Aug 25 '21 at 12:11
  • Please read https://stackoverflow.com/help/formatting and look at some other questions that have been upvoted and answered to see how to format your code blocks and sample input/output, then run `cat -v` instead of just `cat` on your input and expectd output files and post the output of THAT in your question so we can see the "Control Ms" (obviously do that BEFORE running `dos2unix` on your file). – Ed Morton Aug 25 '21 at 13:21

1 Answers1

2

Regarding "Input file contains Control M characters, that i'm already removing using dos2unix" - that's probably making your life harder. There's a good chance that the "newline"s within a line are just LFs while the record endings are CRLF and so by removing the CRs we can no longer simply set the awk RS to CRLF and then remove any LFs remaining in the record.

If your file was exported from Excel as you say in the comments under your question then, whether the fields are quoted or not, it'll use CRLF at the end of each record, and LF alone within each field which looks like this if you just don't run dos2unix on it:

$ cat -v test.csv
234,aa,bb,cc,30,dd^M
22,cc,
ff,dd,
40,gg^M
pxy,aa,,cc,
40
,dd^M

so then using any awk:

$ awk -v ORS= '{sub(/\r$/,"\n")} 1' test.csv
234,aa,bb,cc,30,dd
22,cc,ff,dd,40,gg
pxy,aa,,cc,40,dd

or, if you prefer, using GNU awk for multi-char RS:

$ awk -v RS='\r\n' '{gsub(/\n/,"")} 1' test.csv
234,aa,bb,cc,30,dd
22,cc,ff,dd,40,gg
pxy,aa,,cc,40,dd

For more information on parsing CSV with awk see What's the most robust way to efficiently parse CSV using awk?.

Regarding your original script - never use printf $0, always use printf "%s", $0 instead since the former will fail when $0 contains any printf formatting chars such as %s:

$ echo 'a%sb' | awk '{printf "%s", $0}'
a%sb$

$ echo 'a%sb' | awk '{printf $0}'
awk: cmd. line:1: (FILENAME=- FNR=1) fatal: not enough arguments to satisfy format string
        `a%sb'
          ^ ran out for this one
$

That applies to using printf on all input data.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • 1
    thanks for the info Ed, this worked for me awk -v ORS= '{sub(/\r$/,"\n")} 1' – Eja Aug 25 '21 at 15:41
  • wanted to know, in case same input file doesn't have control M characters in it, will this same command will work fine ? – Eja Aug 25 '21 at 15:42
  • You're welcome. No, it's specifically using control Ms (aka CRs aka `\r`s) to separate `LF`s (aka `\n`s) that occur at the end of a record (`CRLF`) from those that occur mid-record (`LF` alone). If you don't have `CR`s then you need a completely different solution that either relies on `LF`s only appearing in quoted fields or knowing in advance how many fields should be present on each line. – Ed Morton Aug 25 '21 at 16:13
  • tried the same awk command on a different file , that have '"' and '-' in between the file columns and starting of the file. somehow data is not getting sorted properly there – Eja Aug 26 '21 at 17:02
  • Sorry, I don;t know what that means, ask a new question and include those cases in the sample input/output. – Ed Morton Aug 26 '21 at 17:19
  • HI Ed Mortan, can you help me out on the above shared concern. Raised a new question for same as :: https://stackoverflow.com/questions/70875550/removing-new-line-r-n-and-m-characters-in-all-column-except-last-one-in-un – Eja Jan 31 '22 at 08:42