I found a solution for cleaning up a Unix file that contains ^M and "\r\n" sequences, as described in the question I linked earlier: https://stackoverflow.com/questions/68919927/removing-new-line-characters-in-csv-file-from-inside-columns-in-unix

The current requirement is to remove the "\r\n" and ^M characters from every column of the file except the last one, so that the "\r\n" (with its ^M) at the end of the last column can still be used to format the file with the command awk -v RS='\r\n' '{gsub(/\n/,"")} 1' test.csv

Sample data:

$ cat -v test.csv
234,aa,bb,cc,30,dd^M

22,cc,^M

ff,dd,^M

40,gg^M

pxy,aa,,cc,^M

40

,dd^M

Current output:

234,aa,bb,cc,30,dd

22,cc,

ff,dd,

40,gg

pxy,aa,,cc,

40,dd

Expected output:

234,aa,bb,cc,30,dd

22,cc,ff,dd,40,gg

pxy,aa,,cc,40,dd
RavinderSingh13
Eja
  • Do you **really** have a blank line between each data line in your input? How can we tell by looking at the input which `\r\n`s to keep and which to delete? You say `get rid of "\r\n" and ^M characters in all column of unix file except last one` but how do we know which is the last one - is it based on there being a fixed number of columns, e.g. 6? – Ed Morton Jan 31 '22 at 15:09

1 Answer

Would you please try a perl solution:

perl -0777 -pe 's/\r?\n(?=,)//g; s/(?<=,)\r?\n//g; s/\r//g' test.csv

Output:

234,aa,bb,cc,30,dd
22,cc,ff,dd,40,gg
pxy,aa,,cc,40,dd
  • The -0777 option tells perl to slurp the whole file, line endings included, as a single record.
  • The -p option makes perl loop over the input and print each record after the code has run; -e takes the next argument as the perl code to run.
  • The regex \r?\n(?=,) matches an optional CR character followed by a NL character, with a positive lookahead for a comma.
  • The substitution s/\r?\n(?=,)//g then removes the line endings that match the condition above. The following comma is not removed, because lookaround assertions do not consume characters.
  • The substitution s/(?<=,)\r?\n//g is the mirrored version, which uses a lookbehind to remove the line endings that follow a comma.
  • The final s/\r//g removes any remaining CR characters.

[Edit]
Because the perl script above slurps the whole file into memory, it may be slow if the file is huge. Here is an alternative that processes the input line by line using a small state machine.

awk -v ORS="" '                 # empty the output record separator
    /^\r?$/ {next}              # skip blank lines
    f && !/^,/ {print "\n"}     # break the line if the flag is set and the line does not start with a comma
    {
        sub(/\r$/, "")          # remove trailing CR character
        print                   # print current line (w/o newline)
        if ($0 ~ /,$/) f = 0    # if the line has a trailing comma, clear the flag
        else f = 1              # if the line properly ends, set the flag
    }
    END {
        print "\n"              # append the newline to the last line
    }
' test.csv
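
As a way to reproduce the result, the sketch below feeds the awk script a sample built with printf. The byte layout (CRLF on the data lines, LF-only blank lines in between, and a bare LF after the lone 40) is an assumption read off the cat -v listing in the question; the names sample and joined are illustrative.

```shell
# Recreate the posted sample, blank lines included (byte layout is an assumption).
sample=$(mktemp)
printf '234,aa,bb,cc,30,dd\r\n\n22,cc,\r\n\nff,dd,\r\n\n40,gg\r\n\npxy,aa,,cc,\r\n\n40\n\n,dd\r\n' > "$sample"

joined=$(awk -v ORS="" '
    /^\r?$/ {next}              # skip blank lines
    f && !/^,/ {print "\n"}     # start a new output line after a complete record
    {
        sub(/\r$/, "")          # remove trailing CR character
        print                   # print the fragment without a newline
        f = ($0 !~ /,$/)        # flag is set unless the fragment ends with a comma
    }
    END {print "\n"}            # terminate the last line
' "$sample")

printf '%s\n' "$joined"
rm -f "$sample"
```

If the real file behaves differently, comparing od -c test.csv | head against the printf format above should show where the bytes differ from this assumption.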

BTW, if you want blank lines between the records, as in the posted expected output, which looks like:

234,aa,bb,cc,30,dd

22,cc,ff,dd,40,gg

pxy,aa,,cc,40,dd

then add another \n to the print statement:

    f && !/^,/ {print "\n\n"}
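
If the records turn out to have a fixed number of columns, as the comment under the question asks, counting fields is another way to decide where a record ends. The following is only a sketch under that assumption: n=6 is a hypothetical column count taken from the sample, and buf, parts and by_count are illustrative names.

```shell
# Recreate the posted sample, blank lines included (byte layout is an assumption).
sample=$(mktemp)
printf '234,aa,bb,cc,30,dd\r\n\n22,cc,\r\n\nff,dd,\r\n\n40,gg\r\n\npxy,aa,,cc,\r\n\n40\n\n,dd\r\n' > "$sample"

by_count=$(awk -v n=6 '
    { sub(/\r$/, "") }                      # strip a trailing CR
    $0 == "" { next }                       # skip blank lines
    {
        buf = buf $0                        # append the fragment to the pending record
        if (split(buf, parts, ",") >= n) {  # record is complete once it has n fields
            print buf
            buf = ""
        }
    }
    END { if (buf != "") print buf }        # flush an incomplete trailing record
' "$sample")

printf '%s\n' "$by_count"
rm -f "$sample"
```

This does not rely on trailing or leading commas, but it emits a record as soon as the n-th field begins, so it would cut a record short if the last column itself were split across lines.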
tshiono
  • I tried this perl command, but it has been running for an hour. I need to format a file with a huge data set of 30+ columns. Is there an awk option that would run faster? – Eja Jan 27 '22 at 14:59
  • Thank you for the feedback and sorry for the inconvenience. The posted perl script slurps all lines into memory and may be slow if the file is huge. I have updated my answer with an awk alternative. BR. – tshiono Jan 27 '22 at 22:36
  • The provided awk command doesn't seem to work. It doesn't even cover what the earlier awk command did: awk -v RS='\r\n' -v sep=# '{gsub(/\n/,sep)} 1' test.csv > test_tmp.csv. That command deals with newline characters inside a column, but it cannot deal with newline characters together with Control-M characters inside column values. – Eja Jan 28 '22 at 14:41
  • I had assumed the blank lines between the lines in the posted sample were introduced by some editing process, and I had dropped them from my test input. If they *do* exist, my awk script will fail as your comment describes. I've updated my awk script accordingly. Would you please try it? Sorry for bothering you again. – tshiono Jan 29 '22 at 02:21
  • Sorry, adding another \n to the print line doesn't help either. Isn't this about \r\n altogether? – Eja Jan 31 '22 at 06:58
  • Did you update the script by adding the 2nd line `/^\r?$/ {next}` as shown in my answer? – tshiono Jan 31 '22 at 07:19
  • I tried adding the 2nd line, but this seems to split the lines instead of normalizing the data. I now get "▒| ▒| - ▒|" characters as a prefix on each new line. – Eja Feb 03 '22 at 16:13
  • Sorry, but I have no idea why you are getting such weird results. I'm afraid you may not be executing my posted script properly. Please read my script and its embedded comments carefully. If you do not understand what each line of the script is doing, I may not be able to help. – tshiono Feb 03 '22 at 22:28