1

An input file is given, each line of which contains quotes for each column and carriage return/ new line character.

  • If the line contains new lines it has be appended with in the same line which is inside the quotes i.e for example line 1

  • Removing of double quotes for each column if the delimiter(,) is not present.

  • Removing of Carriage Return characters i.e(^M)

To exemplify, given the following input file

"name","address","age"^M
"ram","abcd,^M
def","10"^M
"abhi","xyz","25"^M
"ad","ram,John","35"^M

I would like to obtain the following output by means of a sed/perl/awk script/oneliner.

name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35

Solutions which i have tired it so far For appending with previous line

sed '/^[^"]*"[^"]*$/{N;s/\n//}' sample.txt

for replacing control-m characters

perl -pne 's/\\r//g' sample.txt

But i didn't achieve final output what i required below

zdim
  • 64,580
  • 5
  • 52
  • 81
user1485267
  • 1,295
  • 2
  • 10
  • 19
  • 3
    To get rid of `^M` you can do `dos2unix sample.txt` – Jotne Sep 03 '19 at 05:44
  • 8
    This is CSV format data, you probably DON'T want to get rid of the quotes. This feels like an [XY Problem](http://xyproblem.info). Tell us what you're _REALLY_ trying to accomplish, i.e. why you think you need to remove the quotes. – Jim Garrison Sep 03 '19 at 05:46
  • 1
    `cat yourfile | dos2unix | tr -d '"'` – David C. Rankin Sep 03 '19 at 05:56
  • David tr with removing double quotes it removes all quotes ? – user1485267 Sep 03 '19 at 06:09
  • @david : I want to achieve the output what is shown above. I want to remove quotes where the delimiter value is not present in the data – user1485267 Sep 03 '19 at 06:10
  • I got you, sorry, my quick look looked like you wanted to remove all double-quotes. – David C. Rankin Sep 03 '19 at 06:13
  • 1
    Are you SURE there's `^M`s (`\r`s) inside the quoted fields? If this were a file exported from Excel, for example, then there'd be `\r\n` at the end of each record but inside the quoted fields it'd only be `\n` for the newline, not `\r\n`. In any case, to handle it with awk see [whats-the-most-robust-way-to-efficiently-parse-csv-using-awk](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk). – Ed Morton Sep 03 '19 at 15:02

4 Answers4

5

Use a library to parse CSV files. Apart from always wanting to use a library for that here you also have very specific reasons, with embedded newlines and delimiters.

In Perl a good library is Text::CSV (which wraps Text::CSV_XS if installed). A basic example

use warnings;
use strict;
use feature 'say';

use Text::CSV;

my $file = shift or die "Usage: $0 file.csv\n";

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 }); 

open my $fh, '<', $file  or die "Can't open $file: $!";

while (my $row = $csv->getline($fh)) { 
    s/\n+//g for @$row; 
    $csv->say(\*STDOUT, $row);
}

Comments

  • The binary option in the constructor is what handles newlines embedded in data

  • Once a line is read into the array reference $row I remove newlines in each field with a simplistic regex. By all means please improve this as/if needed

  • The pruning of $row works as follows. In a foreach loop each element is really aliased by the loop variable, so if that gets changed the array changes. I used default where elements are aliased by $_, which the regex changes so $row changes.

    I like this compact shortcut because it has such a distinct look that I can tell from across the room that an array is being changed in place; so I consider it a sort-of-an-idiom. But if it is in fact confusing please by all means write out a full and proper loop

  • The processed output is printed to STDOUT. Or, open an output file and pass that filehandle to say (or to print in older module versions) so the output goes directly to that file

The above prints, for the sample input provided in the question

name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
zdim
  • 64,580
  • 5
  • 52
  • 81
2

FPAT is the way to go using gnu awk, it handles comma separated files.

  1. remove ^m
  2. clean lines
  3. remove qutes

.

dos2unix sample.txt  

awk '{printf "%s"(/,$/?"":"\n"),$0}' sample.txt > tmp && mv tmp sample.txt
"name","address","age"
"ram","abcd,def","10"
"abhi","xyz","25"
"ad","ram,John","35"


awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") $i=substr($i,2,length($i)-2)}1' sample.txt
name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35

All in one go:

dos2unix sample.txt && awk '{printf "%s"(/,$/?"":"\n"),$0}' sample.txt | awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") $i=substr($i,2,length($i)-2)}1'

Normally you set Filed Separator FS or F to tell how filed are separated. FPAT="([^,]+)|(\"[^\"]+\")" FPAT tells how the filed looks like using a regex. This regex is complicated and often used with CSV.

  • (i=1;i<=NF;i++) loop through on by one field on the line.
  • if($i!~",") if it does not contain comma, then
  • $i=substr($i,2,length($i)-2) remove first and last character, the "

If a field for some reason do not contain ", this is more robust:

awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") {n=split($i,a,"\"");$i=(n>1?a[2]:$i)}}1' file

It will not do any thing to a field not contains double quote.

Jotne
  • 40,548
  • 12
  • 51
  • 55
  • 1
    Thanks for your detailed explanation. Can you explain me the last pattern which you have applied. awk -v FPAT="([^,]+)|(\"[^\"]+\")" -v OFS=, '{for (i=1;i<=NF;i++) if($i!~",") $i=substr($i,2,length($i)-2)}1' sample.txt – user1485267 Sep 03 '19 at 06:20
  • @user1485267 added some more info – Jotne Sep 03 '19 at 06:28
  • Thanks for the info. But with third function it doesn't work i got same ouptut what i achived in second step – user1485267 Sep 03 '19 at 07:21
  • Currently awk version which i am trying is GNU Awk 3.1.7 – user1485267 Sep 03 '19 at 07:28
  • Jotne - the OP has a newline within a quoted field, you can't handle that with just FPAT. @user1485267 that version of gawk is 10 years out of date (it was released in 2009), we're now on gawk 5.0.1 - you should update your version as you're missing a TON of great functionality (including FPAT) and some bug fixes, see https://www.gnu.org/software/gawk/manual/gawk.html#Feature-History for the list of features added per release and https://ftp.gnu.org/gnu/gawk/ for the release dates. – Ed Morton Sep 03 '19 at 17:32
  • 1
    @EdMorton That is why I added two awk :) – Jotne Sep 03 '19 at 19:23
  • Ah, I see. I didn't get that the first script was trying to combine lines. Got it, thx. – Ed Morton Sep 03 '19 at 19:33
  • @Jotne : Sorry i cannot update the version. I have to try with the current version itself whatever the possibilities are there – user1485267 Sep 04 '19 at 04:14
  • @user1485267 sounds that you are on an outdated system and should upgrade all. – Jotne Sep 04 '19 at 04:33
2

This might work for you (GNU sed):

sed ':a;/[^"]$/{N;s/\n//;ba};s/"\([^",]*\)"/\1/g' file

The solution is in two parts:

  1. Join broken lines to make whole ones.
  2. Remove double quotes surrounding fields that do not contain commas.

If the current line does not end with double quotes, append the next line, remove the newline and repeat. Otherwise: remove double quotes surrounding fields that do not contain double quotes or commas.

N.B. Supposes that fields do not contain quoted double quotes. If that is the case, the condition for the first step would need to be amended and double quotes within fields would need to catered for.

potong
  • 55,640
  • 6
  • 51
  • 83
  • Thanks for your solution. If i am not sure how many columns come in future. in that case the solution provided by you it wont be feasible. – user1485267 Sep 03 '19 at 09:20
  • @user1485267 I over thought the problem, see amended simplified solution. – potong Sep 03 '19 at 09:23
1

With perl, please try the following:

perl -e '
while (<>) {
    s/\r$//;      # remove trailing CR code
    $str .= $_;
}

while ($str =~ /("(("")|[^"])*"\n?)|((^|(?<=,))[^,]*((?=,)|\n))/g) {
    $_ = $&;
    if (/,/) {    # the element contains ","
        s/\n//g;  # then remove newline(s) if any
    } else {      # otherwise remove surrounding double quotes
        s/^"//s; s/"$//s;
    }
    push(@ary, $_);

    if (/\n$/) {  # newline terminates the element
        print join(",", @ary);
        @ary = ();
    }
}' sample.txt

Output:

name,address,age
ram,"abcd,def",10
abhi,xyz,25
ad,"ram,John",35
tshiono
  • 21,248
  • 2
  • 14
  • 22