-1

Have a large csv file with some line feeds in quoted columns. I need to run a shell script, sed, awk, perl are fine and replace only the line feeds inside quotes with a space. The end of line must be preserved and I don't know the number of columns or which fields might have these embedded line feeds.

Further examination of the file shows this as a result of cat -v test_lf.csv

"NORTH ?M-^@?OLMSTED"
"PORT?M-^@?ST?M-^@?LUCIE"

in the csv file which in excel shows a line break where the ?M-^@? is.

I'd like to use tr to replace whatever that is with a space. How would I do that? What is that sequence?

I've now found that a small part of the file looks like the following in characters and in hex.

13:20:59 [woo:~/Development/scripts] > tail -n 8 test_lf.csv | head -n 1 | od -x
0000000      3431    3136    3439    3039    2c39    542c    4c45    3543
0000020      5f36    3430    2c47    4c46    332c    3934    3338    312c
0000040      3634    3931    3934    3930    222c    4f50    5452    80e2
0000060      53a8    e254    a880    554c    4943    2245    462c    2c4c
0000100      3433    3839    2c33    3737    2c32    3237    2c37    3535
0000120      2c2c    5441    334c    2c51    6e43    7463    222c    6f4e
0000140      80e2    4da8    6773    2c22    2c34    3832    312c    0d35
0000160      000a                                                        
0000161
13:21:50 [woo:~/Development/scripts] > tail -n 8 test_lf.csv | head -n 1 | od -c
0000000    1   4   6   1   9   4   9   0   9   ,   ,   T   E   L   C   5
0000020    6   _   0   4   G   ,   F   L   ,   3   4   9   8   3   ,   1
0000040    4   6   1   9   4   9   0   9   ,   "   P   O   R   T   
      **
0000060   **   S   T   
      **  **   L   U   C   I   E   "   ,   F   L   ,
0000100    3   4   9   8   3   ,   7   7   2   ,   7   2   7   ,   5   5
0000120    ,   ,   A   T   L   3   Q   ,   C   n   c   t   ,   "   N   o
0000140    
      **  **   M   s   g   "   ,   4   ,   2   8   ,   1   5  \r
0000160   \n  

I'm on a mac and the file is utf-8, and my locale is utf-8. It appears that the bytes are swapped (little-endian), so hexadecimal 3431 3136 is 1463 in the character representation. So, starting at byte 60 in this output, we have: something, S, T, blank, something, something, L and in hex this is: 53a8 e254 a880 554c, 53 is an S, 54 is a T, and 4c is an L . So, there is an e2, a8, 80 sequence between the T and the L. This produces a line-feed in the Excel spreadsheet field.
How do I find and replace these bytes by a space?

John Wooten
  • 685
  • 1
  • 6
  • 21
  • And you've tried... what, exactly? – Matt Jacob Jun 07 '18 at 20:08
  • awk -F"\"" '!$NF{print;next}{printf("%s ", $0)}' file but this is getting rid of the terminal line feed, it looks like. – John Wooten Jun 07 '18 at 20:10
  • What need be done with that CSV then? To merely correct it there's [glenn's answer](https://stackoverflow.com/a/50749450/4653379) but note that you might as well use the module to read the file in (as shown in the answer) since the module doesn't have a problem with newlines, and then process as you would. – zdim Jun 07 '18 at 20:12
  • It has to be run through a large R program and has to read it first as text, then apply a lot of regex's, then parse the lines that pass that as csv. The reading in as text breaks because of the embedded line feeds. They don't matter as far as the R analysis goes, just the information in them. – John Wooten Jun 07 '18 at 20:15
  • OK. Then @glenn answer fixes those newlines. (Perhaps still consider whether you can offload some of that processing to Perl which is far better for string mangling of all kinds.) – zdim Jun 07 '18 at 20:21
  • In general, see https://stackoverflow.com/q/45420535/1745001 for how to do whatever you like with a CSV using any awk in any shell on any UNIX box. If you provide a [mcve] with concise, testable sample input and expected output and show us what you've tried so far then we can help you. – Ed Morton Jun 07 '18 at 21:01

3 Answers3

3

I'd use the Perl module Text::CSV

#!/usr/bin/perl

use strict;
use warnings;
use feature qw/say/;
use open IO => ':encoding(utf8)';
use open ':std';
use Text::CSV;

my $file = shift @ARGV;
open my $fh, "<", $file or die "cannot open $file: $!\n";

my $csv = Text::CSV->new({binary => 1});

while (my $row = $csv->getline($fh)) {
    my @no_newlines = map {s/\n/ /g; $_} @$row;
    $csv->combine(@no_newlines);
    say $csv->string();
}

close $fh;

And then you can run it like:

/path/to/csvfixer.pl file.csv > fixed.csv
glenn jackman
  • 238,783
  • 38
  • 220
  • 352
  • Where do I get the Text/CSV.pm ? – John Wooten Jun 07 '18 at 20:20
  • https://metacpan.org/pod/release/ISHIGAKI/Text-CSV-1.95/lib/Text/CSV.pm – Jason Jun 07 '18 at 20:24
  • @JohnWooten For every OS there is a system for easy installation of modules (so you don't have to use the tarball from CPAN!). On Linux there is `cpan` (or `cpanm`, easier) or you can get packages for a distro you use (for RHEL it's `Perl-Text-CSV...rpm` for example). Same for Windows (but I'm not familiar with installation tools.) – zdim Jun 07 '18 at 20:26
  • Found ppm on my machine and did a push(@INC,path) before the use Text::CSV; Got this result fix.pl test_lf.csv > fixed.csv Use of /c modifier is meaningless without /g at ./fix.pl line 8. Regexp modifiers "/l" and "/a" are mutually exclusive at ./fix.pl line 8, at end of line Regexp modifier "/l" may not appear twice at ./fix.pl line 8, at end of line syntax error at ./fix.pl line 8, near "/)" BEGIN not safe after errors--compilation aborted at ./fix.pl line 9. – John Wooten Jun 07 '18 at 20:35
  • @JohnWooten Try to install the module normally (ActiveState?). You absolutely shouldn't need nor use `push @INC...`. – zdim Jun 07 '18 at 20:40
  • My ppm isn’t in a standard place – John Wooten Jun 07 '18 at 20:40
  • @JohnWooten That is why I suggest to "_install the module normally_" so that it is where it's expected. (If there are reasons why you can't then that's a different question.) – zdim Jun 07 '18 at 20:50
  • Not ideal, but you could download the pure perl version (https://metacpan.org/source/ISHIGAKI/Text-CSV-1.95/lib/Text/CSV_PP.pm), store it in a file in the same directory as your script file and do `do "./CSV_PP.pm";` – glenn jackman Jun 07 '18 at 20:51
  • I did the ppm install Text-CSV after locating my ppm. the error message persisted that Text/CSV was not in the @INC, so I tried to explicit add it, to no avail. – John Wooten Jun 07 '18 at 20:51
  • You can use `use lib '/directory/to/my/lib';` followed by the `use Text::CSV;`. By the way, this didn't work for me. It just chopped AT the new line. I don't know if it is wrong, but I have had issues like this before. My Perl is super outdated and I am on a system with no root privileges (hence my non-standard installation method). – Jason Jun 07 '18 at 21:14
  • @Jason, for your field with the newline, is the field quoted? – glenn jackman Jun 07 '18 at 21:26
  • @glennjackman Sort of... It actually chopped the beginning of the line up to the line-feed. So, the closing quote is there. I have had similar issues parsing line-feeds within text qualifiers in Perl. I am almost certain it is just out of date modules. This led me to just write my own csv parser in C++. – Jason Jun 08 '18 at 13:37
  • Thanks, managed to get ppm installed and use the code above. If you read some of my comments, you'll find some of the issues. Numbers, on a mac, uses a unicode sequence to denote an embedded break in a field. Excel ignores this and just shows special characters there. If you quote the field in the csv, Excel will read it in ignoring the embedded breaks, but they are there in the csv file. – John Wooten Jun 11 '18 at 14:51
0

After a lot of reading hex dumps of the csv file, looking at the differences in the way Numbers on Macs and Excel on PCs treat embedded breaks, it appears that a simple portable way to locate and change the encoding for breaks is not worth the effort. If the data fields containing breaks are enclosed in quotes, then Excel will read them fine as will read.csv in R.

John Wooten
  • 685
  • 1
  • 6
  • 21
-1

Assuming that the quote character is the double quote " and the escape character is the double quote too, you can do it with sed if you translate all line-feeds to spaces when the number of double quotes is even in the pattern space. When the number is odd, you only have to append the next line.

sed ':a;/^\([^"]*"[^"]*"\)*[^"]*$/!{N;ba};y/\n/ /' file.csv

details:

:a    # define the label "a"
/^\([^"]*"[^"]*"\)*[^"]*$/! # if not an even number of quotes
{
    N  # append the next line to the pattern space
    ba # go to label "a"
}
y/\n/ / # translate all line-feeds to spaces

If quotes aren't well balanced, the default behaviour is to not proceed the last quoted part. You can change it writing:

sed ':a;${y/\n/ /;s/$/"/};/^\([^"]*"[^"]*"\)*[^"]*$/!{N;ba};y/\n/ /' file.csv
Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125