
I have to clean several CSV files before I put them in a database. Some of the files have an unexpected line break in the middle of a line. Since a line should always end with a number, I managed to fix the files with this one-liner:

perl -pe 's/[^0-9]\r?\n//g'

While it did work, it also removes the last character before the line break:

foob
ar

turns into

fooar

Is there a Perl one-liner that follows the same rule without removing the last character before the line break?

Life-orb

3 Answers


One way is to use \K, which acts as a form of lookbehind:

perl -pe 's/[^0-9]\K\r?\n//g'

With \K, everything matched up to that point is kept out of the match, so only what follows \K is subject to the replacement.


However, I'd rather recommend processing your CSV with a library, even though it's a little more code. There has already been one problem, the linefeed inside a field; what else may be there? A good library can handle a variety of irregularities.

A simple example with Text::CSV:

use warnings;
use strict;
use feature 'say';

use Text::CSV;

my $file = shift or die "Usage: $0 file.csv\n";

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 }); 

open my $fh, '<', $file  or die "Can't open $file: $!";

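# getline returns an array reference with the fields of the next record
while (my $row = $csv->getline($fh)) {
    s/\n+//g for @$row;          # remove line breaks inside each field
    $csv->say(\*STDOUT, $row);   # print the cleaned record as a CSV line
}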

Consider other constructor options, also available via accessors, that help with all kinds of unexpected problems, such as allow_whitespace.

This can be done as a command-line program ("one-liner") as well, if there is a reason for that. The library's functional interface via csv is then convenient:

perl -MText::CSV=csv -we' 
   csv in => *ARGV, on_in => sub { s/\n+//g for @{$_[1]} }' filename

With *ARGV the input is taken either from a file named on the command line or from STDIN.

zdim

A negative lookbehind, which is an assertion and does not consume characters, can also be used:

(?<!\d)\R
  • \d is shorthand for a digit
  • \R matches any linebreak sequence

See this demo at regex101

bobble bubble
  • Thanks for sharing nice regex. So I assume `\R` is matching literal new lines? I asked other day 1 more expert but I couldn't understand it honestly, cheers. – RavinderSingh13 Oct 16 '22 at 09:38
  • @RavinderSingh13 There is a [nice answer from @hwnd about `\R`](https://stackoverflow.com/a/18992691/5527985). It matches any linebreak sequence (added this better link to my answer). – bobble bubble Oct 16 '22 at 09:43

Just capture the last char and put it back:

perl -pe 's/([^0-9])\r?\n/$1/g'
Bohemian