Perl, use regex to find a match and replace just the last character of the match (in this case a line break)

Question

I have to clean several CSV files before I put them in a database. Some of the files have an unexpected line break in the middle of a line. Since a line should always end with a number, I managed to fix the files with this one-liner:

perl -pe 's/[^0-9]\r?\n//g'

While it did work, it also removes the last character before the line break:

foob
ar

turns into

fooar

Is there a Perl one-liner I can call that follows the same rule without removing the last character before the line break?

zdim · Answer 1 · 2022-10-17T07:48:44.407

One way is to use \K, a form of lookbehind:

perl -pe 's/[^0-9]\K\r?\n//g'

With \K, everything matched up to that point is kept out of the match, so only what follows \K is replaced.

However, I'd rather recommend processing your CSV with a library, even though it takes a little more code. There's already been one problem, that linefeed inside a field; what else may be there? A good library can handle a variety of irregularities.

A simple example with Text::CSV

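use warnings;
use strict;
use feature 'say';

use Text::CSV;

my $file = shift or die "Usage: $0 file.csv\n";

# binary => 1 allows embedded newlines in quoted fields; auto_diag reports parse errors
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });

open my $fh, '<', $file  or die "Can't open $file: $!";

while (my $row = $csv->getline($fh)) {
    s/\n+//g for @$row;           # remove linefeeds inside each field
    $csv->say(\*STDOUT, $row);    # write the cleaned row, with a line ending
}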

Consider other constructor options, also available via accessors, that help with all kinds of unexpected problems; allow_whitespace, for example.

This can be done as a command-line program ("one-liner") as well, if there is a reason for that. The library's functional interface, via csv, is then convenient:

perl -MText::CSV=csv -we' 
   csv in => *ARGV, on_in => sub { s/\n+//g for @{$_[1]} }' filename

With *ARGV, the input is taken either from a file named on the command line or from STDIN.

bobble bubble · Answer 2 · 2022-10-16T09:57:25.550

A negative lookbehind, which is an assertion and won't consume characters, can also be used.

(?<!\d)\R

\d is shorthand for a digit
\R matches any linebreak sequence

See this demo at regex101

edited Oct 16 '22 at 09:57

answered Oct 16 '22 at 09:03

Thanks for sharing a nice regex. So I assume `\R` matches literal newlines? I asked another expert the other day, but honestly I couldn't understand it, cheers. – RavinderSingh13 Oct 16 '22 at 09:38
@RavinderSingh13 There is a [nice answer from @hwnd about `\R`](https://stackoverflow.com/a/18992691/5527985). It matches any linebreak sequence (added this better link to my answer). – bobble bubble Oct 16 '22 at 09:43

Bohemian · Answer 3 · 2022-10-16T17:36:34.397

Just capture the last char and put it back:

perl -pe 's/([^0-9])\r?\n/$1/g'

edited Oct 16 '22 at 17:36

answered Oct 16 '22 at 02:41
