I am in troubles with a regexp to remove some \n

Question

Im trying to define a regexp to remove some carriage return in a file to be loaded into a DB.

Here is the fragment

200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP
";"";Hamilton;"";;0;0;0;1;1;"";

This is the regexp I used in https://regex101.com/

(;"[[:alnum:] ]+)[\n]+([[:alnum:] ]*)"

Which should get two groups, one before and one after some newline.

Looking at regexp101, it informs that the groups are correctly captured

But the result is wrong, because it still introduce an invisible new line as follow

I also try to use sed but the result is exactly the same.

So, the question is: Where am I wrong?

Does this answer your question? [Unix replace new lines inside quotes](https://stackoverflow.com/questions/33026137/unix-replace-new-lines-inside-quotes) — miken32, May 04 '22 at 15:24
See also [here](https://stackoverflow.com/questions/29150640/how-to-remove-new-lines-within-double-quotes) and [here](https://stackoverflow.com/questions/26406281/replace-new-line-character-between-double-quotes-with-space) — miken32, May 04 '22 at 15:26

choroba · Answer 1 · 2022-05-04T14:40:38.210

1

sed is line based. It's possible to achieve what you want, but I'd rather use a more suitable tool. For example, Perl:

perl -pe 's/\n/<>/e if tr/"// % 2 == 1' file.csv

-p reads the input line by line, running the code for each line before outputting it;
The /e option interprets the replacement in a substitution as code, in this case replacing the final newline with the following line (<> reads the input)
tr/"// in numeric context returns the number of matches, i.e. the number of double quotes;
If the number is odd, we remove the newline (% is the modulo operator).

The corresponding sed invocation would be

sed '/^\([^"]*"[^"]*"\)*[^"]*"[^"]*$/{N;s/\n//}' file.csv

on lines containing a non-paired double quote, read the next line to the pattern space (N) and remove the newline.

Update:

perl -ne 'chomp $p if ! /^[0-9]+;/; print $p; $p = $_; END { print $p }' file.csv

This should remove the newlines if they're not followed by a number and a semicolon. It keeps the previous line in the variable $p, if the current line doesn't start with a number followed by a semicolon, newline is chomped from the previous line. The, the previous line is printed and the current line is remembered. The last line needs to be printed separately as there's no following line for it to make it printed.

edited May 04 '22 at 14:40

answered May 04 '22 at 10:50

choroba

231,213
25
204
289

No, it doesn't run. I have multiple lines, and your perl broken the file. For example I have `sfully.;133920765814;XZZ64b49c844353cE47C7V7UEKERFFYB;;2021-12-05 14:02:48;64499;2;;1843;1;;;;;;;2021-12-05 14:02:50.524693;frontend-01;""220;GBP;`. If you look at the end, the `220;GBP` was the start of a new line! – Paolo Di Pietro May 04 '22 at 13:08
That's quite different to what you showed - almost no double quotes! – choroba May 04 '22 at 13:27
@choroba there are a billion of lines. each one starts with a number,a semicolon, and a currency like "200;EUR" If a line ends with a \n and it is not followed by a number and a semicolon, the newline must be removed. What is after the newline must be preserved. – Paolo Di Pietro May 04 '22 at 14:15
What about `perl -ne 'chomp $p if ! /^[0-9]+;/; print $p; $p = $_; END { print $p }'` then? – choroba May 04 '22 at 14:38

score 0 · Answer 2 · edited May 04 '22 at 13:43

0

perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for@{$_[1]}}))' file.csv

will remove trailing newlines from every field in the CSV (with sep ;) and spit out correct CSV (with sep ,). If you want ; in to output too, use

perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for@{$_[1]}}),sep=>";")' file.csv

edited May 04 '22 at 13:43

ikegami

367,544
15
269
518

answered May 04 '22 at 13:20

Tux

9
1

Your solution convert empty strings to NULL, and it's overly complex to use two calls to `csv`. See [my answer](https://stackoverflow.com/a/72113950/589924) – ikegami May 04 '22 at 13:41

ikegami · Answer 3 · 2022-05-04T13:47:11.850

It's usually best to use an existing parser rather than writing your own.

I'd use the following Perl program:

perl -MText::CSV_XS=csv -e'
   csv
      in             => *ARGV,
      sep            => ";",
      blank_is_undef => 1,
      quote_empty    => 1,
      on_in          => sub { s/\n//g for @{ $_[1] }; };
' old.csv >new.csv

Output:

200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP";"";Hamilton;"";;0;0;0;1;1;"";

If for some reason you want to avoid XS, the slower Text::CSV is a drop-in replacement.

I am in troubles with a regexp to remove some \n

3 Answers3

Update: