4

I've got a CSV file which gives problems importing because of the fields containing new line characters. As CSV importers treat every newline as a new row, the newlines in the fields mess things up.

So I want to replace every newline within double quotes with <br> leaving the 'real' newlines outside double quotes untouched.

First step would be to be able to create a regexp to get the newlines within the text file.

(\n|\r)

But after this I get lost, because I can't find the xxx within this expression:

(")(xxx)(\n|\r)(xxx)(") 

And a solution to let the 'real' newlines untouched.

Concluding: How can I replace every newline within a field, enclosed by double quotes with a <br>

I'm now using sublime text to test the regex but later it will be used in a php application.

stUrb
  • 6,612
  • 8
  • 43
  • 71

5 Answers5

5

Search pattern:

("[^"\n]*)\r?\n(?!(([^"]*"){2})*[^"]*$)

Replacement Pattern:

$1<br>

RegEx Demo

anubhava
  • 761,203
  • 64
  • 569
  • 643
  • Almost!!! When you have multiple lines the newline at the end will also be replaced: http://regex101.com/r/wL9sQ4/1 – stUrb Oct 13 '14 at 10:41
  • Oh my this seems to be working !! Now just figuring out _how_ it works :) – stUrb Oct 13 '14 at 10:47
  • Glad to know it worked out. Just for more info on regex `(?!(([^"]*"){2})*[^"]*$)` is a lookahead that says don't match if **there are even number of quotes** ahead of the match. – anubhava Oct 13 '14 at 11:02
  • What if there are several newlines between quotes ? It works only when there is only one new line ( – Fəqan Çələbizadə Jan 08 '18 at 17:45
  • @FəqanÇələbizadə: Difficult to understand it from your comments. Feel free to post a question so that I can try to answer it. – anubhava Jan 08 '18 at 18:10
1

Could use preg_replace_callback() to match the quoted parts with an anonymous function:

$str = preg_replace_callback('~"[^"]+"~', function ($m) {
  return preg_replace('~\r?\n~', "<br>", $m[0]);
}, $str);

$m[0] corresponds to what's matching the whole pattern.

Test at eval.in; Regex FAQ

Community
  • 1
  • 1
Jonny 5
  • 12,171
  • 2
  • 25
  • 42
  • Is it possible to put this in one regex? The two separate regexs work perfect, but it would be very handy if it was only one regex – stUrb Oct 13 '14 at 10:44
  • 1
    @stUrb see anubhava solution for one regex :) Chose this one as it's simple. – Jonny 5 Oct 13 '14 at 10:56
  • @stUrb Oooh, somehow I thought the quotes should be removed :) Then it's even simpler, don't need capture group, see updated answer. – Jonny 5 Oct 13 '14 at 13:36
1

CSV format is complicated and, in the general case, cannot be parsed reliably with regular expressions. My suggestion is to use a parser, for example, parsecsv-for-php handles quoted newlines quite decently:

$str = <<<EOF
42,okay,"okay too","here
be
""dragons""!",43

EOF;

$csv = new parseCSV();
$csv->heading = false;
$result = $csv->parse_string($str);

result:

array(1) {
  [0]=>
  array(5) {
    [0]=>
    string(2) "42"
    [1]=>
    string(4) "okay"
    [2]=>
    string(8) "okay too"
    [3]=>
    string(18) "here
be
"dragons"!"
    [4]=>
    string(2) "43"
  }
}
georg
  • 211,518
  • 52
  • 313
  • 390
  • Hmm that's an interesting library! I definitely am going to look into using this library! – stUrb Oct 13 '14 at 10:56
1

I was facing same issue when importing a csv file using html5 drag and drop. I used below (javascript) fix and it is worked for me.

//Fix: if there are any line breaks in between column values, then that column value will be wrapped in double quotes
//first get all the double quoted values list using regex
var lineBreakList = textcontent.match(/(["'])(\\?)[\s\S]*?\1/g);
//search for line breaks using regex, if found replace it with empty value
lineBreakList.forEach(function (item) {
    textcontent = textcontent.replace(item, item.replace(/\r?\n|\r/g, ''));
});

Hope this will help someone :)

Naresh Chennuri
  • 1,133
  • 11
  • 10
0

Try this:

<?php

  $line = "\"abc\",\"\n\",\"def\"\n";
  $line = preg_replace('/"(.*?)\\n(.*?)"/', "\"$1<br>$2\"", $line);

?>

With this input:

"abc","\n","def"\n

it prints this output:

"abc","<br>","def"\n
MarcoS
  • 17,323
  • 24
  • 96
  • 174