7

I have a CSV file containing some line breaks in values. For example:

"Chiffre","Nom","Descriptif court","Tarifs en clair","Période en clair","Adresse 1","CP","Tel","Site","Facebook","Pictos","@Saveurs","@Famille plus","@Img","Accessible en fauteuil roulant en autonomie","Handicapes","Terrasse","Wifi","Chèque Vacances","Titre Restaurant"
6,"Le Chalet d'en Ô","Cuisine traditionnelle (foie gras, magret myrtilles, croustillant chocolat)
Spécialités savoyardes (fondue, tartiflette.. )
Garanti sans burgers.
Tout est fait maison.Cuisine traditionnelle (foie gras, magret myrtilles, croustillant chocolat)
Spécialités savoyardes (fondue, tartiflette.. )
Garanti sans burgers.
Tout est fait maison.","Menu adulte : de 20 à 30 €
Menu enfant : 10 €.

Suggestion du jour le midi en semaine : entrée,plat, dessert : 20€.Menu adulte : de 20 à 30 €
Menu enfant : 10 €.

Suggestion du jour le midi en semaine : entrée,plat, dessert : 20€.","Midi et soir du jeudi au samedi + midi le dimanche et jours fériés.
Juillet et août, midi et soir du mardi au samedi et midi uniquement dimanche et jours fériés.Midi et soir du jeudi au samedi + midi le dimanche et jours fériés.
Juillet et août, midi et soir du mardi au samedi et midi uniquement dimanche et jours fériés.","Le Cropt - Route de Serraval","74230 Les Clefs","+33 4 50 02 09 00","www.lechaletdeno.com",,"A R J X x",,,,,"A","R","J","X","x"
7,"La Cabane - Pisciculture de Montremont","Bar - restaurant au bord de la rivière dans un cadre champêtre avec sa spécialité la truite.","Menu adulte : de 26 à 35 €.","Juin et septembre : du mercredi au dimanche.
Juillet et août : tous les jours à midi + mercredi à samedi le soir.Juin et septembre : du mercredi au dimanche.
Juillet et août : tous les jours à midi + mercredi à samedi le soir.","Pisciculture de Montremont - 100 impasse des Pesetz","74230 Thônes","+33 4 50 02 00 85","pisciculture-montremont.fr",,"A R  X x",,,,,"A","R",,"X","x"

I just need to replace every line break inside a value with a space (or anything else).

I tried a lot of existing solutions, but using \\n as the search term doesn't seem to work, among other issues.

Any idea? Thanks.

Sublime Text would suffice but if another tool is easier, no problem.

Edit: \n works fine if it is the only search term. But I need to match line breaks only between double quotes ("). The best I have so far matches all text between " with: (?<=")[^"]*.

Nico Prat

3 Answers

10

You cannot do that safely with a text editor because the starting and ending delimiters are identical, so any lookaround-based or \G-based solution will not work reliably.

Instead, use a programming language whose regex replace method/function accepts a callback as the replacement argument, and match each double-quoted substring with

"[^"]*(?:""[^"]*)*"

See the regex demo. If you do not need to handle literal (escaped) double quotes, a simplified version is "[^"]+".

Details

  • " - a double quote
  • [^"]* - 0+ chars other than double quotes
  • (?: - start of a non-capturing group that is repeated 0+ times
    • "" - 2 double quotes
    • [^"]* - 0+ chars other than double quotes
  • )* - end of the group, repeated 0+ times
  • " - a double quote.
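
As a quick check of the breakdown above, here is a minimal Python sketch that runs the pattern over a made-up two-record sample (the sample string and the names QUOTED_FIELD and matches are illustrative, not from the question). It only lists what the pattern matches; it does not change anything yet.

```python
import re

# Pattern from the answer: a quoted field, where "" inside the field
# stands for one literal double quote.
QUOTED_FIELD = re.compile(r'"[^"]*(?:""[^"]*)*"')

# Hypothetical sample: the first record has a line break and an escaped
# quote inside a quoted field; the second record is a plain one-liner.
sample = '1,"first line\nsays ""hi""",x\n2,"plain",y\n'

# No capturing groups are used, so findall returns the full matches.
matches = QUOTED_FIELD.findall(sample)
for field in matches:
    print(repr(field))
```

The field containing the line break and the doubled quotes comes back as one match, which is what lets a callback rewrite it as a unit while the record separators outside the quotes stay untouched.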

To use this regex, read the file into a string and apply one of the following:

  • Python: re.sub(r'"[^"]*(?:""[^"]*)*"', lambda m: m.group(0).replace("\n", ""), s)
  • JavaScript: s = s.replace(/"[^"]*(?:""[^"]*)*"/g, function(m) { return m.replace(/\n/g, ''); })
  • PHP: $s = preg_replace_callback('~"[^"]*(?:""[^"]*)*"~', function($m) { return str_replace("\n", "", $m[0]); }, $s)
  • C#: s = Regex.Replace(s, "\"[^\"]*(?:\"\"[^\"]*)*\"", m => m.Value.Replace("\n", ""))

If you need to remove both \r and \n, change the replacement inside the callback: use .replace(/[\r\n]+/g, '') in JS, preg_replace('~\R+~', '', $m[0]) in PHP, m.Value.Replace("\r", "").Replace("\n","") in C# and m.group(0).replace("\r", "").replace("\n", "") in Python.
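
The Python one-liner can be wrapped into a small function; the following is only a sketch under a few assumptions. The name flatten_quoted_newlines and the replacement parameter are made up for illustration, and the default replacement is a space (as the question asks) rather than the empty string used in the snippets above.

```python
import re

# Same pattern as above: a quoted field with "" as an escaped quote.
QUOTED_FIELD = re.compile(r'"[^"]*(?:""[^"]*)*"')


def flatten_quoted_newlines(text, replacement=" "):
    """Replace each run of CR/LF inside double-quoted fields.

    Line breaks outside quotes (the record separators) are left alone.
    """
    def fix(match):
        # The callback sees one whole quoted field at a time.
        return re.sub(r"[\r\n]+", replacement, match.group(0))

    return QUOTED_FIELD.sub(fix, text)


# Hypothetical sample with Windows line endings and a blank line in a field.
sample = '6,"line one\r\nline two","ok"\r\n7,"single","a\n\nb"\r\n'
result = flatten_quoted_newlines(sample)
print(repr(result))
```

When reading from a real file, opening it with open(path, newline="") keeps \r\n sequences as they are instead of translating them, so the record separators are written back unchanged.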

In C#, the full solution will look like

using System.IO;
using System.Text;
using System.Text.RegularExpressions;
...
var file = "path_to_file";
var path_to_save = "path_to_save";
var contents = string.Empty;
// true: detect the encoding from the byte order mark (BOM), if present
using (var sr = new StreamReader(file, true))
{
    contents = sr.ReadToEnd();
}
// Remove CR and LF only inside double-quoted fields
contents = Regex.Replace(contents, "\"[^\"]*(?:\"\"[^\"]*)*\"",
    m => m.Value.Replace("\n", "").Replace("\r", ""));
using (var sw = new StreamWriter(path_to_save, false, Encoding.UTF8))
{
    sw.Write(contents); // the using block closes the writer
}
Wiktor Stribiżew
  • good solution. It loads whole file as a string object. what is the performance if the file is big? – CSK Nov 04 '19 at 18:47
  • @CSK The regex follows the [unroll-the-loop principle](https://stackoverflow.com/a/38018490/3832970) so it is as fast as an NFA regex can be. – Wiktor Stribiżew Nov 04 '19 at 20:03
  • I have no doubt about the regex performance. If the file is 1GB or 10GB. then the String object "contents" will be 1GB or 10GB. not sure how C# handles in this case. – CSK Nov 05 '19 at 18:51
  • and the file can't be read by line because '\n' could be existing in double quote. – CSK Nov 05 '19 at 18:53
  • @CSK I see, I had a workaround at my former work for that. I wrote a C# app that read the file in chunks of some X megabytes and made replacements there, then the chunks were concatenated and the file was split into chunks of another size and ran a replace the second time. – Wiktor Stribiżew Nov 05 '19 at 19:01
  • @WiktorStribiżew : great answer thanks! We are trying to use it but would vastly appreciate a little more help with it. Would mean a lot to us if you could have a look at it here: https://stackoverflow.com/questions/73448343/regex-expression-to-replace-newlines-between-quotation-marks-but-only-if-there – levraininjaneer Aug 23 '22 at 06:55
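
Regarding the large-file concern raised in the comments: a different technique from the regex above is to let a CSV parser do the quote tracking and process one record at a time. The sketch below uses Python's standard csv module; the function name flatten_csv and the sample data are assumptions for illustration. Note that it rewrites the quoting (every output field is quoted), so the output is not byte-for-byte the input minus line breaks.

```python
import csv
import io
import re


def flatten_csv(src, dst, replacement=" "):
    """Copy CSV records from src to dst, replacing line breaks inside fields.

    The reader pulls lines from src as needed, so a record with embedded
    line breaks is reassembled without loading the whole file.
    """
    reader = csv.reader(src)
    # Assumption: quoting every field on output is acceptable.
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        writer.writerow(
            [re.sub(r"[\r\n]+", replacement, field) for field in row]
        )


# Hypothetical in-memory sample standing in for a file on disk.
src = io.StringIO('1,"a\nb",c\n2,"d",e\n')
dst = io.StringIO()
flatten_csv(src, dst)
output = dst.getvalue()
print(output)
```

With real files, both would be opened with newline="" (as the csv module documentation recommends) so that embedded line breaks reach the parser unmodified.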
3

Use Notepad++ regex Find-and-Replace:

Find what:

(,"[^"]*?)[\r\n]+

Replace with:

$1 

(There is a single space after $1)

Repeatedly click "Replace All" until no more matches are found.
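
To see why several passes are needed, the same find/replace can be simulated in Python. This is only a sketch: it assumes Python's re module treats this particular pattern the same way Notepad++'s regex engine does, and the function name replace_until_stable and the sample are made up for illustration.

```python
import re

# The "Find what" pattern: a comma, an opening quote, then as few
# non-quote characters as possible up to a run of line breaks.
PATTERN = re.compile(r'(,"[^"]*?)[\r\n]+')


def replace_until_stable(text):
    """Apply the replacement repeatedly until a pass finds no match."""
    passes = 0
    while True:
        # r'\1 ' mirrors the "$1 " replacement (group 1 plus one space).
        new_text, count = PATTERN.subn(r"\1 ", text)
        if count == 0:
            return text, passes
        text = new_text
        passes += 1


# Hypothetical sample: one field holds two separate line breaks.
sample = '1,"a\nb\nc","x"\n2,"d","y"\n'
result, passes = replace_until_stable(sample)
print(repr(result), passes)
```

Each pass can only fix the first remaining break in a field, because the match must start at the comma and opening quote; the second break in the sample is picked up on the next pass. The pattern also requires a comma before the quote, so a quoted field at the very start of a line would not be covered.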

Josh Withee
1

For anyone arriving here looking for a system shell answer: you can use the sed answer from Alister, from this unix.com link, or perl.

sed:

sed -n 'H;g;/^[^"]*"[^"]*\("[^"]*"[^"]*\)*$/d; s/^\n//; y/\n/ /; p; s/.*//; h' current.csv > new.csv

perl:

perl -0777 -pe '
  s{".*?"}{
    ($r = $&) =~ s/\n/ /g;
    $r
  }gsem' < current.csv > new.csv
Ferroao