How do I replace all line breaks (enters) between two quotes in a text file? The first quote is always preceded by a tab or is the first character on the line (it is a CSV file). I tried the following regex:

/(\t"|^")([^"]*)(\n)([^"]*")/gm

but it only matches the first line break between a pair of quotes, not all of them.

For example, the following text:

xx "xx 
xx 
xx" 
xx 
"xx"
xx 
xx
"xxx xxx 
xx"

should become

xx "xx xx xx" 
xx 
"xx"
xx 
xx
"xxx xxx xx"

I read the following post (javascript regex replace spaces between brackets), which is very similar, but the regex suggested there is not usable in my situation.

Nebu
  • Which language is this? Javascript? Also, if you have a CSV file, use a CSV parser. – Tomalak Jun 07 '16 at 09:22
  • A regular expression to handle that at once will probably become very ugly and slow. Consider a multi-pass approach: 1. extract all quoted texts; 2. replace _all_ `\n` in the quoted texts; 3. reassemble the non-quoted parts with the corrected quoted parts. – Good Night Nerd Pride Jun 07 '16 at 09:27
  • @Tomalak I updated the question; JavaScript is fine. I am using a CSV parser, but it throws an error because of a line break in the wrong position. – Nebu Jun 07 '16 at 09:30
  • Then use a better parser. For example, http://papaparse.com/ deals with quoted values and line breaks in values just fine. Don't use regex for this. – Tomalak Jun 07 '16 at 11:33

2 Answers

With JavaScript's String.replace you can pass a function as the replacement.

var str = 'foo \n"a\n" bar\n';

str = str.replace(/"[^"]+"/g, function(m) {
 return m.replace(/\n/g, ' ');
});

console.log(str);

The regex "[^"]+" will match quoted stuff with one or more non-quotes in between.

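Add conditions such as tab or line start to the pattern as needed: (?:\t|^)"[^"]+" (use the m flag so that ^ matches at the start of every line, not just the start of the string).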
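
For illustration, here is a minimal sketch of that combined pattern applied to a sample like the one in the question. The function name joinQuotedLines is hypothetical, and the sample assumes a tab before the first quote and no trailing spaces before the line breaks, so each line break is replaced by a single space.

```javascript
// Hypothetical helper: joins lines inside quoted values that start
// after a tab or at the beginning of a line.
function joinQuotedLines(text) {
  // g: replace every quoted value; m: let ^ match after each line break.
  return text.replace(/(?:\t|^)"[^"]+"/gm, function (quoted) {
    // Only the matched quoted value is touched here.
    return quoted.replace(/\n/g, ' ');
  });
}

// Assumed sample, modeled on the question (tab before the first quote).
var sample = 'xx\t"xx\nxx\nxx"\nxx\n"xx"\nxx\nxx\n"xxx xxx\nxx"';
console.log(joinQuotedLines(sample));
```

If the file uses Windows line endings, the inner pattern would probably need to be /\r?\n/g instead.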

bobble bubble
  • This approach is very similar to Abbondanza's suggestion. The idea is interesting; the only downside is that all data between quotes is processed (including quoted text without line breaks). Is there any way to avoid this? – Nebu Jun 07 '16 at 12:02
  • @Nebu You can modify the pattern to require at least one newline `\n` inside the quotes, like this: [`(?:\t|^)"[^"\n]*\n[^"]+"`](https://regex101.com/r/qP4cW1/1) (it also requires `^` start or `\t` before the quote). You need to test whether this makes it considerably faster on your input. – bobble bubble Jun 07 '16 at 16:53
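
A rough sketch of the newline-requiring variant from the comment above, with a counter to show how often the replacement function runs. The function name and the returned object shape are hypothetical. One deliberate change from the comment's pattern: the part after the first newline uses [^"]* instead of [^"]+, on the assumption that a line break directly before the closing quote should also be caught.

```javascript
// Hypothetical helper: only quoted values that contain a line break
// are handed to the replacement function.
function joinOnlyMultilineQuoted(text) {
  var calls = 0;
  // "[^"\n]*\n  = opening quote, rest of that line, then a newline;
  // [^"]*"      = anything up to the closing quote (may be empty).
  var result = text.replace(/(?:\t|^)"[^"\n]*\n[^"]*"/gm, function (quoted) {
    calls += 1;
    return quoted.replace(/\n/g, ' ');
  });
  return { result: result, calls: calls };
}

// Assumed sample: three quoted values, two of which span several lines.
var out = joinOnlyMultilineQuoted('xx\t"xx\nxx\nxx"\nxx\n"xx"\nxx\nxx\n"xxx xxx\nxx"');
console.log(out.calls);
console.log(out.result);
```

Whether this is actually faster on a large file still has to be measured; the regex engine scans the single-line quoted values either way, only the callback is skipped for them.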
\n(?=[^"]*"(?:[^"]*"[^"]*")*[^"]*$)

You can use this pattern and replace each match with an empty string.


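// Matches a newline only if an odd number of quotes follows it up to the end of the string
var re = /\n(?=[^"]*"(?:[^"]*"[^"]*")*[^"]*$)/g;
var str = 'xx "xx \nxx \nxx" \nxx \n"xx"\nxx \nxx\n"xxx xxx \nxx"';
var subst = '';

var result = str.replace(re, subst);
console.log(result);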
vks
  • This works for small files. Unfortunately, my text files are rather large (50000+ rows). For these files this regex requires too many steps. – Nebu Jun 07 '16 at 11:54
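
For large inputs, one possible alternative to the lookahead regex (not taken from either answer) is a single left-to-right scan that toggles a flag on every quote character. This is only a sketch under stated assumptions: the function name is hypothetical, quotes are assumed to be balanced, and every " is treated as a delimiter regardless of whether a tab or line start precedes it.

```javascript
// Hypothetical helper: one pass over the text, linear in its length.
function replaceNewlinesInQuotes(text, replacement) {
  if (replacement === undefined) replacement = '';
  var parts = [];
  var inQuotes = false; // assumption: every quote toggles this state
  var start = 0;        // start of the chunk not yet copied to parts
  for (var i = 0; i < text.length; i++) {
    var ch = text[i];
    if (ch === '"') {
      inQuotes = !inQuotes;
    } else if (ch === '\n' && inQuotes) {
      // Copy everything before the newline, then the replacement.
      parts.push(text.slice(start, i), replacement);
      start = i + 1;
    }
  }
  parts.push(text.slice(start));
  return parts.join('');
}

var str = 'xx "xx \nxx \nxx" \nxx \n"xx"\nxx \nxx\n"xxx xxx \nxx"';
console.log(replaceNewlinesInQuotes(str, ''));
```

A doubled quote inside a value ("") toggles the flag twice and so leaves the state unchanged, but an unbalanced quote would flip the state for the rest of the file.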