
I have a large CSV file that I am parsing in Java. The problem is that some of the text fields, which are enclosed in double quotes (""), contain line breaks. I am trying to remove all line breaks inside the quoted sections, but have not been successful so far.

For example, I have the following CSV:

"Test Line wo line break"; "Test Line 
with line break"
"Test Line2 wo line break"; "Test Line2 
with line break"

The result should be:

"Test Line wo line break"; "Test Line with line break"
"Test Line2 wo line break"; "Test Line2 with line break"

I have tried the following so far:

s.replaceAll("(\\w)*\r\n", "$1");

But this, unfortunately, removes all line breaks, including the ones at the end of each line.

Then I added the double quotes to the regex:

s.replaceAll("\"(\\w)*\r\n\"", "$1");

But with this, unfortunately, nothing gets replaced at all.

Can you please help me find out what I am doing wrong here?

Thanks in advance

Miss Chanandler Bong
TerenceJackson

2 Answers


You can match all substrings between double quotation marks with a simple "[^"]*" regex and remove the line breaks inside each match:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "\"Test Line wo line break\"; \"Test Line \nwith line break\"\n\"Test Line2 wo line break\"; \"Test Line2 \nwith line break\"";
StringBuffer result = new StringBuffer();
Matcher m = Pattern.compile("\"[^\"]*\"").matcher(s);
while (m.find()) {
    // Strip line breaks from the quoted match; quoteReplacement keeps any $ or \ in it literal
    m.appendReplacement(result, Matcher.quoteReplacement(m.group().replaceAll("\\R+", "")));
}
m.appendTail(result);
System.out.println(result.toString());

Or, starting with Java 9, you can use slightly shorter code:

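import java.util.regex.Matcher;
import java.util.regex.Pattern;
String s = "\"Test Line wo line break\"; \"Test Line \nwith line break\"\n\"Test Line2 wo line break\"; \"Test Line2 \nwith line break\"";
Matcher m = Pattern.compile("\"[^\"]*\"").matcher(s);
// r is the MatchResult of the current match; quoteReplacement keeps any $ or \ in it literal
s = m.replaceAll(r -> Matcher.quoteReplacement(r.group().replaceAll("\\R+", "")));
System.out.println(s);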

Output:

"Test Line wo line break"; "Test Line with line break"
"Test Line2 wo line break"; "Test Line2 with line break"

See the Java demo online / Java code demo #2.

Note that .replaceAll("\\R+", "") matches one or more consecutive line break sequences (\R is available since Java 8) and removes them only from the text that "[^"]*" matched.
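
As a quick check of that note, here is a minimal sketch that applies the same approach to input with Windows-style \r\n line breaks, as in the question. The class and method names (QuotedLineBreaks, strip) are made up for illustration. Matcher.quoteReplacement is included on the assumption that a quoted field may contain $ or \ characters, which appendReplacement would otherwise interpret as group references or escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedLineBreaks {
    // Matches a double-quoted section without embedded quotes (no escape support).
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    // Hypothetical helper: removes line breaks only inside quoted sections.
    static String strip(String csv) {
        Matcher m = QUOTED.matcher(csv);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // \R treats \r\n as a single line break, so both characters are removed.
            String cleaned = m.group().replaceAll("\\R+", "");
            // Keep '$' and '\' in the field literal in the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(cleaned));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String csv = "\"a\"; \"b \r\nc\"\r\n\"d\"; \"costs $5 \r\nnet\"";
        System.out.println(strip(csv));
    }
}
```

The \r\n between the two rows sits outside any quoted match, so it is left in place; only the breaks inside "b ... c" and "costs $5 ... net" are removed.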

Escape sequence support between double quotation marks

If your strings between double quotes can contain escape sequences, you need to use a different pattern:

Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"", Pattern.DOTALL)

Note the Pattern.DOTALL option: it allows . to match line break chars.

Details:

  • " - a " char
  • [^"\\]* - zero or more chars other than " and \ chars
  • (?:\\.[^"\\]*)* - zero or more sequences of a \ and any char after it, followed by zero or more chars other than " and \ chars
  • " - a " char.
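
The pattern described above could be used as in the following sketch, which assumes that quotes inside a field are escaped with a backslash (\"). The class and method names are hypothetical. Matcher.quoteReplacement matters here: without it, the backslashes in the matched text would be consumed as escapes by appendReplacement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapedQuotedLineBreaks {
    // Quoted section in which a backslash escapes the next character (including a quote).
    private static final Pattern QUOTED = Pattern.compile(
            "\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"", Pattern.DOTALL);

    // Hypothetical helper: removes line breaks inside escape-aware quoted sections.
    static String strip(String text) {
        Matcher m = QUOTED.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String cleaned = m.group().replaceAll("\\R+", "");
            // Preserve the backslashes of the escape sequences in the output.
            m.appendReplacement(out, Matcher.quoteReplacement(cleaned));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Second field contains: say \"hi\" <line break>now
        String text = "\"plain\"; \"say \\\"hi\\\" \nnow\"\n\"next\"";
        System.out.println(strip(text));
    }
}
```

The escaped quotes do not end the match, so the line break after \"hi\" is still treated as being inside the field and is removed, while the break before "next" is kept.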
Wiktor Stribiżew

I wouldn't recommend parsing CSV yourself if you can avoid it. In general, parsing raw text often becomes a hassle because you need to deal with all sorts of exceptions. For instance, you quite easily reach the point where regular expressions are not enough and you need to be able to parse context-free grammars.

Some library options for parsing CSV are listed here: CSV parsing in Java - working example..?
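
If a library is not an option, the point about context can be illustrated without regular expressions by tracking whether the scanner is currently inside a quoted field. This is only a rough sketch, not a full CSV parser: the names are hypothetical, and it assumes that quotes inside a field are escaped by doubling them ("").

```java
final class QuoteAwareScanner {
    // Hypothetical helper: drops '\r' and '\n' while inside a quoted field.
    static String strip(String csv) {
        StringBuilder out = new StringBuilder(csv.length());
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // A doubled quote ("") toggles twice, so the state ends up unchanged.
                inQuotes = !inQuotes;
            }
            if (inQuotes && (c == '\r' || c == '\n')) {
                continue; // skip line breaks inside a quoted field
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String csv = "\"a \"\"quoted\"\" \r\nword\"; \"b\"\r\n\"c\"";
        System.out.println(strip(csv));
    }
}
```

Row-terminating line breaks are only seen while inQuotes is false, so they pass through unchanged.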

Rohde Fischer