1

This might be hard to explain, I will do my best. I am currently working on a csv transform stream parser in nodejs, but I am struggling in replacing all \n's and \r's inside quotes (") that wrap a value.

At the moment I have the following regex:

(^|[;])"(?:""|[^"])*[\n\r]+(?:""|[^"])*"

Where ; is the column delimiter.

And here is two examples, the first one where its doing what is expected and the second one where its capturing but it shouldn't because the ; is inside quotes.

First Test (success)

test;"123";"this description with new line feed  below should be
matched by regex";test;"1.0"
 

Second Test (error)

NewLine1;"test - this one should not be captured by the regex but its being captured ";test;1
NewLine2;"test that went wrong"

Is there a way to pick the text that is between quotes, containing semicolon before first quote and containing semicolon after last quote, but ignore semicolon inside quotes? I think that's what I need , so the second example is not take into account for the regex match.

Thank you in advance.

TiagoM
  • 3,458
  • 4
  • 42
  • 83
  • 1
    Wouldn't it be easier to parse the CSV first, and then run your regex over the individual values instead? – Niet the Dark Absol Aug 06 '20 at 18:59
  • 1
    changing regex into `(^|[;])"(?:""|[^";])*[\n\r]+(?:""|[^";])*"` must work. https://regex101.com/r/2JXXWJ/1 – mjrezaee Aug 06 '20 at 19:05
  • I can't parse first @NiettheDarkAbsol, its huge, I stream it and save it line by line. – TiagoM Aug 06 '20 at 19:08
  • Thanks @mjrezaee sounds like an easy fix, I will test it with a large sample and see if I came into any other issues. Also please use that as an asnwer, so I can proper accept it later :) – TiagoM Aug 06 '20 at 19:12

1 Answers1

1

You may use:

(^|;)"(?:""|[^";])*[\n\r]+(?:""|[^";])*"

Regex Demo

I changed [;] to ; because they're equivalent in your case. Also added ; character to [^";] because your CSV stream column value, can't have this character.

I don't know why you have "" in the regex but if you seek considering other double quotes in the column value, i assume they must be escaped by \ and so you can use regex like (^|;)"(?:(?<=\\)"|[^";])*[\n\r]+(?:(?<=\\)"|[^";])*" that has (?<=\\)" instead of "" which indicates " character preceding with back slashes. (\")

mjrezaee
  • 1,100
  • 5
  • 9
  • I notice now that I may have a quoted column with new line feed and a semicolon inside that I want to still apply the regex replace to delete the new line, I believe I should only avoid the cases like HTML entities such as   | " | ' and so on to the remaining ones, is it possible to still catch the semicolon and avoid those cases? thanks – TiagoM Aug 06 '20 at 19:30
  • @TiagoM see if `(^|;)"(?:""|&[a-z0-9]+;|[^";])*[\n\r]+(?:""|&[a-z0-9]+;|[^";])*"` (https://regex101.com/r/2JXXWJ/3) works for you. `&[a-z0-9]+;` detects html entity syntax. `&[a-z0-9]+;` can be more complete based on your case. see https://stackoverflow.com/a/56490838/13900748 if you have `#x` html entity syntax. – mjrezaee Aug 06 '20 at 19:40
  • Sorry for the trouble and confusion, but going with html entity syntax actually is not a good idea now that I am aware, because if I have another column delimiter in other cases it won't work. But I think I don't have much options here, what I must enforce to the csv provider is to not send any column delimiter inside quotes otherwise it won't work with the regex you provided (^|;)"(?:""|[^";])*[\n\r]+(?:""|[^";])*" on the answer. Like you see in this link https://regexr.com/59ou7 the first two lines should be captured but not the two ones below, but there isn't a way to differentiate.. – TiagoM Aug 06 '20 at 20:33
  • @TiagoM yes, you must have some limit in your input if you want to parse regex like this, else, you can face real problems. – mjrezaee Aug 06 '20 at 20:47
  • 1
    I have an idea, I do a pre-processing with a first regex to remove any column delimiter inside quotes, and then apply your regex, that might work. – TiagoM Aug 06 '20 at 20:54