With a regex, how can I match comments that begin with a semicolon, unless the semicolon is surrounded on both sides by unescaped quotes, as shown below (the green blocks denote the matched comments)?

Sample input and output

Note that dquotes can be escaped by doubling them up (""). Such escaped dquotes behave as completely different characters, i.e. they cannot surround the semicolon and disable its comment-starting function.

Also, unbalanced dquotes are treated as escaped dquotes.

With Bubble's help, I have gotten as far as the regex below, which fails to correctly treat a trailing escaped dquote in the last test vector line.

^(?>(?:""[^""\n]*""|[^;""\n]+)*)""?[^"";\n]*(;.*)

See it run here.

Test vectors (the same as in the color-coded diagram above):

Peekaboo ; A comment starts with a semicolon and continues till the EOL
Unless the semicolon is surrounded by dquotes ”Don’t do it ; here” ;but match me; once
Im not surrounded ”so pay attention to me” ; ”peekaboo”
Im not surrounded ”so pay attention” to;me” ; ”peekaboo”
Im not surrounded ”so pay attention to me ; peekaboo
Dquote escapes a dquote so ”dont pay attention to ””me;here”” buster” do it ; here
Don’t pay attention to  ”””me;here””” but do ””it;here””
and ”dont do ””it;here”””  either ;peekaboo
but "pay attention to "it;here"" ;not here though
Simon said ”I like goats” then he added ”and sheep;” ;a good comment is ”here
Simon said ”I like goats” then he added ”and sheep;” dont do it here
Simon said ””I like goats;”peekaboo
Simon said ”I like goats;””peekaboo
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/250101/discussion-on-question-by-pavel-stepanek-match-comments-unless-the-initiating-ch). – sideshowbarker Dec 02 '22 at 22:34
  • This question is being discussed on meta. https://meta.stackoverflow.com/q/421805/ – PM 2Ring Dec 03 '22 at 15:48
  • What does *"Bubble's help"* refer to? A (deleted) comment? [The answer](https://stackoverflow.com/questions/74636553/match-comments-unless-the-initiating-character-is-surrounded-by-unescaped-quotes/74658163#74658163) (or comments to it)? – Peter Mortensen Dec 03 '22 at 15:50
  • @Peter: That is the name of the member Bobble Bubble who helped me with the original regex. All of his comments have disappeared from this comment section but he has not deleted them. – Pavel Stepanek Dec 03 '22 at 17:13
  • @Peter: Thanks for fixing the syntax highlighting for the test vector list. I was trying to do it myself but my knowledge of the markdown was not extensive enough for the task. – Pavel Stepanek Dec 03 '22 at 17:48
  • [Why should I not upload images of code/data/errors when asking a question?](https://meta.stackoverflow.com/q/285551/3404097) – philipxy Dec 04 '22 at 07:26
  • @philipxy Sure, but here the OP has posted the example data as proper text and has supplemented it with a helpful colour-coded image. That's not against the rules, IMHO. – PM 2Ring Dec 04 '22 at 07:33
  • @PM2Ring I agree that having both is OK, but then the image is generally redundant. The image isn't the same as the text though, because the image uses colour to show something, and it's good that they explain that, but it's something that could be pointed out via text. I didn't notice that the underlying text is the same & they don't say it's the same, which they should. Thanks. – philipxy Dec 04 '22 at 07:38
  • @philipxy: I have added the statement: *"Test vectors `(the same as in the color-coded diagram above)`"* I would have color-coded the text in that list with some markdown if it were possible - is it? – Pavel Stepanek Dec 04 '22 at 09:44

1 Answer

The task is to find comments that start with a ; semicolon outside quotes, taking into account "" escaped quotes and a potential non-closed quote before the semicolon. This approach works for the test cases provided so far.

Updated pattern: A shorter and more efficient variant without alternation.

^((?>(?:(?:[^"\n;]*"[^"\n]*")+(?!"))?[^"\n;]*)"?[^"\n;]*);.*

New demo at regex101

This pattern works without alternation and uses a negative lookahead to check for the last valid double quote. In both patterns the atomic group mimics possessive quantifiers to prevent any backtracking and keep the balance. With possessive quantifiers the pattern would look like this regex101 demo. [^";\n]*"?[^";\n]* is the part that allows an optional non-closed quote.


Previous pattern: This has proven reliable so far but is a little slower.

^((?>(?:(?:[^;"\n]*"(?>(?:[^"\n]+|"")*)")+)?)[^";\n]*"?[^";\n]*);.*

Old demo at regex101

"(([^"]+|"")*)" consumes either " ... " or "". This gets repeated any number of times, with any [^;"]* characters (neither ; nor ") in between. All of that happens inside an atomic group, so once the quoted parts and the non-semicolon text between them have been matched, there is no way back. After finally allowing an optional non-closed ", either a ; is found or the match fails.


The first capturing group $1 contains the part up to the targeted ; comment-start. To remove the comment, replace the full match with the captured part. If needed, capture (.*) in a second group.

| regex-part | matches |
|---|---|
| (?>...) | denotes an atomic group, used to prevent any further backtracking |
| [^...] | a negated character class matches a single character not in the list |
| (...) and (?:...) | capturing and non-capturing groups (the latter for repetition or alternation) |
| quantifiers: ? * + | ? matches zero or one (optional), * any amount and + one or more |

If replacements are done on single lines, all the \n newlines can be dropped from either pattern.

bobble bubble
  • I'd like to add that in order to make this regex work in .NET and C# all of the dquotes should be doubled up. – Pavel Stepanek Dec 02 '22 at 16:32
  • How would you wrap the found comment in constant HTML tags upon replace? `$1$2` would not work . Also, I have another comment coming up... Unfortunately, I need to hit the road now :( – Pavel Stepanek Dec 02 '22 at 16:41
  • @PavelStepanek Yes that would work but currently there is no second group :p You need to wrap `.*` at the end into another group, see [this update](https://regex101.com/r/ygfVpP/1). – bobble bubble Dec 02 '22 at 17:27
  • Oh, I added the 2nd group already but my trepidation is that the 1st group does not match everything besides the comment and that will unintentionally delete fragments of the input. Also, being a C programmer at heart I can't help but notice that this would unnecessarily replace the same string with the same string, with a bunch of unchanged characters from the 1st group - a superfluous memory copy. – Pavel Stepanek Dec 02 '22 at 22:51
  • I noticed that your regex does not utilize expressions of this form: `^"(?>("")*);` which match ONLY ODD number of dquotes. See: https://regex101.com/r/lalv6r/1 Do you think that your regex could be simplified or accelerated by using such forms ? Not that it is bad, or anything... – Pavel Stepanek Dec 02 '22 at 22:54
  • @PavelStepanek Very well spotted! Yes indeed I tried to [unroll](https://doc.lagout.org/programmation/Regular%20Expressions/Mastering%20Regular%20Expressions_%20Understand%20Your%20Data%20and%20Be%20More%20Productive%20%283rd%20ed.%29%20%5BFriedl%202006-08-18%5D.pdf#%5B%7B%22num%22%3A696%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22FitH%22%7D%2C807%5D) this too, but haven't found any solution that would maintain the same functionality. Don't worry, it won't make much difference here, it is still efficient due to the atomic groups. Regarding your other concerns, using regex here is a challenge and a tradeoff. – bobble bubble Dec 03 '22 at 00:19
  • Please help me understand: Which part of your regex is responsible for handling the "non-closed unescaped quote before the semicolon" ? BTW: I read the O'Reilly book you have linked - thanks. – Pavel Stepanek Dec 03 '22 at 18:51
  • @PavelStepanek Do you mean e.g. the `Simon said "I like goats;""peekaboo` line? It's just [the part](https://regex101.com/r/A7FqYZ/1) after the whole optional group. Btw I played with a [new more efficient version](https://regex101.com/r/Kk0kuN/2) (test it if you like and let me know). I read on meta, don't worry, sometimes it's difficult in the regex place but it got better already. I find your question interesting. I can update my answer to the new pattern if that works for your test vectors, it will be easier to explain. – bobble bubble Dec 03 '22 at 19:50
  • Yeah, `Simon said "I like goats;""peekaboo` fits that bill, as well as `Simon said ""I like goats;"peekaboo` and `Im not surrounded ”so pay attention to me ; peekaboo` and `and ”dont do ””it;here””” either ;peekaboo`. All of these cases are handled correctly by your regex - I am just trying to understand how / which part of it is responsible for handling these unbalanced dquotes. – Pavel Stepanek Dec 03 '22 at 20:08
  • OK, testing... the "new version" fails with `"I like goats;""peekaboo"` but still works with `""I like goats;"peekaboo` Something does not like to be at the beginning of the line... – Pavel Stepanek Dec 03 '22 at 20:23
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/250120/discussion-between-bobble-bubble-and-pavel-stepanek). – bobble bubble Dec 03 '22 at 20:27