Strip comments from text except for comment char between quotes

Question

I'm trying to build a regexp for removing comments from a configuration file. Comments are marked with the ; character. For example:

; This is a comment line
keyword1 keyword2 ; comment
keyword3 "key ; word 4" ; comment

The difficulty I have is ignoring the comment character when it's placed between quotes.

Any ideas?

I've created something the simple way, without a regexp, parsing each line character for character. But I'm trying to understand how to solve this kind of problem using a regexp. — user828227, Jan 26 '12 at 16:37
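
For reference, the character-by-character approach mentioned above could look roughly like the following Python sketch. The original code was not posted, so the function name `strip_comment`, its parameters, and the choice to handle only double quotes are assumptions made for illustration.

```python
def strip_comment(line, comment_char=";", quote_char='"'):
    """Return `line` without its trailing comment (hypothetical helper)."""
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == quote_char:
            # Toggle the quoted state on every quote character.
            in_quotes = not in_quotes
        elif ch == comment_char and not in_quotes:
            # First comment character outside quotes: cut the line here
            # and drop the whitespace that preceded the comment.
            return line[:i].rstrip()
    # No comment character outside quotes: return the line unchanged.
    return line
```

This sketch assumes there are no escaped quotes inside quoted strings.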

Tim Pietzcker · Answer 1 · 2012-01-26T13:35:04.263

You could try matching a semicolon only if it's followed by an even number of quotes:

;(?=(?:[^"]*"[^"]*")*[^"]*$).*

Be sure to use this regex with the Singleline option turned off and the Multiline option turned on.

In Python:

>>> import re
>>> t = """; This is a comment line
... keyword1 keyword2 ; comment
... keyword3 "key ; word 4" ; comment"""
>>> regex = re.compile(';(?=(?:[^"]*"[^"]*")*[^"]*$).*', re.MULTILINE)
>>> regex.sub("", t)
'\nkeyword1 keyword2 \nkeyword3 "key ; word 4" '
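
One consequence of counting quotes, sketched here under the assumption that lines are processed one at a time: a comment that itself contains a single (unbalanced) quote is followed by an odd number of quotes, so the lookahead fails and the comment is left in place. The helper name `strip_line` and the second sample line are made up for illustration.

```python
import re

# Same pattern as above, written as a raw string.
COMMENT = re.compile(r';(?=(?:[^"]*"[^"]*")*[^"]*$).*', re.MULTILINE)

def strip_line(line):
    """Remove the comment from a single line (hypothetical helper)."""
    return COMMENT.sub("", line)

# Balanced quotes: only the semicolon outside the quotes starts a comment.
balanced = strip_line('keyword3 "key ; word 4" ; comment')

# A lone quote inside the comment: the semicolon is followed by an odd
# number of quotes, so the lookahead fails and nothing is removed.
unbalanced = strip_line('keyword1 ; 5" disk')
```

Here `balanced` keeps the quoted semicolon and loses the comment, while `unbalanced` comes back unchanged.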

score 0 · Answer 2 · edited May 23 '17 at 11:58

I (somewhat accidentally) came up with a working regex:

replace(/^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm, '$1')

I wanted:

to remove single-line comments at the start or end of a line,
to support both single and double quotes,
to allow a lone quote inside a comment, since that's useful (and a lone " as well), so matching on a balanced (even) number of quotes after the comment delimiter, as in Tim Pietzcker's answer, was not suitable,
to leave the comment delimiter ; alone inside correctly closed quoted 'strings',
to mix quoting styles,
to handle multiple quoted strings (and comments in/after comments),
to nest single/double quotes inside double/single quoted 'strings' respectively,
to work on data like valid ini files (or assembly), as long as it doesn't contain escaped quotes, regex literals, etc.

Lacking lookbehind in JavaScript, I thought it might be an idea not to match the comments (and replace them with ''), but to match the data preceding the comment and then replace the full match with the sub-match.
One could envision this concept on a line-by-line basis (replace the full line with the match, thereby 'losing' the comment), BUT the multiline flag doesn't seem to work exactly that way (at least in the browser).

[^'";]* starts eating any characters from the 'start' that are not ', " or ;.
(Completely counter-intuitive (to me), [^'";\r\n]* will not work.)

(?:'[^']*'|"[^"]*")? is a non-capturing group matching zero or one sequence of quote, any chars, quote.
(The backreference variants (?:(['"])[^\2]*\2)? in /^((?:[^'";]*(?:(['"])[^\2]*\2)?)*)[ \t]*;.*$/gm and (?:(['"])[^\2\r\n]*\2)? in /^((?:[^'";]*(?:(['"])[^\2\r\n]*\2)?)*)[ \t]*;.*$/gm (although mysteriously better) do not work: they broke on db 'WDVPIVAlQEFQ;WzRcU',"hi;hi",0xfe,"'as. Not adding another capturing group for re-use in the match is a good thing anyway, as capturing groups come with penalties.)

The above combo is placed in a non-capturing group which may repeat zero or more times, and its result is placed in capturing group 1 to pass along.

That leaves us with [ \t]*;.* which 'simply' matches zero or more spaces and tabs, followed by a semicolon, followed by zero or more chars that are not a newline. Note how ; is NOT optional!
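
The same "keep what precedes the comment" idea can be sketched in Python. This is a restructured variant, not the exact JavaScript pattern above: each alternative consumes either one ordinary character or one whole quoted string, and nothing is allowed to cross a newline. The names `KEEP_BEFORE_COMMENT` and `strip_comments` are assumptions for illustration, and the same caveat applies (no escaped quotes).

```python
import re

# Group 1 collects ordinary characters and complete quoted strings;
# the first semicolon outside a closed quoted string starts the comment.
KEEP_BEFORE_COMMENT = re.compile(
    r"""^((?:[^'";\n]|'[^'\n]*'|"[^"\n]*")*)[ \t]*;.*$""",
    re.MULTILINE,
)

def strip_comments(text):
    """Drop ;-comments from every line of `text` (hypothetical helper)."""
    # Replace each matched line with group 1 (the data before the comment).
    stripped = KEEP_BEFORE_COMMENT.sub(r"\1", text)
    # Optional trim of trailing spaces and tabs, as in the demo.
    return re.sub(r"[ \t]+$", "", stripped, flags=re.MULTILINE)
```

A line with an unclosed quote before the first semicolon is left untouched by this variant, since group 1 cannot get past the lone quote.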

To get a better idea of how this (multi-line parameter) works, hit the exp button in the demo below.

function demo(){
  var elms=document.getElementsByTagName('textarea');
  var str=elms[0].value;
  elms[1].value=str.replace( /^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm
                           , '$1'
                           )
                   .replace( /[ \t]*$/gm, ''); //optional trim
}


function demo_exp(){
  var elms=document.getElementsByTagName('textarea');
  var str=elms[0].value;
  elms[1].value=str.replace( /^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm
                           , '**S**$1**E**'  //to see start and end of match.
                           );
}

<textarea  style="width:98%;height:150px" onscroll="this.nextSibling.scrollTop=this.scrollTop;">
; This is a comment line
keyword1 keyword2 ; comment
keyword3 "key ; word 4" ; comment

  
"Text; in" and between "quotes; plus" semicolons; this is the comment
  
  ; This is a comment line
  keyword1 keyword2 ; comment
  keyword3 'key ; word 4' ; comment and one quote ' ;see it?
  
_b64decode:
        db    0x83,0xc6,0x3A ; add   si, b64decode_end - _b64decode ;39
        push  'a'   
        pop   di 
  
        cmp   byte [si], 0x2B ; '+'


b64decode_end:
        ;append base64 data here
        ;terminate with printable character less than '+'
        db 'WDVPIVAlQEFQ;WzRcU',"hi;hi",0xfe,"'as;df'" ;'haha"
;"end'
  
</textarea><textarea style="width:98%;height:150px" onscroll="this.previousSibling.scrollTop=this.scrollTop;">
result here
</textarea>
<br><button onclick="demo()">remove comments</button><button onclick="demo_exp()">exp</button>

Hope this helps.

PS: Please comment with valid examples if and where this might break! Since I generally agree (from extensive personal experience) that it is impossible to reliably remove comments using regex (especially for higher-level programming languages), my gut is still saying this can't be fool-proof. However, I've been throwing existing data and crafted 'what-ifs' at it for over 2 hours and couldn't get it to break (and breaking things is something I'm usually very good at).

score 0 · Answer 3 · answered Jan 26 '12 at 13:27

No regex :)

$ grep -E -v '^;' input.txt
keyword1 keyword2 ; comment
keyword3 "key ; word 4" ; comment

The `; comment` in the two output lines should be removed, too. – Felix Yan Jan 26 '12 at 13:29

This also doesn't work if the comment is preceded by whitespace (or, for that matter, anything). – Borealid Jan 26 '12 at 15:52

score 0 · Answer 4 · answered Jan 26 '12 at 13:28

You may use a regexp to pull all the strings out first, replace them with some placeholder, then simply cut off everything matching ;.* and finally put the strings back :)

Felix Yan

Are you serious? You'd need to keep track exactly which strings you took from which positions, calculate how these positions change after you've removed the comments, and re-insert them? – Tim Pietzcker Jan 26 '12 at 13:30
No, there's no need to keep track of positions. As I already said, put in a placeholder such as `[STRING#1]` or anything else that would not conflict with the original text. – Felix Yan Jan 26 '12 at 13:32
Or, this post http://stackoverflow.com/questions/2785755/how-to-split-but-ignore-separators-in-quoted-strings-in-python gives a much better way, but the key point is the same: split out the strings first – Felix Yan Jan 26 '12 at 13:34
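
A minimal Python sketch of the placeholder idea discussed above, under a few assumptions: only double-quoted strings are handled, strings do not span lines or contain escaped quotes, and the NUL character never appears in the input (it is used here as the placeholder delimiter instead of `[STRING#1]`). The function name is hypothetical.

```python
import re

def strip_comments_with_placeholders(text):
    """Mask quoted strings, cut comments, then restore the strings."""
    strings = []

    def stash(match):
        # Remember the quoted string and leave a numbered placeholder.
        strings.append(match.group(0))
        return f"\x00{len(strings) - 1}\x00"

    # 1. Replace every quoted string with a placeholder.
    masked = re.sub(r'"[^"\n]*"', stash, text)
    # 2. Any remaining semicolon starts a comment; cut to end of line.
    masked = re.sub(r";.*", "", masked)
    # 3. Put the surviving strings back.
    return re.sub(r"\x00(\d+)\x00", lambda m: strings[int(m.group(1))], masked)
```

No position bookkeeping is needed because each placeholder carries the index of its own string; placeholders that sat inside a removed comment simply disappear with it.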

score 0 · Answer 5 · answered Jan 26 '12 at 13:30

Something like this:

("[^"]*")*.*(;.*)

First, match any number of quoted strings, then match a ;. If the ; is between quotes, it will be matched by the first group, not by the second group.

Sjoerd

What if the line is `"Text; in" and between "quotes; plus" semicolons; this is the comment`? – Tim Pietzcker Jan 26 '12 at 13:41
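
To see how the pattern from this answer behaves, here is a small Python check (a sketch; the variable names are made up). Because `.*` is greedy, group 2 ends up at the last semicolon on the line, whether or not that semicolon sits inside quotes.

```python
import re

PATTERN = re.compile(r'("[^"]*")*.*(;.*)')

# Line from the comment above: the last semicolon happens to be the
# real comment delimiter, so group 2 captures the comment.
m1 = PATTERN.search('"Text; in" and between "quotes; plus" semicolons; this is the comment')
last_is_comment = m1.group(2)

# Line whose only semicolon is inside quotes: group 1 cannot start at
# 'k', so .* runs to the end and backtracks to the quoted semicolon.
m2 = PATTERN.search('keyword3 "key ; word 4"')
quoted_semicolon = m2.group(2)
```

In the second case `quoted_semicolon` holds text from inside the quoted string, so removing group 2 would damage the data.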
