I (somewhat accidentally) came up with a working regex:
replace(/^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm, '$1')
I wanted:
- to remove single-line comments at the start or at the end of a line,
- to use single and double quotes,
- the ability to have just one quote in a comment: that's useful (but accept " as well), so matching on a balanced set (even number) of quotes after a comment delimiter, as in Tim Pietzcker's answer, was not suitable,
- to leave the comment delimiter ; alone inside correctly closed quoted 'strings',
- to mix quoting styles,
- to handle multiple quoted strings (and comments in/after comments),
- to nest single/double quotes in double/single quoted 'strings' respectively,
- data to work on is like valid ini-files (or assembly), as long as it doesn't contain escaped quotes or regex-literals etc.
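To make that wish list concrete, here is a minimal sketch that runs the regex from the top over a few single lines taken from the demo data. The helper name strip is hypothetical (it is not part of the original snippet), and the trailing-whitespace trim is the same optional step the demo uses.

```javascript
// Hypothetical helper: remove the ;-comment from one line, then trim what is left.
function strip(line) {
  return line
    .replace(/^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm, '$1')
    .replace(/[ \t]*$/gm, '');
}

// A ; inside a closed quoted string is left alone; a lone quote in the comment is fine.
console.log(strip("keyword3 'key ; word 4' ; comment and one quote ' ;see it?"));

// Mixed quoting styles, several strings, and quotes nested in the other style.
console.log(strip("db 'WDVPIVAlQEFQ;WzRcU',\"hi;hi\",0xfe,\"'as;df'\" ;'haha\""));

// A whole-line comment becomes an empty line.
console.log(strip('; This is a comment line'));
```

The three calls should print the two statements without their comments, followed by an empty line.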
Lacking lookbehind in JavaScript, I thought it might be an idea not to match the comments themselves (and replace them with ''), but to match the data preceding the comment and then replace the full match with the sub-match.
One could envision this concept on a line-by-line basis (so replace the full line with the match, thereby 'losing' the comment), BUT the multiline parameter doesn't seem to work exactly that way (at least in the browser).
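A small sketch of that idea, using the regex from the top on three lines borrowed from the demo data (the variable names are mine). It also shows why this is not strictly line by line: [^'";]* happily eats newlines, so a match that starts on a comment-free line runs on to the first ; outside quotes on a later line.

```javascript
var re = /^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm;

var input = [
  'keyword1 keyword2 ; comment',
  "push 'a'",
  "cmp byte [si], 0x2B ; '+'"
].join('\n');

// Full matches: three lines, but only two matches. The second one starts on
// the comment-free "push" line and spans the newline up to the next comment.
var matches = input.match(re);
console.log(matches.length);
console.log(JSON.stringify(matches[1]));

// Replacing each full match with group 1 keeps everything except the comments;
// the second replace is the optional trim of trailing spaces and tabs.
var output = input.replace(re, '$1').replace(/[ \t]*$/gm, '');
console.log(output);
```

Since group 1 carries the comment-free lines along unchanged, the result is the same as if each line had been handled separately.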
[^'";]* starts eating any characters from the 'start' that are not ', " or ;.
(Completely counter-intuitive (to me), [^'";\r\n]* will not work.)
(?:'[^']*'|"[^"]*")? is a non-capturing group matching zero or one set of: quote, any chars, quote.
(The variants (?:(['"])[^\2]*\2)? in /^((?:[^'";]*(?:(['"])[^\2]*\2)?)*)[ \t]*;.*$/gm or (?:(['"])[^\2\r\n]*\2)? in /^((?:[^'";]*(?:(['"])[^\2\r\n]*\2)?)*)[ \t]*;.*$/gm (although mysteriously better) do not work: they broke on db 'WDVPIVAlQEFQ;WzRcU',"hi;hi",0xfe,"'as
Not adding another capturing group for re-use in the match is a good thing anyway, as capturing groups come with penalties.)
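One plausible explanation for the failing back-reference variants (my reading, not something the tests above confirm): inside a character class JavaScript does not treat \2 as a back-reference. Without the u flag it is read as an escape for the control character U+0002, so [^\2] matches practically everything, including both kinds of quote. The sketch below uses \1 because its stand-alone patterns have only one group; the lookahead form at the end is a common alternative idiom, shown for comparison only and not tested inside the full comment regex.

```javascript
// Inside [...] the \1 is not a back-reference but the character U+0001.
var notBackref = /^[^\1]$/;
console.log(notBackref.test("'"));     // a quote is accepted by the class
console.log(notBackref.test('\x01'));  // only U+0001 is excluded

// So the "quoted string" runs greedily over closing quotes and semicolons
// up to the LAST matching quote.
var runaway = /(['"])[^\1]*\1/.exec("'a' ; 'b'");
console.log(runaway[0]);

// Alternative idiom (an assumption, not from the original answer):
// a negative lookahead stops at the first quote equal to the opening one.
var bounded = /(['"])(?:(?!\1).)*\1/.exec("'a' ; 'b'");
console.log(bounded[0]);
```

The explicit alternation '[^']*'|"[^"]*" used in the working regex sidesteps the problem entirely, at the cost of spelling out both quote styles.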
The above combo is placed in a non-capturing group which may repeat zero or more times, and its result is placed in capturing group 1 to pass along.
That leaves us with [ \t]*;.* which 'simply' matches zero or more spaces and tabs followed by a semicolon, followed by zero or more chars that are not a newline. Note how ; is NOT optional!
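Because the ; is mandatory, text without any comment delimiter produces no match at all and passes through untouched. A quick check, assuming the same regex (the sample text is made up from two short demo lines):

```javascript
var re = /^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm;

// Two short lines, neither of which contains a semicolon.
var noComment = "push 'a'\npop di";

// No semicolon means no match, so replace() has nothing to substitute.
var found = noComment.match(re);
var result = noComment.replace(re, '$1');

console.log(found);                // null
console.log(result === noComment); // true
```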
To get a better idea of how this (multi-line parameter) works, hit the exp button in the demo below.
function demo(){
  var elms = document.getElementsByTagName('textarea');
  var str = elms[0].value;
  elms[1].value = str
    .replace(/^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm, '$1') // strip comments
    .replace(/[ \t]*$/gm, ''); // optional trim of trailing spaces and tabs
}
function demo_exp(){
  var elms = document.getElementsByTagName('textarea');
  var str = elms[0].value;
  elms[1].value = str.replace(
    /^((?:[^'";]*(?:'[^']*'|"[^"]*")?)*)[ \t]*;.*$/gm,
    '**S**$1**E**' // to see start and end of each match
  );
}
<textarea style="width:98%;height:150px" onscroll="this.nextSibling.scrollTop=this.scrollTop;">
; This is a comment line
keyword1 keyword2 ; comment
keyword3 "key ; word 4" ; comment
"Text; in" and between "quotes; plus" semicolons; this is the comment
; This is a comment line
keyword1 keyword2 ; comment
keyword3 'key ; word 4' ; comment and one quote ' ;see it?
_b64decode:
db 0x83,0xc6,0x3A ; add si, b64decode_end - _b64decode ;39
push 'a'
pop di
cmp byte [si], 0x2B ; '+'
b64decode_end:
;append base64 data here
;terminate with printable character less than '+'
db 'WDVPIVAlQEFQ;WzRcU',"hi;hi",0xfe,"'as;df'" ;'haha"
;"end'
</textarea><textarea style="width:98%;height:150px" onscroll="this.previousSibling.scrollTop=this.scrollTop;">
result here
</textarea>
<br><button onclick="demo()">remove comments</button><button onclick="demo_exp()">exp</button>
Hope this helps.
PS: Please comment with valid examples if and where this might break! Since I generally agree (from extensive personal experience) that it is impossible to reliably remove comments using regex (especially for higher-level programming languages), my gut is still saying this can't be fool-proof. However, I've been throwing existing data and crafted 'what-ifs' at it for over 2 hours and couldn't get it to break (which I'm usually very good at).