
This is a question I've solved and wanted to post in Q&A style because I think more people could use the solution, or maybe improve it and show where it breaks.

The problem

You want to do something with quoted strings and/or comments in a body of text: extract them, highlight them, what have you. But some quoted strings are inside comments, and sometimes comment characters are inside strings. String delimiters can be escaped, and comments can be line comments or block comments. And just when you thought you had a solution, somebody complains that it doesn't work when there's a regex literal in their JavaScript. What do you do?

Concrete example

var ret = row.match(/'([^']+)'/i); // Get 1st single quoted string's content
if (!ret) return ''; /* return if there's no matches 
                        Otherwise turn into xml: */
var message = '\t<' + ret[1].replace(/\[1]/g, '').replace(/\/@(\w+)/i, ' $1=""') + '></' + ret[1].match(/[A-Z_]\w*/i)[0] + '>';

alert('xml: \'' + message + '\''); /*
alert("xml: '" + message + "'"); // */

var line = prompt('How do line-comments start? (e.g. //)', '//');

// do something with line

This code is nonsense, but how do I correctly pick out every string, comment and regex literal in the JavaScript above?

The only thing I found that comes close is the question "Comments in string and strings in comments", where Jan Goyvaerts himself answered with a similar approach. But that one doesn't handle apostrophe-escaping yet.

(I'm noticing StackOverflow is doing a pretty good job highlighting the above, wondering if they use something similar already) – asontu Aug 20 '14 at 10:16

1 Answer


I've broken the regex into 4 lines, corresponding to the 4 paths in the graph. Don't keep those line breaks in there if you ever use this.

(['"])(?:(?!\1|\\).|\\.)*\1|
\/(?![*/])(?:[^\\/]|\\.)+\/[igm]*|
\/\/[^\n]*(?:\n|$)|
\/\*(?:[^*]|\*(?!\/))*\*\/

[Figure: regular expression visualization (Debuggex Demo)]

This regex matches 4 types of "blocks", each of which can contain the delimiters of the other 3. You can iterate through the matches and do whatever you want with each one, or discard it because it's not the kind you want to act on.
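As a rough illustration of iterating over the matches, here is a minimal sketch. The names BLOCKS and findBlocks and the type labels are made up for this example, and telling the block types apart by their first characters is just one way to do it, not something the pattern requires.

```javascript
// The four alternatives, joined into a single pattern without line breaks.
// BLOCKS and findBlocks are illustrative names, not part of any library.
const BLOCKS = new RegExp(
  [
    /(['"])(?:(?!\1|\\).|\\.)*\1/,         // quoted string
    /\/(?![*\/])(?:[^\\\/]|\\.)+\/[igm]*/, // regex literal
    /\/\/[^\n]*(?:\n|$)/,                  // line comment
    /\/\*(?:[^*]|\*(?!\/))*\*\//           // block comment
  ].map(function (re) { return re.source; }).join('|'),
  'g'
);

// Returns every block found in `source`, in order of appearance,
// labelled by looking at how the matched text starts.
function findBlocks(source) {
  const blocks = [];
  let m;
  BLOCKS.lastIndex = 0; // the "g" flag makes exec() stateful, so reset first
  while ((m = BLOCKS.exec(source)) !== null) {
    const text = m[0];
    let type;
    if (text[0] === "'" || text[0] === '"') type = 'string';
    else if (text.startsWith('//')) type = 'line-comment';
    else if (text.startsWith('/*')) type = 'block-comment';
    else type = 'regex';
    blocks.push({ type: type, text: text, index: m.index });
  }
  return blocks;
}

// Example: keep only the comments and ignore everything else.
const sample = "var ret = row.match(/'([^']+)'/i); // first\nif (!ret) return ''; /* none */";
const comments = findBlocks(sample).filter(function (b) {
  return b.type === 'line-comment' || b.type === 'block-comment';
});
console.log(comments.map(function (b) { return b.text; }));
```

Because every match is consumed whole, the apostrophes inside the regex literal and any quotes inside the comments never get a chance to start a string of their own.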

This one is specific to JavaScript, as it's a language I'm familiar with, but you could easily adapt it to the language of your preference.

Anyone see a way in which this code breaks?
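For what it's worth, here is a small sketch of two inputs where I'd expect the pattern to misfire. The helper name blocksIn and the sample lines are made up for illustration; the pattern is the one above, joined into one line.

```javascript
// The pattern from above, written as one regex literal without line breaks.
const pattern = /(['"])(?:(?!\1|\\).|\\.)*\1|\/(?![*\/])(?:[^\\\/]|\\.)+\/[igm]*|\/\/[^\n]*(?:\n|$)|\/\*(?:[^*]|\*(?!\/))*\*\//g;

// Illustrative helper: returns the text of every block the pattern finds.
function blocksIn(source) {
  return source.match(pattern) || [];
}

// Two division operators on one line get mistaken for a regex literal.
const division = blocksIn('var avg = total / count + bonus / 2;');
console.log(division); // → [ '/ count + bonus /' ]

// An unescaped slash inside a character class ends the regex literal too early.
const charClass = blocksIn('var parts = path.split(/[/]/);');
console.log(charClass); // → [ '/[/' ]
```

Both cases come down to the same thing: the pattern has no notion of what precedes a slash, so it can't tell a division operator from the start of a regex literal, and it doesn't track character classes inside a regex literal.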

Edit: I have since been notified that the general pattern is described very well here: https://stackoverflow.com/a/23589204/2684660, neato!
