
I'm trying to brush up on my Objective-C knowledge and I'm writing a personal HTML page for my notes. I got a little sidetracked and created a basic JavaScript syntax highlighter to highlight the code I'm writing on the page! It's all going well apart from detecting NSStrings. I have used regexes to detect comments, keywords, etc., but I just can't figure out how to pick up and highlight the NSString content when it may contain escaped quotes, like below:

NSString = @"Hello \" world \" string";

So far I have

@"(([^"])*)"

which just stops at the first " character. How can I get it to skip over a " that is preceded by a backslash?

Paul Reed
  • [The `<center>` will not hold it is too late.](http://stackoverflow.com/a/1732454/1348195) – Benjamin Gruenbaum Jun 30 '14 at 21:25
  • Do not use regular expressions to do this. Using a single regular expression will not work - what if the string is in comments? What if it's nested? What if the html contains HTML inside attributes? The possibilities are endless. – Benjamin Gruenbaum Jun 30 '14 at 21:27
  • Surely it has to be possible. I looked into using Prettify but wanted to try and do it myself. Prettify's code is quite complex and it doesn't lend itself to learning from very well! – Paul Reed Jun 30 '14 at 21:32
  • HTML and JavaScript are _not_ regular languages. Regular languages are those a finite state machine can parse; intuitively, they need only "finite memory". Since HTML and JS can be arbitrarily nested, coloring them with something regular is impossible (one can prove this formally via the pumping lemma). You can write your own parser (possibly recursive descent) for this task rather easily, or, in your case, simply iterate over the code and remember in a state variable whether you're inside a string. – Benjamin Gruenbaum Jun 30 '14 at 21:34
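
The state-variable idea from the last comment could look roughly like this. It is only a minimal sketch: `findNSStrings` is a made-up name, and it assumes comments have already been dealt with elsewhere, so it looks only for `@"..."` literals.

```javascript
// Hypothetical helper: returns [start, end) index pairs of @"..." literals.
function findNSStrings(source) {
  const ranges = [];
  let inString = false; // the state variable
  let start = -1;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (!inString) {
      // An NSString literal opens with @"
      if (ch === "@" && source[i + 1] === '"') {
        inString = true;
        start = i;
        i++; // step over the opening quote
      }
    } else if (ch === "\\") {
      i++; // step over the escaped character, whatever it is
    } else if (ch === '"') {
      inString = false;
      ranges.push([start, i + 1]);
    }
  }
  return ranges; // an unterminated literal is simply not reported
}

const line = 'NSString = @"Hello \\" world \\" string";';
const found = findNSStrings(line).map(([s, e]) => line.slice(s, e));
console.log(found); // [ '@"Hello \\" world \\" string"' ]
```

Because the backslash branch always skips the next character, an escaped quote can never close the string, and no lookbehind is needed.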

2 Answers


The simple way to handle escapes is like so:

(?:\\.|[^\\"])+

Where \\ becomes a literal backslash, and " is your close quote.

What this does is match either an escaped character, or anything but backslashes and quotes. This means that it skips over quotes that are preceded by a backslash, but it also handles \\\\\\" correctly (hint: that's three escaped backslashes followed by a real close quote).

Feel free to plug in this little gem wherever you need to handle escapes!
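
As a rough illustration, here is the fragment dropped between the `@"` and `"` delimiters from the question. The variable names are made up, and `*` replaces `+` on the assumption that the empty literal `@""` should match too.

```javascript
// The fragment from the answer, wrapped in the @" ... " delimiters.
// Assumption: * instead of + so that the empty literal @"" also matches.
const nsString = /@"((?:\\.|[^\\"])*)"/g;

// Source text being scanned (backslashes doubled for the JS string literal):
//   NSString = @"Hello \" world \" string"; x = @"a\\"; y = @"";
const code = 'NSString = @"Hello \\" world \\" string"; x = @"a\\\\"; y = @"";';

const matches = code.match(nsString);
console.log(matches.length); // 3
console.log(matches[1]);     // @"a\\"  (escaped backslash, then the real close quote)
```

The two alternatives can never match at the same position (one needs a backslash, the other forbids it), so the engine has only one way to walk through a string body.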

Niet the Dark Absol
  • This is for detecting strings __inside JavaScript__, for example. It will still find the following string `// "Hello World"`, although that is not a JavaScript string but a comment. Not to mention fun stuff like `/*"/*""` and so on. – Benjamin Gruenbaum Jun 30 '14 at 21:29
  • @BenjaminGruenbaum No, this is a regex fragment that can be applied in *any* case where the need to match strings arises. I, for example, use it in my custom BBCode parser script. Not to mention, I would imagine the colour-coding parser would have already processed comment blocks. – Niet the Dark Absol Jun 30 '14 at 21:31
  • Mine parses comment blocks; at least it works on my HTML notes page. It won't go public, so if it breaks it's not the end of the world for me! I used '/(\/\*[^\*]*\*\/)/'. I just can't adapt it for strings, as I only want to count a '"' character if it's not preceded by a '\', and from what I have found so far JavaScript doesn't have the negative lookbehind I think I need. – Paul Reed Jun 30 '14 at 21:41
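
One possible way around the strings-inside-comments concern raised above is to match comments and strings with a single alternation in one pass. This is only a sketch: `token`, `highlight`, and the CSS class names are invented for illustration, and HTML-escaping of the source is left out.

```javascript
// Alternatives are tried left to right at each position, so a quote inside a
// comment is consumed as part of the comment and never starts a string,
// and a /* inside a string is consumed as part of the string.
const token = /(\/\*[\s\S]*?\*\/|\/\/[^\n]*)|(@"(?:\\.|[^\\"])*")/g;

// Hypothetical helper: wraps comments and strings in <span> tags.
function highlight(source) {
  return source.replace(token, (match, comment) =>
    comment !== undefined
      ? '<span class="comment">' + match + "</span>"
      : '<span class="string">' + match + "</span>"
  );
}

const sample = '// @"not a string"\nNSString *s = @"a /* b */ c";';
console.log(highlight(sample));
// <span class="comment">// @"not a string"</span>
// NSString *s = <span class="string">@"a /* b */ c"</span>;
```

Since each token is matched exactly once, there should be no need to strip nested spans out of strings and comments afterwards.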

You can use this: (?=((?:[^"\\]+|\\.)*))\1 instead of [^"]*

Note that since the dot doesn't match newlines by default, you need singleline mode to match "escaped" newlines (or, where that mode is unavailable, replace the dot with [\s\S]).

You need an emulated atomic group because, with a plain non-capturing group around this pattern, a string whose double quotes are not balanced will cause catastrophic backtracking before the match fails.
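
A small sketch of how this might behave in JavaScript. The variable names are made up, and the `[\s\S]` variant is one assumed way to cover escaped newlines without a singleline flag.

```javascript
// The lookahead captures the whole string body into group 1 in one go;
// \1 then consumes exactly that text. Once the lookahead has succeeded,
// the engine does not backtrack into it.
const atomic = /@"(?=((?:[^"\\]+|\\.)*))\1"/;

const ok = atomic.exec('NSString = @"Hello \\" world \\" string";');
console.log(ok[0]); // @"Hello \" world \" string"
console.log(ok[1]); // Hello \" world \" string

// Unbalanced quotes: the match fails without retrying every way of
// splitting the run of a's between iterations of the inner group.
const unbalanced = '@"' + "a".repeat(40);
console.log(atomic.test(unbalanced)); // false

// Variant with [\s\S] in place of the dot, so backslash + newline is accepted.
const atomicAny = /@"(?=((?:[^"\\]+|\\[\s\S])*))\1"/;
const continued = '@"line one\\\nline two"';
console.log(atomic.test(continued));    // false
console.log(atomicAny.test(continued)); // true
```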

Casimir et Hippolyte
  • I don't think JavaScript supports once-only subpatterns. I could be wrong, though. See my answer for a catastrophic-backtracking-free solution ;) – Niet the Dark Absol Jun 30 '14 at 21:31
  • Hmm, I don't think that'll work either. `\1` is used to match exactly what the first subpattern matched, so... hmm... Actually, I'm not even sure what your regex does now XD – Niet the Dark Absol Jun 30 '14 at 21:34
  • @NiettheDarkAbsol: You can be sure it works, `(?=(subpattern))\1` is the way to emulate an atomic group in JavaScript or Python, since the content of a lookahead is naturally atomic. – Casimir et Hippolyte Jun 30 '14 at 21:36
  • Aah, I see. Use a lookahead to put it into `\1`, then say "okay, now match it". Gotcha. – Niet the Dark Absol Jun 30 '14 at 21:37
  • @NiettheDarkAbsol: Exactly. It is a totally artificial way, but if you use this trick with Python or JavaScript, you can see a notable performance gain (even though it looks like heavy artillery) and you avoid the catastrophic backtracking problems you can have with a simple non-capturing group. – Casimir et Hippolyte Jun 30 '14 at 21:48
  • So this `@"((?=((?:[^"\\]+|\\.)*))\1)"` is what you're saying should work? I can't get it to pick up anything? – Paul Reed Jun 30 '14 at 21:57
  • @PaulReed: not exactly. Here is an example: `console.log(/@"(?=((?:[^"\\]+|\\.)*))\1"/.exec('NSString = @"Hello \\" world \\" string";'));` *(keep in mind that `\1` refers to the first capturing group; if you open a capture group before the one that is inside the lookahead, you must increment the group reference)* – Casimir et Hippolyte Jun 30 '14 at 22:48
  • You are indeed correct, I figured out why it wasn't working. Before adding new spans to strings and comments I'm supposed to remove other spans within them so strings and comments don't contain other highlighting; the one for the string wasn't working! – Paul Reed Jun 30 '14 at 22:53
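
To illustrate the note about group numbering in the comments above: as far as I can tell, wrapping the body in an extra capturing group shifts the lookahead's capture to group 2, so the backreference has to become `\2`. The names `wrong` and `right` are just labels for this sketch.

```javascript
const input = 'NSString = @"Hello \\" world \\" string";';

// The outer group is number 1, so the lookahead's capture is number 2.
// Here \1 points at the outer group while it is still open; in JavaScript
// such a backreference matches the empty string, so only @"" can match.
const wrong = /@"((?=((?:[^"\\]+|\\.)*))\1)"/;
console.log(wrong.test(input)); // false
console.log(wrong.test('@""')); // true

// With the reference incremented, group 1 holds the string body.
const right = /@"((?=((?:[^"\\]+|\\.)*))\2)"/;
console.log(right.exec(input)[1]); // Hello \" world \" string
```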