13

We all know by now that parsing HTML using regular expressions is not possible in general, since it'd be parsing a context-sensitive grammar while regular expressions can only parse regular grammars. The same is certainly true for other programming languages.

Now, recently, Rainbow.js syntax highlighter has been announced. Its premise is described as very simple:

Rainbow on its own is very simple. It goes through code blocks, processes regex patterns, and wraps matching patterns in tags.

I figured syntax highlighting is essentially a task of the same complexity as language parsing, if we assume it has to be both good and suitable for many languages. Still, while there is quite a bit of criticism of that library, neither that nor the HackerNews discussion (taken as an example for a discussion by technically-inclined) have mentioned that highlighting syntax using regular expressions is basically impossible in a general case, which I'd consider a major, show-stopping flaw.

Now the question is: is there something I'm missing? In particular:

  1. Is syntax highlighting with regular expressions possible in general?
  2. Is this an instance of an applied 80/20 rule, where just enough is possible with regular expressions to be useful?
Nikolai Prokoschenko
  • 8,465
  • 11
  • 58
  • 97

4 Answers4

15

Syntax highlighting using regexp is an old art. I think even Emacs and vi started this way.

I figured syntax highlighting is essentially a task of the same complexity as language parsing,[...]

No. The difference is: the compiler needs real parsing because it needs to understand the complete program and also needs to generate stuff from that understanding. Syntax highlighting, on the other hand, does not need to understand the code. It merely needs to understand the general structure of the language - what are string literals - what are keywords ... and so on. A side effect of this difference is: you can highlight code which is syntactically incorrect, but you cannot parse it.

A slightly different approach to this: Parsing a language is often a two-step process: lexing (splitting up the byte stream into a "token" stream) and real parsing (bring the token stream into some complex structure - often an Abstract Syntax Tree). Lexing is usually done using ---- regular expressions. See the flex docs for this. And that's basically all a basic syntax highlighter needs to understand.

Of course there are corner cases which regexp alone cannot catch. A typical example is:

foo(bla, bar);

Here foo might be a call to a static method or an instance method or a macro or something else. But your regexp highlighter cannot deduce this. It can only add colors for a "general call".

So: This is a 100/0 percent rule if your requirements are low-level (i.e. without the above example) and typically a 90/10 rule for real world stuff.

Don Hatch
  • 5,041
  • 3
  • 31
  • 48
A.H.
  • 63,967
  • 15
  • 92
  • 126
3

You can do syntax highlighting using regular expressions as part of the solution. More specifically, as part of the "lexer" that chunks the input text into symbols. This is actually the way most compilers/interpreters work.

To do it using regex alone, though, is asking for trouble. Consider the case of matching a string in Python. Python allows strings to be delimited by either single quotes ' or double quotes ". Additionally, it allows multi-line strings ("heredoc syntax") using triple-quotes, ''' or """.

So which parts of the following are strings, and which aren't? Can you construct a regular expression that correctly identifies the string literals str1-str6?

    str1 = "hello, world!"

    str2 = 'hello, world!'

    str3 = "The canonical test program is 'Hello World'."

    str4 = '"Why," Peter said, "That\'s ludicrous. Who would do that?"'

    str5 = """The heredoc syntax is handy for cases where you don't want to escape strings. "Very convenient."
    """

    str6 = """Code sample:
    s1 = "hi!"
    s2 = 'Hi!'
    S3 = '''
    - apples
    - oranges
    - bananas
    '''
    """

The argument that "you can't (parse HTML|process programs) with regex because (HTML|programming languages) have nested structures - they aren't regular" isn't quite true - modern regular expressions (particularly in Perl) have more expressive power than strictly regular expressions in the computer-science sense. But just because you can use regular expressions doesn't mean you should.


Edit: the string-matching problem above isn't too bad if your regex flavour supports backreferences in the search pattern. A multi-line regex like ('|"|'''|""").+?\1 would probably do.


Edit 2: For an example of the corner cases in syntax hilighting, look no further than StackOverflow's syntax highlighting of the code above.

AsukaMinato
  • 1,017
  • 12
  • 21
Li-aung Yip
  • 12,320
  • 5
  • 34
  • 49
  • Right, because a lexer defines a context-free grammar (CFG), but a regular expression can only define a regular grammar (RG). – Daniel Mar 31 '12 at 12:58
  • 1
    Use of a "lexer" doesn't necessarily imply a context-free grammar - lexical analysis is something you can do to all types of languages. That said, most programming languages do have context-free grammars. – Li-aung Yip Mar 31 '12 at 13:03
  • 1
    I doubt you even need any fancy features not included in the 19th century mathematical definition of regexes. The back-reference you give can be easily replaced because it can only be once of four things - `('...')|("...")|('''...''')|("""...""")`. What precisely can go into each kind of string is described in the [Python language reference](http://docs.python.org/reference/lexical_analysis.html#string-literals) and seems simple enough to translate to regex. –  Mar 31 '12 at 13:10
  • 1
    Is `str2` meant to finish in a `'`? – huon Mar 31 '12 at 13:10
  • @delnan : You're correct, of course. I will yield that the specific example I gave is probably a non-issue. But there are plenty of other corner cases to deal with (ex. A.H.'s answer.) – Li-aung Yip Mar 31 '12 at 13:12
  • @dbaupp : yes, actually. M'bad. But SO's syntax highlighter still trips on the `"..."` inside the `"""..."""` - probably because it's not applying python-specific rules. – Li-aung Yip Mar 31 '12 at 13:13
  • @Li-aungYip My point is not that the examples you gave are easy. I'm claiming the entire Python string literal syntax is practical to do with a regular expression. Please check out the documentation I linked to. I think I'll also go digging in PyPy's source, I believe they actually generate their tokenizer from regular expressions. –  Mar 31 '12 at 13:14
  • PyPy indeed uses DFAs to generate the entirety of their tokenizer. See [`genpytokenize.py`](https://bitbucket.org/pypy/pypy/src/default/pypy/interpreter/pyparser/genpytokenize.py#cl-146) which seems involved and repetive, but 100% regular and accurate. –  Mar 31 '12 at 13:19
  • Very interesting to see the inside of PyPy's lexer. Makes sense. You wouldn't express a lexer entirely in regular expression syntax because it would be too unwieldy to maintain - but it's precisely equivalent to a regular expression in the end. A very complicated and tedious regex, but equivalent nonetheless. – Li-aung Yip Mar 31 '12 at 13:21
  • How should the regular expression parse and highlight invalid strings such as `str = ' ''' '''`? Granted, for most code, a regular expression is probably good enough... – Daniel Mar 31 '12 at 15:47
  • @Daniel: One of the useful things about syntax highlighters is that they make it obvious when you type something syntactically wrong. Barfing on inputs like this is actually *desirable*. – Li-aung Yip Mar 31 '12 at 16:07
  • @Li-aung Yip I agree, and I add the point that the highlighter needs to make obvious where you typed something wrong and how the compiler is trying to compile it. In other words, it needs to highlight the code in a way consistent with how the compiler tokenizes the code. – Daniel Mar 31 '12 at 16:34
  • @Li-aungYip: [highlight.js](https://github.com/isagalaev/highlight.js) correctly highlight all Python string literals. It is possible by several regexp's, if you try to use them in right order: previosly try to match """string""" and next "string". – Sannis Apr 14 '12 at 12:36
2

Basically, no.

You need a parser/tokenizer that understands the language to pick out which bits to highlight.

Regex doesn't cut the mustard for such a task.

Bohemian
  • 412,405
  • 93
  • 575
  • 722
  • 4
    -1 Wrong: For syntax highlighting, you don't need to *parse* the language, you just need to *tokenize* it. You know, find out which part is a string literal or and identifier or an integer literal or a keyword. Wether there are parens around it, and wether they are balanced, is not relevant. –  Mar 31 '12 at 13:02
  • 3
    Sorry, that is false. Syntax highlighting does need to recognize nesting constructs. It depends on the language. For some languages, merely identifying tokens is not good enough. And some syntax highlighting does recognize unbalanced parentheses: e.g. Vim. – Kaz Mar 31 '12 at 18:13
  • 2
    @delnan I disagree, because it depends what kind of highlighting you want. If it's just keywords OK tokenize, but what if you want to highlight a matching curly bracket, or collapse a method (like Eclipse does) - you need to know about the language. Also, "parse" and "tokenize" have a fairly close meaning, especially when comparing to using regex. I have edited my answer to include "tokenize" and doing so I don't believe changes the basic meaning of my answer. Also, read the highest voted answer - it basically says what I am saying. – Bohemian Mar 31 '12 at 20:04
  • 1
    (1) Highlighting matching brackets and collapsing methods goes beyond most definitions of syntax highlighting (although there are quite a few programs that do all together). (2) Tokenization is *not* "close" to parsing, unless you count "a much simpler and restricted task in the same general direction, frequently used together" as "close". (3) The answer by A.H. is much more detailed, and (correctly) differentiates between the 90% that is best done with regexes and the 10% that one *may* want to do that can't really be done with regexes alone. –  Mar 31 '12 at 20:59
  • @Bohemian is correct. Try to write a regex that matches a js object. You cannot do it with a regex, because a regex is finite. CFGs *can be* infinite. A js object *can have infinitely* nested objects. Basic highlighting can be achieved by a regex, but true syntax highlighting requires the grammar of the language's syntax, hence *language syntax highlighting.* – Rafael Oct 18 '17 at 01:34
  • @Rafael, no, Bohemian is incorrect. Syntax highlighting (depending on how ambitious you want to get) does not necessarily imply needing to recognize whether the code is syntactically valid. See A.H's answer for more explanation of this. – Don Hatch Jan 24 '21 at 03:20
0

A good example to look at is the syntax highlighting implementation in Vim. It uses patterns which are regular-expresion based. However, the patterns are used to recognize hierarchical containment structures in the document, and not simply to tokenize the input.

You can declare regions which begin and end on a regex pattern match (plus another pattern that helps skip the middle material). These regions can declare that they contain other regions or simple patterns. Containment can be recursive. Vim works all of that out. So it is essentially a form of context-free parsing.

This approach can handle languages which have various levels of embedding, with different lexical properties.

For example, I have a language in which there are essentially two sets of keywords (due to a domain language embedding that is going on). The Vim syntax highlighting rules that I wrote recognize the context properly and colorize the keywords differently. Note that there is some overlap between these sets of keywords: same word, different meaning in a different context.

For an example of this see: http://www.kylheku.com/cgit/txr/tree/genman.txr. If you search for the syntax (do you will find that one instance is colored purple, and another green. They are different: one is in a text extraction language, and another in an embedded Lisp dialect. Vim's syntax highlighting is powerful enough to handle a mixture of languages with different sets of keywords. (Yes, though this is served over the web, it's actually a Vim process doing the syntax highlighting.)

Or consider something like the shell, where you can have a string literal type syntax, like "foo bar", but inside there, you can have a command substitution, inside which you have to recursively recognize and colorize shell syntax: "foo $(for x in *; do ...; done) bar".

So no, you can't do useful, accurate syntax higlighting just with regex tokenizing, but regexes with hierarchical parsing can do a good job.

Kaz
  • 55,781
  • 9
  • 100
  • 149