For the Google Prettify syntax highlighter for the Wolfram Language, I need to match all identifiers against a large list of about 7000 built-in function names to highlight them as keywords. In the past, I simply used a regex consisting of many alternations. To give a concrete example, here are all functions that start with Plot
:
(:?Plot|Plot3D|Plot3Matrix|PlotDivision|PlotJoined|PlotLabel|PlotLabels|PlotLayout|PlotLegends|PlotMarkers|PlotPoints|PlotRange|PlotRangeClipping|PlotRangeClipPlanesStyle|PlotRangePadding|PlotRegion|PlotStyle|PlotTheme)
Theoretically, this is a bad choice because sub-matches need to be done repeatedly for each alternative keyword. What one could do is to build a Trie (or prefix tree) and construct a regex from it. This would skip whole all sub-matches if the prefix is not correct. For the above words, this would look like this
(:?Plot(?:3(?:D|Matrix)|Division|Joined|L(?:a(?:bel(?:s)?|yout)|egends)|Markers|Points|R(?:ange(?:Clip(?:PlanesStyle|ping)|Padding)?|egion)|Style|Theme)?)
Although this looks a bit messy, the idea is simple: it tests the prefix Plot
and if doesn't match, no further testing is required. If it does match, then it jumps into the inner expression and tests the next prefixes until it either has a complete match or not.
I have timed regex for all 7000 keywords matching against themselves and 7000 dictionary words as negative examples with Kotlin. The Trie implementation needs 78ms, while the alternation regex needs 523ms. That's the vast improvement I expected.
However, in JavaScript, the Trie implementation seems to be a bit slower. For the purpose of profiling, I set up 2 webpages that call the prettify engine with the two different methods. Here is the Trie implementation and here is the alternation implementation. I then used Chrome dev tools to profile this but I'm not particularly experienced with JS.
Questions:
- Can someone explain, why the Trie regex appears to be slower when (a) the regex itself is smaller (although heavily nested) and (b) the regex should in theory make many shortcuts when matching?
- Given the two test-sites or even the pure regex here and here, can someone tell me how to properly profile this? The list of keywords and dict-words are available here.