
I have a dictionary for LaTeX commands/html entities:

var translations = [
    {tex: '\\latex', html: 'LaTeX'},
    {tex: '\\cup', html: '∪'},
    {tex: '\\cap', html: '∩'},
    {tex: '\\ldots', html: '…'},
    {tex: '\\leftarrow', html: '←'},
    {tex: '\\leftrightarrow', html: '↔'}
    ...
];

Now I want to replace each LaTeX command by its html entity. I guess the best basic structure is like this:

function translateFromTexToHTML(string) {
    var i, re; // declared locally to avoid implicit globals
    for (i = 0; i < translations.length; i += 1) {
        re = new RegExp('...\\' + translations[i].tex + '...');
        string = string.replace(re, '...' + translations[i].html);
    }
    return string;
}

Unfortunately, I cannot figure out which regular expression I need. I tried this:

var re = new RegExp('\\' + translations[k].tex + '([^a-zA-Z])', 'g');
string = string.replace(re, translations[k].html + '$1');

This partly works, for example,

\leftarrow \leftrightarrow becomes ← ↔

But, for example,

\leftarrow\leftrightarrow becomes ←\leftrightarrow instead of ←↔

I guess it is because the backslash of the second command becomes part of the match for the first and hence is not matched anymore.

Also is the basic structure efficient?

Help much appreciated.

Daniel
  • I have checked the last regex, and it seems you just consume the character after the *tex*. Put it into a lookahead: `\\leftarrow(?=[^a-zA-Z])`. Or just use a word boundary, `\\leftarrow\b` (which means: match `w` only before a non-word character, i.e. one not in `[a-zA-Z0-9_]`). That is, `var re = RegExp('\\' + Tools.SVG.translations[k].tex + '\\b', 'g');`. – Wiktor Stribiżew Nov 23 '15 at 08:33
  • Thanks. Unfortunately it does not work for commands at the end of the string. (LaTeX actually accepts `\leftarrow7` as the command `\leftarrow` followed by the (non-command) number `7`. So the word boundary does not work. But the lookahead does.) – Daniel Nov 23 '15 at 08:47
  • Just as a note: I also wanted to remove an optional space at the end of the command, so it becomes possible to have `A\leftarrow B` come out with minimal space as `A←B`: `(\\s|(?![a-zA-Z]))` – Daniel Nov 23 '15 at 11:23
  • You can even use a non-capturing group then: `(?:\\s|(?![a-zA-Z]))` (to keep regex capture buffer clean). – Wiktor Stribiżew Nov 23 '15 at 11:28
  • Thanks. Though I am not sure what it means to 'keep regex capture buffer clean'... – Daniel Nov 23 '15 at 11:40
  • When you specify a capturing group, the regex engine keeps the submatch text inside its buffer. It slows down the regex performance just a tiny bit, so if you need parentheses just for grouping, it is best practice to use a non-capturing group. – Wiktor Stribiżew Nov 23 '15 at 11:42
  • @Daniel would you mind sharing your LaTeX commands/html entities dictionary with me? I'm looking for exactly this! Thanks :) – Maurits Moeys Feb 01 '17 at 10:04
  • @MauritsMoeys Sorry, I had only a very short dictionary implemented and did not continue with it since I went for MathJax instead. – Daniel Feb 01 '17 at 12:42
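
The optional-space variant from the comments above can be sketched as follows. This is only an illustration: the function name `translateAndTrim` and passing the dictionary as a parameter are assumptions, not part of the original code.

```javascript
// Hypothetical helper: replaces each command and swallows one optional
// whitespace character directly after it.
function translateAndTrim(string, translations) {
    for (var i = 0; i < translations.length; i += 1) {
        // Either consume a single whitespace character after the command,
        // or merely assert (without consuming) that no letter follows.
        var re = new RegExp('\\' + translations[i].tex + '(?:\\s|(?![a-zA-Z]))', 'g');
        string = string.replace(re, translations[i].html);
    }
    return string;
}
```

With a dictionary containing only `\leftarrow`, the input `A\leftarrow B` should come out as `A←B`, while `\leftarrowx` is left alone because a letter follows the command name.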

1 Answer


The issue is that the last subpattern in your regex is a negated character class, which consumes a character. When two commands are adjacent, that character is the backslash of the next command, so the next command can no longer be matched in a later iteration.

Just place it inside a negative lookahead with a non-negated character class:

\\leftarrow(?![a-zA-Z])

or

var re = RegExp('\\' + translations[k].tex + '(?![a-zA-Z])', 'g');
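
Put together with the loop from the question, this could look roughly like the sketch below. The dictionary here is a shortened, illustrative copy of the one in the question.

```javascript
// Shortened dictionary in the question's format (illustrative entries only).
var translations = [
    {tex: '\\cup', html: '∪'},
    {tex: '\\leftarrow', html: '←'},
    {tex: '\\leftrightarrow', html: '↔'}
];

function translateFromTexToHTML(string) {
    for (var i = 0; i < translations.length; i += 1) {
        // '\\' + tex produces a pattern that matches one literal backslash
        // followed by the command name; (?![a-zA-Z]) asserts that no letter
        // follows, without consuming the next character.
        var re = new RegExp('\\' + translations[i].tex + '(?![a-zA-Z])', 'g');
        string = string.replace(re, translations[i].html);
    }
    return string;
}
```

Because the lookahead consumes nothing, `\leftarrow\leftrightarrow` should now become `←↔`, and a command at the very end of the string is matched as well.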

See regex demo

See more on how negative lookahead works (and in general, lookarounds).
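
As for efficiency: one possible alternative to building a regex per dictionary entry is a single alternation with a replacement callback, so the string is scanned once. The sketch below assumes every command name consists of letters only (so no regex escaping is needed); `lookup`, `names` and `translateInOnePass` are hypothetical names, and whether this is actually faster for a given dictionary would need measuring.

```javascript
// Shortened dictionary in the question's format (illustrative entries only).
var translations = [
    {tex: '\\cup', html: '∪'},
    {tex: '\\leftarrow', html: '←'},
    {tex: '\\leftrightarrow', html: '↔'}
];

// Map each command (including its backslash) to its replacement, and
// collect the bare command names for the alternation.
var lookup = {};
var names = [];
translations.forEach(function (t) {
    lookup[t.tex] = t.html;
    names.push(t.tex.slice(1)); // drop the leading backslash
});

// One pattern for all commands: backslash, any known name, no letter after it.
var texCommand = new RegExp('\\\\(?:' + names.join('|') + ')(?![a-zA-Z])', 'g');

function translateInOnePass(string) {
    // The matched text equals the dictionary key, so it can be looked up directly.
    return string.replace(texCommand, function (match) {
        return lookup[match];
    });
}
```

The lookahead also keeps a shorter name from winning over a longer one that starts with it: if a prefix matches but a letter follows, the engine backtracks and tries the other alternatives.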

Wiktor Stribiżew
  • Sorry, I messed up the variable names by copy and paste (`Tools.SVG.translations = translations`). I added an end of string character `$` to the lookahead to also match end of string: `(?=[^a-zA-Z]|$)`. Is that correct? – Daniel Nov 23 '15 at 08:52
  • Use the negative lookahead with a positive character class; that way you won't need to specify the end-of-string alternative. I added a regex demo. – Wiktor Stribiżew Nov 23 '15 at 08:53