
I want to parse a LaTeX document and mark some of its terms with a special command. Specifically, I have a list of terms, say:

Astah
UML
use case
...

and I want to mark the first occurrence of Astah in the text with this custom command: \gloss{Astah}. So far, this works (using Python):

for g in glossary:
    pattern = re.compile(r'(\b' + g + r'\b)', re.I | re.M)
    text = pattern.sub(start + r'\1' + end, text, 1)

and it works fine.
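
For reference, a self-contained version of this first attempt might look like the sketch below. The values of `glossary`, `start`, and `end` are assumptions, since they are not shown above, and the helper name `mark_first` is hypothetical. `re.escape` is an addition that guards against terms containing regex metacharacters.

```python
import re

# Assumed values: the question does not show how these are defined.
glossary = ["Astah", "UML", "use case"]
start, end = r"\gloss{", "}"

def mark_first(text, terms):
    """Wrap the first occurrence of each term in the gloss command."""
    for term in terms:
        pattern = re.compile(r"\b(" + re.escape(term) + r")\b", re.I)
        # A function replacement avoids having to escape the backslash
        # of \gloss inside a replacement template.
        text = pattern.sub(lambda m: start + m.group(1) + end, text, count=1)
    return text

print(mark_first("Astah is a UML diagramming tool.", glossary))
```

This version still ignores comments and section titles, which is the problem described next.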

But then I found out that:

  • I don't want to match terms following a LaTeX inline comment (so terms preceded by one or more %)
  • and I don't want to match terms inside a section title (that is, \section{term} or \paragraph{term})

So I tried this:

for g in glossary:
    pattern = re.compile(r'(^[^%]*(?!section{))(\b' + g + r'\b)', re.I | re.M)
    text = pattern.sub(r'\1' + start + r'\2' + end, text, 1)

but it matches terms inside comments which are preceded by other characters and it also matches terms inside titles.

Is there something about the "greediness" of regexes that I don't understand? Or is the problem somewhere else?

As an example, if I have this text:

\section{Astah}
Astah is a UML diagramming tool... bla bla...
% use case:
A use case is a...

I would like to transform it into:

\section{Astah}
\gloss{Astah} is a \gloss{UML} diagramming tool... bla bla...
% use case:
A \gloss{use case} is a...
Willem Van Onsem
Giorgio

2 Answers


The trick here is to use a regex that starts matching at the start of the line, because that allows us to check if the word we're trying to match is preceded by a comment:

^([^%\n]*?)(?<!\\section{)(?<!\\paragraph{)\b(Astah)\b

This requires the multi-line flag m. Occurrences of this regex are to be replaced with \1\\gloss{\2}.
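
As a minimal sketch, the pattern above could be plugged into the question's glossary loop as follows. The function name `gloss_first` is hypothetical, and `re.escape` plus the escaped brace (`\{`) are additions for safety rather than part of the original pattern.

```python
import re

def gloss_first(text, terms):
    """Mark the first occurrence of each term outside comments and titles."""
    for term in terms:
        pattern = re.compile(
            r"^([^%\n]*?)(?<!\\section\{)(?<!\\paragraph\{)\b("
            + re.escape(term)
            + r")\b",
            re.I | re.M,
        )
        # Group 1 is the untouched start of the line, group 2 the term.
        text = pattern.sub(
            lambda m: m.group(1) + r"\gloss{" + m.group(2) + "}",
            text,
            count=1,
        )
    return text

sample = r"""\section{Astah}
Astah is a UML diagramming tool... bla bla...
% use case:
A use case is a...
"""
print(gloss_first(sample, ["Astah", "UML", "use case"]))
```

Note that the lookbehinds only reject a term that directly follows the opening brace, so a title such as \section{The Astah tool} would still be marked.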

Aran-Fey
  • Great, thank you. But isn't the `*?` on the 3rd and 4th lines redundant? shouldn't it be just `*`? – Giorgio Mar 04 '17 at 15:06
  • @George No, that's crucial to make sure the pattern matches the _first_ occurrence of the word. If you make them greedy, it'll match the last occurrence instead. – Aran-Fey Mar 04 '17 at 15:14
  • Ok, I see. And also, this regex works fine with a small file, but doesn't end if I apply it to a 1000-line file: maybe it's computationally too expensive. If I don't ask too much, do you think there's a faster equivalent regex? – Giorgio Mar 04 '17 at 15:21
  • @George Yes, I realized that I was overthinking too much... `^([^%\n]*?)(?<!\\section{)(?<!\\paragraph{)\b(Astah)\b` should be more efficient. (Requires multi-line flag `m`) – Aran-Fey Mar 04 '17 at 15:41
  • Yes, and also maybe just `[^%]*?` instead of `[^%\n]*?` should work in multiline mode, if I'm not wrong. Anyway, thanks and update your answer so that I can accept it. – Giorgio Mar 04 '17 at 16:20
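
The point about lazy versus greedy quantifiers in the comments can be checked with a small experiment. This is only an illustrative sketch: the sample line and the angle-bracket markers are made up, and the lookbehinds are omitted to keep the two patterns easy to compare.

```python
import re

line = "Astah here, Astah there"

# Lazy prefix: stops at the first place where the term can match.
lazy = re.sub(r"^([^%\n]*?)\b(Astah)\b", r"\1<\2>", line, count=1, flags=re.M)

# Greedy prefix: swallows the line, then backtracks to the last occurrence.
greedy = re.sub(r"^([^%\n]*)\b(Astah)\b", r"\1<\2>", line, count=1, flags=re.M)

print(lazy)    # <Astah> here, Astah there
print(greedy)  # Astah here, <Astah> there
```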

Here are my two cents:

First, we need the third-party regex module by Matthew Barnett. Among its many extra features, the added (*SKIP) and (*FAIL) verbs are useful in this case.

From the documentation:

  • Added (*PRUNE), (*SKIP) and (*FAIL) (Hg issue 153)

(*PRUNE) discards the backtracking info up to that point. When used in an atomic group or a lookaround, it won’t affect the enclosing pattern.

(*SKIP) is similar to (*PRUNE), except that it also sets where in the text the next attempt to match will start. When used in an atomic group or a lookaround, it won’t affect the enclosing pattern.

(*FAIL) causes immediate backtracking. (*F) is a permitted abbreviation.

So, let's build the pattern and test it with the regex module:

import regex

pattern = regex.compile(r'%.*(*SKIP)(*FAIL)|\\section{.*}(*SKIP)(*FAIL)|(Astah|UML|use case)')

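# Raw string, so that \s in \section is not treated as an escape sequence.
s = r"""
    \section{Astah}
    Astah is a UML diagramming tool... bla bla...
    % use case:
    A use case is a...
"""


# print() is a function in Python 3.
print(regex.sub(pattern, r'\\gloss{\1}', s))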

Output:

\section{Astah}
\gloss{Astah} is a \gloss{UML} diagramming tool... bla bla...
% use case:
A \gloss{use case} is a...

The pattern:

This sentence illustrates it well:

the trick is to match the various contexts we don't want so as to "neutralize them".

On the left side, we write the contexts we don't want. On the right side (the last part), we capture what we actually want. All contexts are separated by the alternation sign |, and only the last one (what we want) is captured.

Since we are performing a replacement in this case, we need (*SKIP)(*FAIL) to leave intact the matched parts we don't want to replace.

What the pattern means:

%.*(*SKIP)(*FAIL)|\\section{.*}(*SKIP)(*FAIL)|(Astah|UML|use case)

%.*(*SKIP)(*FAIL)              # Match a comment up to the end of the line, then skip it and fail
|                              # or
\\section{.*}(*SKIP)(*FAIL)    # Match a section title, then skip it and fail
|                              # or
(Astah|UML|use case)           # Match a glossary term and capture it

This simple trick is explained in more detail on RexEgg.
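
If installing the regex module is not an option, the same "neutralize the unwanted contexts" idea can be approximated with the standard re module and a replacement function. This is a sketch under stated assumptions, not the method shown above: the function name `wrap` is hypothetical, and the `\paragraph` alternative, the `[^}]*` title body, and the `\b` anchors are additions.

```python
import re

# Unwanted contexts come first; only the last alternative is captured.
pattern = re.compile(
    r"%.*"                                  # a comment up to the end of the line
    r"|\\(?:section|paragraph)\{[^}]*\}"    # a section or paragraph title
    r"|\b(Astah|UML|use case)\b"            # a glossary term (group 1)
)

def wrap(match):
    # Group 1 is None when one of the neutralized contexts matched,
    # so that text is returned unchanged.
    if match.group(1) is None:
        return match.group(0)
    return r"\gloss{" + match.group(1) + "}"

sample = r"""\section{Astah}
Astah is a UML diagramming tool... bla bla...
% use case:
A use case is a...
"""
result = pattern.sub(wrap, sample)
print(result)
```

Like the pattern above, this marks every occurrence outside the excluded contexts, not only the first one.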

Hope it helps.

JazZ