I want to parse a LaTeX document and mark some of its terms with a special command. Specifically, I have a list of terms, say:
Astah
UML
use case
...
and I want to mark the first occurrence of each term in the text with a custom command, for example \gloss{Astah} for Astah. So far, this works (using Python):
for g in glossary:
    pattern = re.compile(r'(\b' + g + r'\b)', re.I | re.M)
    text = pattern.sub(start + r'\1' + end, text, 1)
But then I realized two things:
- I don't want to match terms following a LaTeX inline comment (so terms preceded by one or more %)
- I don't want to match terms inside a section title (that is, \section{term} or \paragraph{term})
So I tried this:
for g in glossary:
    pattern = re.compile(r'(^[^%]*(?!section{))(\b' + g + r'\b)', re.I | re.M)
    text = pattern.sub(r'\1' + start + r'\2' + end, text, 1)
but it still matches terms inside comments that are preceded by other characters, and it also matches terms inside titles.
Is there something about the "greediness" of regexes that I don't understand, or is the problem somewhere else?
As an example, if I have this text:
\section{Astah}
Astah is a UML diagramming tool... bla bla...
% use case:
A use case is a...
I would like to transform it into:
\section{Astah}
\gloss{Astah} is a \gloss{UML} diagramming tool... bla bla...
% use case:
A \gloss{use case} is a...
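To pin down the behaviour I am after, here is a minimal line-by-line sketch that produces the output above for this sample. It is not a single-regex solution and probably not the best approach. The names mark_glossary_terms, _mark_in_line, TITLE and COMMENT are hypothetical helpers, and the values of start and end are my assumption. It also assumes that a comment begins at the first unescaped % on a line, that titles fit on one line without nested braces, and that a term never spans a line break.

```python
import re

# Assumption: a title is \section{...} or \paragraph{...} (optionally
# "sub"-prefixed or starred), on a single line, without nested braces.
TITLE = re.compile(r'\\(?:sub)*(?:section|paragraph)\*?\{[^{}]*\}')
# Assumption: a comment starts at the first % not escaped as \%.
COMMENT = re.compile(r'(?<!\\)%')


def _mark_in_line(line, pattern, start, end):
    """Return the line with its first eligible match wrapped, or None."""
    comment = COMMENT.search(line)
    limit = comment.start() if comment else len(line)
    # Spans of titles that lie before the comment on this line.
    titles = [t.span() for t in TITLE.finditer(line, 0, limit)]
    for m in pattern.finditer(line, 0, limit):
        if not any(s <= m.start() < e for s, e in titles):
            return line[:m.start()] + start + m.group(0) + end + line[m.end():]
    return None


def mark_glossary_terms(text, glossary, start='\\gloss{', end='}'):
    """Wrap the first non-comment, non-title occurrence of each term."""
    lines = text.split('\n')
    for term in glossary:
        # re.escape guards against regex metacharacters inside a term.
        pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.I)
        for i, line in enumerate(lines):
            marked = _mark_in_line(line, pattern, start, end)
            if marked is not None:
                lines[i] = marked
                break  # only the first eligible occurrence of each term
    return '\n'.join(lines)


sample = '\n'.join([
    r'\section{Astah}',
    'Astah is a UML diagramming tool... bla bla...',
    '% use case:',
    'A use case is a...',
])
print(mark_glossary_terms(sample, ['Astah', 'UML', 'use case']))
```

Building the replacement by slicing the string, rather than passing a template to sub, avoids having to escape the backslash in \gloss for the replacement syntax. One known gap in this sketch: a later term could still match inside a \gloss{...} inserted for an earlier term.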