
Suppose I have a string that contains part of a LaTeX file. How can I use Python's re module to remove every math expression from it?

For example:

text = r"This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."

I would like my function to return ans="This is an example . How to remove it? Another random math expression ...".

Thank you!

Ben
  • For those readers of your question who do not know LaTeX markup syntax, you should state the rules for deciding what should and should not be removed from your text. – Tim Biegeleisen Feb 13 '19 at 06:31
  • Your question can be broken down into two separate questions: 1. How to recognize math expressions in LaTeX files (what pattern they follow). 2. How to use the 're' module to remove a substring from a string. Do you have the answer to the first part? Can you let us know, or is it an actual part of your question? – jberrio Feb 13 '19 at 06:33
  • Let me put it this way: is there a string such that re.compile(r'that string') will give me the locations of all math expressions when I run re.finditer? – Ben Feb 13 '19 at 06:35
  • @Ben You are not answering our questions, and therefore we may not be able to help you. – Tim Biegeleisen Feb 13 '19 at 06:35
  • Whenever I see $something inside$ or $$something inside$$, it is a math expression. – Ben Feb 13 '19 at 06:36
  • @Ben For the given example, you can use the regex [`(\$+)(?:(?!\1)[\s\S])*\1`](https://regex101.com/r/3vGNhE/1). Here is the [code](https://regex101.com/r/3vGNhE/1/codegen?language=python) – Gurmanjot Singh Feb 13 '19 at 06:46

2 Answers


Try this Regex:

(\$+)(?:(?!\1)[\s\S])*\1


Explanation:

  • (\$+) - matches 1+ occurrences of $ and captures it in Group 1
  • (?:(?!\1)[\s\S])* - matches 0+ characters, one at a time, checking before each that the text captured in Group 1 does not start at that position; [\s\S] matches any character, including a newline
  • \1 - matches the contents of Group 1 again

Replace each match with an empty string.

As suggested by @torek, the pattern should not treat 3 or more consecutive $ as a single delimiter, so the expression becomes (\${1,2})(?:(?!\1)[\s\S])*\1
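A minimal sketch of how this pattern could be used from Python, assuming the input is the example string from the question. The names MATH_RE and remove_math are only illustrative; they are not part of the re module.

```python
import re

# Pattern from this answer: one or two dollar signs, then any characters
# up to the next occurrence of the same delimiter.
MATH_RE = re.compile(r"(\${1,2})(?:(?!\1)[\s\S])*\1")


def remove_math(text):
    """Return text with every $...$ and $$...$$ span deleted."""
    return MATH_RE.sub("", text)


# A raw string keeps the backslashes in \text and \mathbb intact.
text = r"This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."
print(remove_math(text))
# This is an example . How to remove it? Another random math expression ...
```

If the positions of the matches are needed rather than the cleaned string, MATH_RE.finditer(text) yields match objects whose span() gives the start and end offsets.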

Gurmanjot Singh
  • For this particular case, replace the `\$+` above with `\${1,2}` since we don't want to match `$$$` as a single symbol. – torek Feb 13 '19 at 06:55
  • I think this expression is good enough for my purposes. Thanks. – Ben Feb 13 '19 at 18:26

It's commonly said that regular expressions cannot count, which is a loose way of describing a problem discussed more formally in Count parentheses with regular expression. See that question for what this means.

Now, with that in mind, note that LaTeX math expressions can include nested sub-equations, which can include further nested sub-equations, and so on. This is analogous to the problem of detecting whether a closing parenthesis closes an inner parenthesized expression (as in (for instance) this example, where the first one does not) or an outer parenthesis. Therefore, regular expressions are not going to be powerful enough to handle the full general case.
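To make the limitation concrete, here is a small sketch that runs the (\${1,2})(?:(?!\1)[\s\S])*\1 pattern from the other answer on a made-up input where a $...$ span contains another $...$ inside \text{}. The input string and the variable names are only illustrative.

```python
import re

MATH_RE = re.compile(r"(\${1,2})(?:(?!\1)[\s\S])*\1")

# Inline math that itself contains inline math inside \text{...}.
nested = r"x $a \text{$b$} c$ y"

# The regex cannot tell an opening $ from a closing one, so it pairs the
# outer opening $ with the inner opening $, and the inner closing $ with
# the outer closing $.
spans = [m.group(0) for m in MATH_RE.finditer(nested)]
result = MATH_RE.sub("", nested)

print(spans)   # ['$a \\text{$', '$} c$']
print(result)  # x b y
```

The inner b survives even though it was part of the math expression, which is the "cannot count" problem in miniature.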

If you're willing to do a less-than-complete job, you can construct a regular expression that finds $...$ and $$...$$. You will need to pay attention to the particular regular expression language available. Python's re syntax is essentially the same as Perl's for this purpose.

Importantly, these $-matchers will completely miss \begin{equation} ... \end{equation}, \begin{eqnarray} ... \end{eqnarray}, and so on. We've already noted that a mere regular expression recognizer is inadequate for parsing LaTeX expressions. If you want to do a good job (while still ignoring the complexity of lower-level TeX manipulation of token types, where one can change any individual character's category code), you will want a more general parser. You can then tokenize \begin, {, }, and words, and match up the begin/end pairs. You can also tokenize $ and $$ and match those up. Since parsers can count, in exactly the way that regular expressions can't, you can do a much better job this way.
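As a rough sketch of that idea, and not a full LaTeX parser, the following tokenizes the input with a regular expression and then uses a stack to pair up delimiters. The names TOKEN_RE, MATH_ENVS and strip_math are made up for this example, and the environment list is an assumption that is certainly incomplete. It ignores \( \), \[ \], comments, verbatim text, and category-code changes.

```python
import re

# Assumed, incomplete set of environments that start math mode.
MATH_ENVS = {"equation", "equation*", "eqnarray", "eqnarray*", "align", "align*"}

TOKEN_RE = re.compile(
    r"""
      \\begin\{(?P<begin>[^{}]*)\}   # start of an environment
    | \\end\{(?P<end>[^{}]*)\}       # end of an environment
    | \\[A-Za-z]+                    # control word
    | \\[\s\S]                       # control symbol, e.g. an escaped dollar
    | \$\$ | \$                      # math delimiters
    | [{}]                           # grouping braces
    | [^\\${}]+                      # run of ordinary text
    | [\s\S]                         # anything else (lone trailing backslash)
    """,
    re.VERBOSE,
)


def strip_math(text):
    """Return text with $...$, $$...$$ and known math environments removed."""
    out = []    # tokens that lie entirely outside math
    stack = []  # open delimiters, environments and braces, innermost last
    for m in TOKEN_RE.finditer(text):
        tok = m.group(0)
        was_in_math = bool(stack)
        if m.group("begin") is not None:
            # Inside math, track every environment so its end can be paired.
            if stack or m.group("begin") in MATH_ENVS:
                stack.append(("env", m.group("begin")))
        elif m.group("end") is not None:
            if stack and stack[-1] == ("env", m.group("end")):
                stack.pop()
        elif tok in ("$", "$$"):
            if stack and stack[-1] == tok:
                stack.pop()        # closes the innermost open delimiter
            elif tok == "$$" and stack and stack[-1] == "$":
                pass               # closes one inline span and opens the next
            else:
                stack.append(tok)  # opens a new, possibly nested, span
        elif tok == "{" and stack:
            stack.append("{")      # braces only matter while inside math
        elif tok == "}" and stack and stack[-1] == "{":
            stack.pop()
        # Keep a token only if we were outside math before and after it.
        if not was_in_math and not stack:
            out.append(tok)
    return "".join(out)


print(strip_math(r"x $a \text{$b$} c$ y"))                         # x  y
print(strip_math(r"A \begin{equation} E = mc^2 \end{equation} B"))  # A  B
print(strip_math(r"costs \$5 and $x$"))                             # costs \$5 and 
```

Because the braces of \text{...} are pushed on the stack, the inner $ is recognized as opening a nested span rather than closing the outer one, and an escaped \$ is never treated as a delimiter.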

torek