Programmatically converting/parsing LaTeX code to plain text

Question

I have a couple of code projects in C++/Python in which LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks. However, we also have some plain text outputs, such as an HTML version of the documentation (I already have code to write minimal markup for that) and a non-TeX-enabled plot renderer.

For these I would like to eliminate the TeX markup that is necessary for e.g. representing physical units. This includes non-breaking (thin) spaces, \text, \mathrm etc. It would also be nice to parse down things like \frac{#1}{#2} into #1/#2 for the plain text output (and use MathJax for the HTML). Due to the system that we've got at the moment, I need to be able to do this from Python, i.e. ideally I'm looking for a Python package, but a non-Python executable which I can call from Python and catch the output string would also be fine.

I'm aware of the similar question on the TeX StackExchange site, but there weren't really any programmatic solutions there. I've looked at detex, plasTeX and pytex, which all seem a bit dead and don't really do what I need: programmatic conversion of a TeX string to a representative plain text string.

I could try writing a basic TeX parser using e.g. pyparsing, but a) that might be pitfall-laden and help would be appreciated and b) surely someone has tried that before, or knows of a way to hook into TeX itself to get a better result?

Update: Thanks for all the answers... it does indeed seem to be a bit of an awkward request! I can make do with less than general parsing of LaTeX, but the reason for considering a parser rather than a load of regexes in a loop is that I want to be able to handle nested macros and multi-arg macros nicely, and get the brace matching to work properly. Then I can e.g. reduce txt-irrelevant macros like \text and \mathrm first, and handle txt-relevant ones like \frac last... maybe even with appropriate parentheses! Well, I can dream... for now regexes are not doing such a terrible job.
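To make the brace-matching idea above concrete, here is a minimal sketch of a recursive converter for a small LaTeX subset, using only the standard library. It is an illustration under stated assumptions, not a tested solution: the names tex_to_text, UNWRAP and SPACES are made up for this example, the macro lists are guesses at what a units-and-labels use case needs, and environments, \def and optional arguments are not handled at all.

```python
import re

# Hypothetical macro tables for this sketch; extend them to suit your labels.
# Macros whose single argument is kept while the wrapper is dropped.
UNWRAP = {"text", "mathrm", "textrm", "mathbf", "textbf", "emph"}
# Spacing commands and the line break, mapped to their plain-text stand-ins.
SPACES = {",": " ", ";": " ", ":": " ", "!": "", " ": " ",
          "quad": " ", "qquad": " ", "\\": " "}


def _read_name(s, i):
    """Read a macro name starting just after a backslash."""
    if i >= len(s):
        return "", i
    if not s[i].isalpha():          # control symbol such as \, or \%
        return s[i], i + 1
    j = i
    while j < len(s) and s[j].isalpha():
        j += 1
    return s[i:j], j


def _read_arg(s, i):
    """Read one macro argument: a braced group or a single token."""
    while i < len(s) and s[i].isspace():
        i += 1
    if i >= len(s):
        return "", i
    if s[i] == "{":
        return _convert(s, i + 1, True)
    if s[i] == "\\":
        return _read_name(s, i + 1)
    return s[i], i + 1


def _paren(t):
    """Parenthesise a fraction operand if it contains an operator or space."""
    return "(" + t + ")" if re.search(r"[\s+\-*/^]", t) else t


def _convert(s, i, stop_at_brace):
    out = []
    while i < len(s):
        c = s[i]
        if c == "}":
            if stop_at_brace:       # end of the group we were asked to read
                return "".join(out), i + 1
            i += 1                  # stray closing brace: drop it
        elif c == "{":
            inner, i = _convert(s, i + 1, True)
            out.append(inner)
        elif c == "~":              # non-breaking space
            out.append(" ")
            i += 1
        elif c == "$":              # math delimiters carry no text
            i += 1
        elif c == "\\":
            name, i = _read_name(s, i + 1)
            if name in UNWRAP:
                arg, i = _read_arg(s, i)
                out.append(arg)
            elif name == "frac":
                num, i = _read_arg(s, i)
                den, i = _read_arg(s, i)
                out.append(_paren(num) + "/" + _paren(den))
            elif name in SPACES:
                out.append(SPACES[name])
            else:                   # unknown macro: keep its bare name
                out.append(name)
        else:
            out.append(c)
            i += 1
    return "".join(out), i


def tex_to_text(s):
    """Convert a LaTeX fragment to an approximate plain-text string."""
    return _convert(s, 0, False)[0]
```

With these tables, tex_to_text(r"\frac{a}{b/c}") gives a/(b/c) and tex_to_text(r"10\,\text{GeV}") gives 10 GeV. Because arguments are converted recursively before \frac is assembled, nested macros come out innermost-first, which is the ordering a loop of regexes struggles to guarantee.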

You are right, pyparsing of TeX is a brutal thing, but others have made some headway with this. matplotlib contains a pyparsing TeX parser that you can perhaps bend to your purpose. You could also try posting on the pyparsing mail list and see if some of those who have done TeX work in the past might be able to help. — PaulMcG, Jan 25 '11 at 14:12
See http://stackoverflow.com/questions/3610551/math-in-restructuredtext-with-latex. — S.Lott, Jan 31 '11 at 21:00
Thanks: I'll look first in matplotlib... that's also a pre-existing dependency for one of my packages, so if I'm _very_ lucky I can use it via the mpl API! Cheers :) — andybuckley, Jan 31 '11 at 21:10

score 12 · Answer 1 · answered May 03 '18 at 09:11

I understand this is an old post, but since it comes up often in latex-python-parsing searches (as evidenced by "Extract only body text from arXiv articles formatted as .tex"), I'm leaving this here for folks down the line: here's a LaTeX parser in Python that supports search over and modification of the parse tree, https://github.com/alvinwan/texsoup. Taken from the README, here is sample text and how you can interact with it via TexSoup.

from TexSoup import TexSoup
soup = TexSoup(r"""
\begin{document}

\section{Hello \textit{world}.}

\subsection{Watermelon}

(n.) A sacred fruit. Also known as:

\begin{itemize}
\item red lemon
\item life
\end{itemize}

Here is the prevalence of each synonym.

\begin{tabular}{c c}
red lemon & uncommon \\
life & common
\end{tabular}

\end{document}
""")

Here's how to navigate the parse tree.

>>> soup.section  # grabs the first `section`
\section{Hello \textit{world}.}
>>> soup.section.name
'section'
>>> soup.section.string
'Hello \\textit{world}.'
>>> soup.section.parent.name
'document'
>>> soup.tabular
\begin{tabular}{c c}
red lemon & uncommon \\
life & common
\end{tabular}
>>> soup.tabular.args[0]
'c c'
>>> soup.item
\item red lemon
>>> list(soup.find_all('item'))
[\item red lemon, \item life]

Disclaimer: I wrote this lib, but it was for similar reasons. Regarding the post by Little Bobby Tables (about \def), TexSoup doesn't handle definitions.

score 8 · Answer 2 · answered Jan 25 '11 at 14:14

A word of caution: it is much more difficult to write a complete parser for plain TeX than you might think. The TeX-level (not LaTeX) \def command actually extends TeX's syntax. For example, after \def\foo #1.{{\bf #1}}, the input \foo goo. expands to {\bf goo}, i.e. a bold "goo": notice that the dot became a delimiter for the \foo macro! Therefore, if you have to deal with any form of TeX, without restrictions on which packages may be used, it is not recommended to rely on simple parsing. You need TeX rendering. catdvi is what I use, although it is not perfect.

Little Bobby Tables

I can make do with less general parsing than that, but thanks for the reminder! I can restrict the usage to a more reasonable subset of LaTeX -- the reason for considering a parser rather than a load of regexes in a loop is that I want to be able to handle nested macros and multi-arg macros nicely, and get the brace matching to work properly. Then I can e.g. reduce txt-irrelevant macros like \text and \mathrm first, and handle txt-relevant ones like \frac last... maybe even with appropriate parentheses! – andybuckley Jan 31 '11 at 20:55

Answer 3 · answered Jan 30 '21 at 16:26 · codeMonkey

Necroing this old thread, but found this nifty library called pylatexenc that seems to do almost exactly what the OP was after:

from pylatexenc.latex2text import LatexNodes2Text

# Convert a LaTeX snippet to plain text and print the result
print(LatexNodes2Text().latex_to_text(r"""\
\section{Euler}
\emph{This} bit is \textbf{very} clever:
\begin{equation}
    \mathrm{e}^{i \pi} + 1 = 0  % wow!!
\end{equation}
where
\[
\mathrm{e} = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n
\]
"""))

which produces


§ EULER

This bit is very clever:

    e^i π + 1 = 0

where

    e = lim_n →∞(1 + 1/n)^n

As you can see, the result is not perfect for the equations, but it does a great job of stripping and converting all the TeX commands.

score 2 · Answer 4 · answered Jan 25 '11 at 10:09

Try detex (shipped with most *TeX distributions), or the improved version: http://code.google.com/p/opendetex/

Edit: oh, I see you tried detex already. Still, opendetex might work for you.

abesto

I hadn't seen opendetex before -- it looks much better, and maybe their parser can be hooked into and extended to do more structured things with commands in math mode. Thanks. – andybuckley Jan 31 '11 at 21:52

score 2 · Answer 5 · answered Jan 25 '11 at 12:14

I would try pandoc (http://johnmacfarlane.net/pandoc/index.html). It is written in Haskell, but it is a really nice LaTeX-to-whatever converter.
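Since the question allows for an external executable that is called from Python with its output captured, here is a minimal sketch of driving pandoc that way. It assumes a pandoc binary is on the PATH; the function name latex_to_plain is made up for this example, and how well pandoc's plain writer renders any particular math fragment is something you would have to check against your own labels.

```python
import shutil
import subprocess


def latex_to_plain(latex, executable="pandoc"):
    """Return pandoc's plain-text rendering of a LaTeX string.

    Returns None when the executable cannot be found, so callers can
    fall back to another strategy instead of crashing.
    """
    path = shutil.which(executable)
    if path is None:
        return None
    # pandoc reads from standard input when no input file is given.
    result = subprocess.run(
        [path, "--from=latex", "--to=plain"],
        input=latex,
        capture_output=True,
        text=True,
        check=True,   # raise CalledProcessError if pandoc reports failure
    )
    return result.stdout.strip()
```

Each call starts a new process, so for thousands of labels it would probably pay to batch several fragments into one input and split the output afterwards.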

Eduardo Leoni

I wish that being in Haskell wasn't a problem, but it is: I can't really distribute code which relies on a non-standard program and on users having a Haskell compiler! As far as I can tell there are no real Python-Haskell bindings, either, which is not a killer but doesn't help :) I will use it privately, though -- thanks! – andybuckley Jan 31 '11 at 20:50

score 1 · Answer 6 · answered Jan 25 '11 at 11:13

As you're considering using TeX itself for doing the rendering, I suspect that performance is not an issue. In this case you've got a couple of options: dvi2txt to fetch your text from a single dvi file (be prepared to generate one for each label) or even rendering dvi into raster images, if it's ok for you - that's how hevea or latex2html treats formulas.

SK-logic

Thanks for the comments. Actually, we have thousands of labels to parse (this is optimised a bit for plot generation, and we'd like to speed it up a bit more). But very simple LaTeX documents might process acceptably fast, and batching several labels in one TeX doc might be do-able -- I'll give it a go. AFAIK the startup time of LaTeX is likely to dominate in this case, so something like the LaTeX daemon that had been worked on in PyTeX would be useful... if only that project were still alive! – andybuckley Jan 31 '11 at 21:49

score 0 · Answer 7 · answered Feb 01 '11 at 02:58

Building on the other post by Eduardo Leoni: I was looking at pandoc and I see that it comes with a standalone executable, but on this page it also promises a way to build a C-callable system library. Perhaps this is something that you can live with?

Joel Berger

score -5 · Answer 8 · answered Jan 25 '11 at 10:59

"LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks"

This is your mistake. You shouldn't have done that.

Use RST or some other -- better -- markup language.

Use Docutils to create LaTeX and HTML from the RST source.

S.Lott

Thanks for your comments! It's not a mistake, though -- the software is for use in academic physics and we use LaTeX for its math parsing/rendering -- probably 50% or more of the encoded text is math -- and output which can be used seamlessly in (LaTeX-prepared) publications. So while I might agree re. RST in text-dominated cases where very detailed control over formatting isn't required, this use case is pretty much the opposite and LaTeX is much better fitted to the application and the user community. It's just awkward to do flexible things with it... – andybuckley Jan 31 '11 at 20:38
@andybuckley: RST supports LaTeX math. I've used it. I prefer the support in sphinx (http://sphinx.pocoo.org/). See this related question http://stackoverflow.com/questions/3610551/math-in-restructuredtext-with-latex for more helpful advice. – S.Lott Jan 31 '11 at 20:59
@andybuckley: "It's not a mistake". If it doesn't work, there has to be a mistake somewhere. If there's no mistake, it must work perfectly. If it works perfectly, why ask a question? – S.Lott Jan 31 '11 at 21:05
I guess there's a misunderstanding -- my TeX fragments are being used principally as input to a plotting system which has very little overlap with RST's domain of applicability. I would _secondarily_ like to extract a reasonably-readable plain text version of these math-dominated fragments for web and command-line display. Sphinx's LaTeX support is, as far as I can tell, done by forking LaTeX and dvipng processes: that won't help with e.g. rendering \frac{a}{b/c} as a/(b/c). – andybuckley Jan 31 '11 at 21:39
A third option is that there is no obvious, off-the-shelf way to satisfy all the requirements simultaneously! Finding out whether there is a _non_-obvious, but still off-the-shelf method was the reason to ask the question :) My system doesn't work perfectly, but LaTeX was a sensible choice for our most fundamental use cases -- TeX -> txt parsing is desirable as a nice usability feature for less critical aspects of the applications. But I didn't want to burden the OP with all that detail! – andybuckley Jan 31 '11 at 21:45
