
I am writing a LaTeX-to-text converter, building on a well-known Python parser for LaTeX (python-latex). I keep improving it, but I now have a problem parsing multiple commands on one line. A LaTeX command can take any of the following four forms:

\commandname
\commandname[text]
\commandname{other text}
\commandname[text]{other text}

Assuming that commands are not split across lines, and that the text may contain spaces (but the command name may not), I ended up with the following regexp to catch a command in a line:

'(\\.+\[*.*\]*\{.*\})'

A sample program works as expected:

import re
string = r"\documentclass[this is an option]{this is a text} this is other text ..."
re.split(r'(\\.+\[*.*\]*\{.*\}|\w+)+?', string)

>>>['', '\\documentclass[this is an option]{this is a text}', ' ', 'this', ' ', 'is', ' ', 'other', ' ', 'text', ' ...']

Well, to be honest, I would prefer an output like this:

>>> [ '\\documentclass[this is an option]{this is a text}', 'this is other text ...' ]

But the first one can work anyway. My problem arises when one line contains more than one command, as in the following example:

dstring = string + r" \emph{tt}"
print (dstring)
\documentclass[this is an option]{this is a text} this is other text ... \emph{tt}
re.split(r'(\\.+\[*.*\]*\{.*\}|\w+)+?', dstring)
['', '\\documentclass[this is an option]{this is a text} this is other text ... \\emph{tt}', '']

As you can see, the result is quite different from the one that I would like:

[ '\\documentclass[this is an option]{this is a text}', 'this is other text ...', '\\emph{tt}']

I have tried lookahead and lookbehind assertions, but since lookbehind patterns must have a fixed width, I could not make them work here. I hope there is a solution.

Thank you!

ssc-hrep3
Am Ma
  • Latex commands can be nested ([sample](http://tex.stackexchange.com/questions/6659/nested-commands-and-their-arguments) from stack exchange). Therefore they [cannot be parsed with regular expressions](http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns). – spectras Jan 27 '17 at 16:05
  • If they can be nested they can be parsed with PyPi regex module that supports recursion. To a certain extent though. A regular parser would be way more efficient and safe. – Wiktor Stribiżew Jan 27 '17 at 16:19
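
To illustrate the point about nesting raised in the comments, here is a minimal sketch of a hand-written scanner that tracks bracket depth instead of relying on a regex. The function name `split_latex` and its helper are hypothetical, it is not how any particular library works, and it assumes braces are not escaped (for example `\{`) and that a command name consists of letters only.

```python
def split_latex(line):
    """Split a line into top-level commands and the plain text between them."""
    pieces, start, i, n = [], 0, 0, len(line)

    def skip_group(pos, open_ch, close_ch):
        # Return the index just past a balanced group starting at pos,
        # or pos unchanged if there is no group there or it never closes.
        if pos >= n or line[pos] != open_ch:
            return pos
        depth = 0
        for k in range(pos, n):
            if line[k] == open_ch:
                depth += 1
            elif line[k] == close_ch:
                depth -= 1
                if depth == 0:
                    return k + 1
        return pos

    while i < n:
        if line[i] == "\\" and i + 1 < n and line[i + 1].isalpha():
            # Flush the plain text collected before this command.
            text = line[start:i].strip()
            if text:
                pieces.append(text)
            # Read the command name, then an optional [..] and an optional {..}.
            j = i + 1
            while j < n and line[j].isalpha():
                j += 1
            j = skip_group(j, "[", "]")
            j = skip_group(j, "{", "}")
            pieces.append(line[i:j])
            start = i = j
        else:
            i += 1

    tail = line[start:].strip()
    if tail:
        pieces.append(tail)
    return pieces


print(split_latex(r"\textbf{bold \emph{both}} tail"))
# prints ['\\textbf{bold \\emph{both}}', 'tail']
```

Because the depth counter only returns when the outermost brace closes, a nested command such as `\emph{both}` stays inside its parent's argument rather than ending the match early.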

1 Answer


You can accomplish this with TexSoup (github.com/alvinwan/TexSoup). It gives the output you want, albeit with whitespace preserved.

>>> from TexSoup import TexSoup
>>> string = r"\documentclass[this is an option]{this is a text} this is other text ..."
>>> soup = TexSoup(string)
>>> list(soup.contents)
[\documentclass[this is an option]{this is a text}, ' this is other text ...']
>>> string2 = string + r"\emph{tt}"
>>> soup2 = TexSoup(string2)
>>> list(soup2.contents)
[\documentclass[this is an option]{this is a text}, ' this is other text ...', \emph{tt}]

Disclaimer: I know (1) I'm posting over a year later and (2) OP asks for regex, but assuming the task is tool-agnostic, I'm leaving this here for folks with similar problems. Also, I wrote TexSoup, so take this suggestion with a grain of salt.
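
For readers who do want a regex-only route, the following is a rough sketch using the standard `re` module, valid only under the question's assumption of flat, non-nested commands. The pattern and the helper name `split_commands` are assumptions made for illustration. The idea is to replace the greedy `.+` and `.*` with negated character classes, so that each optional `[...]` or `{...}` stops at its first closing delimiter instead of running to the last `}` on the line.

```python
import re

# Hypothetical pattern: a command name made of letters, followed by an
# optional [option] and an optional {argument}. The negated character
# classes keep each argument from running past its own closing delimiter.
COMMAND = re.compile(r'(\\[A-Za-z]+(?:\[[^\]]*\])?(?:\{[^{}]*\})?)')


def split_commands(line):
    """Split a line into commands and the plain text between them."""
    parts = COMMAND.split(line)
    # re.split keeps the captured commands; drop empty or whitespace-only
    # pieces and trim the surrounding spaces from the rest.
    return [p.strip() for p in parts if p.strip()]


dstring = r"\documentclass[this is an option]{this is a text} this is other text ... \emph{tt}"
print(split_commands(dstring))
# prints ['\\documentclass[this is an option]{this is a text}', 'this is other text ...', '\\emph{tt}']
```

This covers the four forms listed in the question, but it will split incorrectly as soon as an argument contains another braced command, which is where a real parser becomes necessary.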

Alvin Wan