I am writing a LaTeX-to-text converter, building on a well-known Python parser for LaTeX (python-latex). I have been improving it steadily, but I now have a problem parsing multiple commands on one line. A LaTeX command can take any of the following four forms:
\commandname
\commandname[text]
\commandname{other text}
\commandname[text]{other text}
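To make the four forms concrete, here is a minimal sketch of a pattern that accepts each of them. The name COMMAND is hypothetical, and the pattern rests on assumptions of mine: command names consist of ASCII letters only, and the bracket and brace blocks are not nested.

```python
import re

# Hypothetical pattern (an assumption, not taken from python-latex):
# a backslash, a letters-only command name, then an optional [option]
# block and an optional {argument} block, neither of them nested.
COMMAND = re.compile(r'\\[A-Za-z]+(?:\[[^\]]*\])?(?:\{[^}]*\})?')

samples = [
    r'\commandname',
    r'\commandname[text]',
    r'\commandname{other text}',
    r'\commandname[text]{other text}',
]

for sample in samples:
    # fullmatch succeeds only if the whole string is one command
    print(sample, bool(COMMAND.fullmatch(sample)))
```

The negated character classes `[^\]]` and `[^}]` stop at the first closing delimiter, so a block cannot run on into a later command on the same line.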
Assuming that commands are not split across lines, and that the text may contain spaces (but the command name may not), I ended up with the following regexp to catch a command in a line:
'(\\.+\[*.*\]*\{.*\})'
A sample program works:
import re
string = r"\documentclass[this is an option]{this is a text} this is other text ..."
re.split(r'(\\.+\[*.*\]*\{.*\}|\w+)+?', string)
>>>['', '\\documentclass[this is an option]{this is a text}', ' ', 'this', ' ', 'is', ' ', 'other', ' ', 'text', ' ...']
Well, to be honest, I would prefer an output like this:
>>> [ '\\documentclass[this is an option]{this is a text}', 'this is other text ...' ]
But the first form would also work. My problem arises when a line contains more than one command, as in the following example:
dstring = string + r" \emph{tt}"
print(dstring)
\documentclass[this is an option]{this is a text} this is other text ... \emph{tt}
re.split(r'(\\.+\[*.*\]*\{.*\}|\w+)+?', dstring)
['', '\\documentclass[this is an option]{this is a text} this is other text ... \\emph{tt}', '']
As you can see, the result is quite different from the one that I would like:
[ '\\documentclass[this is an option]{this is a text}', 'this is other text ...', '\\emph{tt}']
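One possible way to get a list of that shape, sketched under my own assumptions rather than as a definitive solution: replace the greedy `.+` and `.*` with negated character classes so that each match stops at its own closing bracket or brace. The names COMMAND and split_line are hypothetical, command names are assumed to be ASCII letters only, and nested braces are not handled.

```python
import re

# Hypothetical pattern: one capturing group around the whole command so
# that re.split keeps the commands in its result. The inner groups are
# non-capturing, so they do not add extra items to the list.
COMMAND = re.compile(r'(\\[A-Za-z]+(?:\[[^\]]*\])?(?:\{[^}]*\})?)')

def split_line(line):
    """Split a line into commands and the plain text between them."""
    parts = COMMAND.split(line)
    # Drop empty or whitespace-only pieces and trim the surrounding spaces
    return [part.strip() for part in parts if part.strip()]

dstring = r'\documentclass[this is an option]{this is a text} this is other text ... \emph{tt}'
print(split_line(dstring))
```

Because `[^}]*` cannot cross a closing brace, the first match ends at `{this is a text}` instead of extending to the last `}` on the line, which is what the greedy version does.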
I have tried lookahead and lookbehind assertions, but since lookbehind requires a fixed-width pattern, I could not make them work here. I hope there is a solution.
Thank you!