
I just started writing a C program that converts some LaTeX into HTML. In my opinion the best way is to use regular expressions, yet I cannot make this simple idea work with PCRE: replace something like \term{abc} with [pre]abc[/pre] (\term is a LaTeX command of my own). Here's the catch:

  1. How do I handle escaped curly braces (\}) in \term?
  2. How do I handle pairs like {}?
  3. How do I limit the greediness of the regular expression so that it matches only the first of many \term commands rather than all of them at once?

That's a lot of questions to figure out. I hope somebody can help.

PS: I'm sorry if I have overlooked an answer to a similar question.

Tim
smiter
    These are really three separate questions. You will likely get better responses if you break this up. – Tim Jan 18 '12 at 20:31

2 Answers


See perlfaq6(1), "Can I use Perl regular expressions to match balanced text?". That said, since LaTeX's complexity seems similar to (if not worse than) that of (X)HTML, you might want to heed the words of "RegEx match open tags except XHTML self-contained tags".
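
For the narrow case in the question (a single \term{...} command with escaped and nested braces), a small hand-written scanner may be easier to get right than a regular expression. The following is only a minimal sketch, not a LaTeX parser and not a PCRE solution: the names convert_terms and find_close are made up for illustration, and it assumes that \term is always followed directly by {, that a backslash escapes exactly the next character, and that an unterminated \term{ should be copied through unchanged.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: starting just after an opening '{', return the index
 * of the matching '}', or strlen(s) if there is none. A backslash skips the
 * next character (so "\}" and "\{" do not count), and nested {} pairs are
 * tracked with a depth counter. */
static size_t find_close(const char *s, size_t i)
{
    int depth = 1;
    for (; s[i] != '\0'; i++) {
        if (s[i] == '\\' && s[i + 1] != '\0') {
            i++;                          /* skip the escaped character */
        } else if (s[i] == '{') {
            depth++;
        } else if (s[i] == '}' && --depth == 0) {
            return i;
        }
    }
    return i;                             /* no matching brace found */
}

/* Hypothetical helper: return a newly allocated copy of `in` in which every
 * \term{...} is replaced by [pre]...[/pre]. The caller frees the result.
 * Returns NULL if the allocation fails. */
char *convert_terms(const char *in)
{
    static const char cmd[] = "\\term{";
    static const char pre_open[] = "[pre]";
    static const char pre_close[] = "[/pre]";
    const size_t cmdlen = sizeof cmd - 1;

    assert(in != NULL);
    size_t n = strlen(in);

    /* Each replacement turns 7 delimiter characters into 11, so the output
     * never exceeds twice the input length. */
    char *out = malloc(2 * n + 1);
    if (out == NULL)
        return NULL;

    size_t i = 0, o = 0;
    while (i < n) {
        if (strncmp(in + i, cmd, cmdlen) == 0) {
            size_t start = i + cmdlen;
            size_t end = find_close(in, start);
            if (end < n) {
                /* Matching brace found: emit [pre], the argument, [/pre]. */
                memcpy(out + o, pre_open, sizeof pre_open - 1);
                o += sizeof pre_open - 1;
                memcpy(out + o, in + start, end - start);
                o += end - start;
                memcpy(out + o, pre_close, sizeof pre_close - 1);
                o += sizeof pre_close - 1;
                i = end + 1;
                continue;
            }
        }
        out[o++] = in[i++];               /* ordinary character: copy it */
    }
    out[o] = '\0';
    return out;
}
```

Because each \term{ is closed at its own matching brace, several commands on one line are converted one at a time rather than swallowed as a single match. The argument is copied verbatim, so an escaped brace such as \} stays escaped in the output, and a nested \term inside the argument is not converted again.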

Community
jørgensen
  • I knew that answer would get a mention. I could smell Cthulhu. – Tim Jan 18 '12 at 21:14
  • Sigh, I was kinda hoping to avoid writing a "real" LaTeX parser and be able to work with PCRE instead. Seems my gut feeling was right in the first place... – smiter Jan 19 '12 at 09:22

I don't know exactly what you need, but you might consider htlatex (part of TeX4ht), pandoc, or any of several other options. TeX is notoriously hard to parse.

Ivan Andrus