
I'm reading a .tex file, replacing text according to a pattern, and saving the result to another .tex file. My left delimiter is

\ket{

and the right delimiter is

}

The regex \\ket\{(.+)\} can match

\ket{0}

but with complex lines such as

$\ket{\bfG \bfP^L_{2ex}}$, and the real space, $\ket{\bfP^L_{2ex}}$

it captures everything up to the last closing brace on the line:

\bfG \bfP^L_{2ex}}$, and the real space, $\ket{\bfP^L_{2ex}

Modifying the regex to

\\ket{([^{}]*|[^}])*}{1,2}

I can handle the complex line above, but in cases such as

reciprocal lattice, $\ket{\bfG \bfP^L_{2ex}{3}{2}}$, and the real space, $\ket{\bfP^L_{2ex}}$

it doesn't work. How can I solve this? What algorithms, topics, books, or tutorials should I read to solve problems like this?

zdim
iaveiga
  • Step 1: Stop using [regexes](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). https://en.wikipedia.org/wiki/Context-free_grammar#Examples – n0rd Sep 15 '17 at 06:51

1 Answer


I suggest reaching for a tool that handles the (complex) problem of balanced/nested delimiters instead of attempting to parse it by hand. Perhaps first look at the core Text::Balanced or at Regexp::Common. See this post for an example of their use, which also comes very close to what you need.
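
As a rough illustration of the balanced-delimiter approach, here is a minimal sketch using extract_bracketed from Text::Balanced. The helper name extract_kets is made up for this example. It relies on the documented list-context behaviour that extraction starts at the current pos of the string and that pos is advanced past the extracted text.

```perl
use strict;
use warnings;
use feature 'say';
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: returns the contents of every \ket{...} in $text,
# with nested braces kept intact.
sub extract_kets {
    my ($text) = @_;
    my @kets;
    # Find each \ket that is followed by {; the lookahead leaves pos($text)
    # on the opening brace, which is where extract_bracketed starts.
    while ($text =~ /\\ket(?=\{)/g) {
        my ($group) = extract_bracketed($text, '{}');
        next unless defined $group;          # unbalanced braces: skip
        push @kets, substr($group, 1, -1);   # drop the outer { and }
    }
    return @kets;
}

my $line = q(reciprocal lattice, $\ket{\bfG \bfP^L_{2ex}{3}{2}}$, and the real space, $\ket{\bfP^L_{2ex}}$);
say for extract_kets($line);
```

For the harder line from the question this should print \bfG \bfP^L_{2ex}{3}{2} and \bfP^L_{2ex}, each on its own line. It does not depend on the surrounding $...$, so it would also apply to display math.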


In this case you may be able to sidestep the problem by making use of a specific property of your string.

If this formula is always inline, that is, between $...$, then those $'s solve the problem:

use warnings;
use strict;
use feature 'say';

my $line = q( 
   $\ket{\bfG \bfP^L_{2ex}}$, and the real space, $\ket{\bfP^L_{2ex}}$ 
);

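# The literal braces are escaped: newer perls may warn about, or reject,
# an unescaped { in a pattern
my @kets = $line =~ m| \$\\ket\{ (.+?) \}\s*\$ |gx;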

say for @kets;

This prints

\bfG \bfP^L_{2ex}
\bfP^L_{2ex}

This is easy since the text you need is simply between the literal $\ket{ and the next }$; what's inside doesn't matter, so there is no problem with nested delimiters.

The .+? matches as few characters as possible up to the following pattern, here }$ (with optional spaces, \s*, just in case). The $ and \ need to be escaped. The /x modifier allows spaces in the pattern for readability.
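
To make the role of the non-greedy quantifier concrete, here is a small sketch that runs the greedy and the non-greedy version of the same pattern on the harder line from the question. The variable names are only for illustration, and it still assumes that every \ket{...} is closed by }$.

```perl
use strict;
use warnings;
use feature 'say';

my $line = q(reciprocal lattice, $\ket{\bfG \bfP^L_{2ex}{3}{2}}$, and the real space, $\ket{\bfP^L_{2ex}}$);

# Greedy: .+ runs on to the last }$ on the line, so both kets
# end up inside a single capture
my @greedy = $line =~ m/ \$\\ket\{ (.+)  \}\s*\$ /gx;

# Non-greedy: .+? stops at the first }$ after each $\ket{
my @lazy   = $line =~ m/ \$\\ket\{ (.+?) \}\s*\$ /gx;

say "greedy: $_" for @greedy;
say "lazy:   $_" for @lazy;
```

The greedy pattern yields a single capture spanning both formulas, while the non-greedy one yields the two kets separately, including the one with {3}{2} inside.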

zdim