I am having a hard time figuring out how to perform a regex substitution to clean up some text in a LaTeX file. The LaTeX file looks like

\chapter{\texorpdfstring{{II} {The Chapter 
Title}}{II The Chapter Title}}

Annoyingly, this is a multi-line chapter declaration, and the newline can occur virtually anywhere. So I can't use the common <> idiom to read the file line by line and apply a straightforward regular expression to each line.

Instead, I am trying this:

#!/usr/bin/perl -i.old     # In-place edit, backup as '.old'
use strict;
use warnings;

use Path::Tiny;

my $filename = shift or die "Usage: $0 FILENAME";
my $content = path($filename)->slurp_utf8;

$content =~ s|\\chapter\{.*\{[IVXLCDM]*\s*(.*)\}\}|\\chapter{$1}|gms;
path($filename)->spew_utf8($content);

However, the regular expression is far too greedy, and begins a match at the first \chapter declaration and ends it at the last chapter declaration. All I want is to

  1. remove the \texorpdfstring.
  2. remove the roman numeral
  3. remove the multiple appearances of the chapter title

so that my substitution on

\chapter{\texorpdfstring{{I} {The First 
Chapter}}{I The First Chapter}}

It was the best of times.

\chapter{\texorpdfstring{{II} {The Second 
Chapter}}{II The Second Chapter}}

It was the worst of times.

results in

\chapter{The First Chapter}

It was the best of times.

\chapter{The Second Chapter}

It was the worst of times.

What can I do now?

Edit: I changed the demo text.


If I understood @zdim correctly, he wrote the substitution without escaping the braces ({ and }) to make it easier to validate. Fair enough. I tried @zdim's solution, but it output:

\chapter{The First
Chapter}

It was the worst of times.
nomen
  • Can I recommend you click on the [regex] tag, then Learn More… https://stackoverflow.com/tags/regex/info then go to one of the sandboxes, and try out your sample input with a regular expression you iteratively work on. – dlamblin Jan 30 '18 at 05:28
  • @dlamblin: thank you, I am trying one now. – nomen Jan 30 '18 at 17:11
  • The last, added part, exposes a careless bug in my answer, whereby the trailing `.*` matched (and removed!) all the rest. Sorry about that (please leave a comment and tag the user when something like this happens; I accidentally noticed the addition to the question). I edited the answer and fixed that. – zdim Feb 19 '18 at 04:32

1 Answer

If the only braces that can appear are the pairs of {...} shown in the question:

s/\\chapter{\\texorpdfstring{{ .*? }\s*{ (.*?) }}\s*{.*?}}/\\chapter{$1}/gsx;

or

s/(\\chapter){\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/${1}{$2}/gs;

where ${1} (instead of $1) is needed for syntax, as $1{... would be interpreted as an element of the hash %1.

Or, rather

s/\\chapter\K{\s*\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/{$1}/gs

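where \K, which acts like a lookbehind, keeps everything matched before it out of the part that gets replaced, so \chapter stays in place and need not be retyped. I still match the { after \K and retype it in the replacement, for a possibly clearer replacement part.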

Please sprinkle this with \s* where there may be spaces.
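As a quick check, here is a minimal sketch that runs the \K variant on the demo text from the question. Two things differ from the one-liners above and are my own assumptions rather than part of the original answer: the literal braces are escaped (some Perl versions warn about or reject unescaped left braces in a pattern), and the replacement uses /e with split/join to collapse the newline inside the captured title into a single space, since the desired output has the title on one line.

```perl
use strict;
use warnings;

# Demo text from the question, with a line break inside each title.
my $content = <<'END';
\chapter{\texorpdfstring{{I} {The First
Chapter}}{I The First Chapter}}

It was the best of times.

\chapter{\texorpdfstring{{II} {The Second
Chapter}}{II The Second Chapter}}

It was the worst of times.
END

# Same structure as the \K regex above, with braces escaped.
# The replacement is Perl code (/e): it splits the captured title on
# any whitespace and rejoins the words with single spaces.
$content =~ s!\\chapter\K\{\s*\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}!
    '{' . join(' ', split ' ', $1) . '}'
!gse;

print $content;
```

The non-greedy .*? is what keeps each match inside a single \chapter declaration; /s lets . match the embedded newline.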

Also note the Path::Tiny::edit_utf8

path($filename)->edit_utf8( sub { s/.../.../gs } );  # regex as above

which applies the anonymous sub to the whole slurped file (available in $_), as opposed to edit_lines, which applies it line by line.
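A small sketch of how that call might look in a complete script. The temporary file and its contents are hypothetical stand-ins for the real path($filename), and the braces in the pattern are escaped here as a precaution; the regex is otherwise the \K variant from above.

```perl
use strict;
use warnings;
use Path::Tiny;

# Hypothetical demo file; in real use this would be path($filename).
my $file = Path::Tiny->tempfile;
$file->spew_utf8(
    "\\chapter{\\texorpdfstring{{I} {The First\nChapter}}{I The First Chapter}}\n"
);

# The callback sees the whole file content in $_ and edits it in place.
$file->edit_utf8( sub {
    s!\\chapter\K\{\s*\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}!{$1}!gs;
} );

print $file->slurp_utf8;
```

Note that this plain replacement keeps the line break inside the title as it was in the file.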

If the braced expressions can be nested more freely (say with {\em ... } and such), a far more systematic approach is needed. See for example Text::Balanced and search for "nested delimiters."
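To give an idea of what that looks like, here is a minimal sketch using extract_bracketed from Text::Balanced to pull out the whole balanced argument of \chapter. The input line, with an extra \emph{...} group in the title, is a made-up example, and this only shows the extraction step, not a full replacement.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical input with an additional nested group in the title.
my $text = '\chapter{\texorpdfstring{{II} {The \emph{Second} Chapter}}{II The Second Chapter}} It was the worst of times.';

# Drop the \chapter prefix so the string starts at the opening brace.
(my $after = $text) =~ s/^\\chapter//;

# In list context this returns the balanced {...} group and the remainder.
my ($group, $rest) = extract_bracketed($after, '{}');

print "group: $group\n";
print "rest:$rest\n";
```

Because the extraction counts nested braces, the group ends at the brace that actually closes \chapter{, however deeply the title is nested.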


Some regex resources

Perl documentation

Stackoverflow

Regular-Expressions.info

zdim