Perl Regular Expression for extracting multi-line LaTeX chapter name

Question

I am having a hard time figuring out how to perform a regex substitution to clean up some text in a LaTeX file. The LaTeX file looks like

\chapter{\texorpdfstring{{II} {The Chapter 
Title}}{II The Chapter Title}}

Annoyingly, this is a multi-line chapter declaration, and the newline can occur virtually anywhere. So I can't use the common <> idiom to read the file line by line and apply a straightforward regular expression.

Instead, I am trying this:

#!/usr/bin/perl -i.old     # In-place edit, backup as '.old'
use strict;
use warnings;

use Path::Tiny;

my $filename = shift or die "Usage: $0 FILENAME";
my $content = path($filename)->slurp_utf8;

$content =~ s|\\chapter\{.*\{[IVXLCDM]*\s*(.*)\}\}|\\chapter{$1}|gms;
path($filename)->spew_utf8($content);

However, the regular expression is far too greedy: the match begins at the first \chapter declaration and ends at the last one. All I want is to

remove the \texorpdfstring,
remove the roman numeral, and
remove the repeated copy of the chapter title,

so that my substitution on

\chapter{\texorpdfstring{{I} {The First 
Chapter}}{I The First Chapter}}

It was the best of times.

\chapter{\texorpdfstring{{II} {The Second 
Chapter}}{II The Second Chapter}}

It was the worst of times.

results in

\chapter{The First Chapter}

It was the best of times.

\chapter{The Second Chapter}

It was the worst of times.

What can I do now?

Edit: I changed the demo text.

If I understood @zdim correctly, he wrote the substitution without escaping the braces {} to make it easier to validate. Fair enough. I tried @zdim's solution, but the output was:

\chapter{The First
Chapter}

It was the worst of times.

Can I recommend you click on the [regex] tag, then Learn More… https://stackoverflow.com/tags/regex/info then go to one of the sandboxes, and try out your sample input with a regular expression you iteratively work on. — dlamblin, Jan 30 '18 at 05:28
The last, added part, exposes a careless bug in my answer, whereby the trailing `.*` matched (and removed!) all the rest. Sorry about that (please leave a comment and tag the user when something like this happens; I accidentally noticed the addition to the question). I edited the answer and fixed that. — zdim, Feb 19 '18 at 04:32

zdim · Accepted Answer · 2018-02-19T04:29:35.293

If the input can only contain the pairs of {...} shown, then

s/\\chapter{\\texorpdfstring{{ .*? }\s*{ (.*?) }}\s*{.*?}}/\\chapter{$1}/gsx;

or

s/(\\chapter){\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/${1}{$2}/gs;

where ${1} (rather than $1) is needed for syntax: written as $1{..., it would be parsed as a lookup of an element of the hash %1.
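As a small illustration of the ${1} point, here is a sketch on a made-up one-line input (the string and the names $line and $out are assumptions for the demo). The braces are escaped here, unlike in the patterns above, because newer Perl versions may warn about or reject some unescaped literal { characters in a pattern.

```perl
use strict;
use warnings;

# Made-up single-line input, only for this demo.
my $line = '\chapter{\texorpdfstring{{I} {Intro}}{I Intro}}';

# Copy first, then substitute on the copy.
# ${1} ends the variable name before the literal '{' that follows it.
(my $out = $line)
    =~ s/(\\chapter)\{\\texorpdfstring\{\{.*?\}\s*\{(.*?)\}\}\s*\{.*?\}\}/${1}{$2}/gs;

print "$out\n";
```

This should print the chapter line with only the title left inside the braces.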

Or, rather

s/\\chapter\K{\s*\\texorpdfstring{{.*?}\s*{(.*?)}}\s*{.*?}}/{$1}/gs

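where \K, which works like a lookbehind, keeps everything matched before it out of the text being replaced, so \chapter stays in place. I still match the opening { after \K and retype it in the replacement, which arguably makes the replacement part clearer.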

Please sprinkle this with \s* where there may be spaces.
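Putting the \K variant to work on the sample text from the question, a minimal sketch might look like the following. Two things here are additions and not part of the patterns above: the braces are escaped (newer Perls may complain about unescaped literal { in a pattern), and a hypothetical helper named squeeze folds the line break inside the title into a single space, which is what the desired output in the question shows.

```perl
use strict;
use warnings;

# Sample input from the question; each title is split across two lines.
my $content = <<'END';
\chapter{\texorpdfstring{{I} {The First 
Chapter}}{I The First Chapter}}

It was the best of times.

\chapter{\texorpdfstring{{II} {The Second 
Chapter}}{II The Second Chapter}}

It was the worst of times.
END

# Hypothetical helper: collapse any whitespace run (newline included) to one space.
sub squeeze { my ($s) = @_; $s =~ s/\s+/ /g; return $s }

$content =~ s/
    \\chapter\K              # keep the chapter macro out of the replaced text
    \{ \s* \\texorpdfstring
    \{ \{ .*? \} \s*         # the roman numeral group
       \{ (.*?) \} \}        # the title group, captured
    \s* \{ .*? \} \}         # the plain-text copy
/ '{' . squeeze($1) . '}' /gsex;

print $content;
```

The /e modifier evaluates the replacement as code, so the captured title can be cleaned up before it is put back. In a real script the same substitution could be applied to the slurped file content in place of the heredoc.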

Also note Path::Tiny's edit_utf8 method

path($filename)->edit_utf8( sub { s/.../.../gs } );  # regex as above

which applies the anonymous sub to the whole slurped file (with its content in $_), as opposed to edit_lines, which applies it line by line.

If the braced expressions can be nested more freely (say with {\em ... } and such), a far more systematic approach is needed. See for example Text::Balanced and search for "nested delimiters."
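As one possible sketch of handling nested braces, the following uses Perl's recursive patterns (available since 5.10) instead of Text::Balanced; it is a swapped-in technique, not the module's API. The group names braced and title, the helpers tidy and clean_chapters, and the sample with \emph are all assumptions made for the illustration.

```perl
use strict;
use warnings;

# (?&braced) matches one balanced {...} group, however deeply it nests.
my $chapter_re = qr/
    \\chapter \s* \{ \s* \\texorpdfstring \s*
    \{ \s* (?&braced) \s*                              # numeral group
       \{ (?<title> (?: [^{}]++ | (?&braced) )* ) \}   # title, may nest
    \s* \}
    \s* (?&braced) \s* \}                              # plain-text copy
    (?(DEFINE) (?<braced> \{ (?: [^{}]++ | (?&braced) )* \} ) )
/x;

# Hypothetical helper: fold whitespace runs (including newlines) into one space.
sub tidy { my ($s) = @_; $s =~ s/\s+/ /g; return $s }

# Hypothetical helper: rewrite every matching chapter declaration in the text.
sub clean_chapters {
    my ($text) = @_;
    $text =~ s/$chapter_re/ '\\chapter{' . tidy($+{title}) . '}' /ge;
    return $text;
}

# Made-up sample with a nested {...} inside the title.
my $sample = <<'END';
\chapter{\texorpdfstring{{I} {The \emph{First}
Chapter}}{I The \emph{First} Chapter}}

It was the best of times.
END

print clean_chapters($sample);
```

Because the title is matched as a sequence of non-brace text and balanced groups, inner markup such as \emph{...} is kept intact in the replacement.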

Some regex resources

Perl documentation

perlretut, a tutorial
perlrequick, a quick-start introduction
perlre, the full account of syntax
perlreref, a quick reference (its See Also section is useful on its own)

Stack Overflow

Regex info: an entry portal with resources
Reference: What does this regex mean?: a gargantuan list of FAQs with links to SO posts
Learning Regular expressions: an overview with a long list of resources at the end

Regular-Expressions.info
