I am having a hard time figuring out how to perform a regex substitution to clean up some text in a LaTeX file. The LaTeX file looks like
\chapter{\texorpdfstring{{II} {The Chapter
Title}}{II The Chapter Title}}
Annoyingly, this is a multi-line chapter declaration, and the line break can occur virtually anywhere, so I can't use the common <> idiom to read the file line by line and apply a straightforward regular expression.
Instead, I am trying this:
#!/usr/bin/perl -i.old # In-place edit, backup as '.old'
use strict;
use warnings;
use Path::Tiny;
my $filename = shift or die "Usage: $0 FILENAME";
my $content = path($filename)->slurp_utf8;
$content =~ s|\\chapter\{.*\{[IVXLCDM]*\s*(.*)\}\}|\\chapter{$1}|gms;
path($filename)->spew_utf8($content);
However, the regular expression is far too greedy: it begins a match at the first \chapter declaration and ends it at the last chapter declaration. All I want is to
- remove the \texorpdfstring,
- remove the roman numeral,
- remove the multiple appearances of the chapter title,
so that my substitution on
\chapter{\texorpdfstring{{I} {The First
Chapter}}{I The First Chapter}}
It was the best of times.
\chapter{\texorpdfstring{{II} {The Second
Chapter}}{II The Second Chapter}}
It was the worst of times.
results in
\chapter{The First Chapter}
It was the best of times.
\chapter{The Second Chapter}
It was the worst of times.
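To make the greediness concrete, here is a minimal sketch that runs the same pattern against the demo text held in a string instead of a file. The helper name greedy_match and the extra outer capture group are my own additions for inspection only; the pattern itself is unchanged.

```perl
use strict;
use warnings;

# Demo text from above, held in a string rather than read from a file.
my $demo = <<'END';
\chapter{\texorpdfstring{{I} {The First
Chapter}}{I The First Chapter}}
It was the best of times.
\chapter{\texorpdfstring{{II} {The Second
Chapter}}{II The Second Chapter}}
It was the worst of times.
END

# Hypothetical helper: applies the same pattern as the script, wrapped in one
# extra outer group so the full extent of the match can be inspected.
# Returns (whole match, captured title) in list context.
sub greedy_match {
    my ($text) = @_;
    return $text =~ m|(\\chapter\{.*\{[IVXLCDM]*\s*(.*)\}\})|ms;
}

my ($whole, $title) = greedy_match($demo);

# Count how many \chapter declarations ended up inside the single match.
my $declarations = () = $whole =~ /\\chapter/g;

print "chapter declarations inside one match: $declarations\n";
print "captured title: $title\n";
```

Run as-is, this reports 2 declarations inside the single match and "The Second Chapter" as the captured title, which is why only the last chapter survives the substitution.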
What can I do now?
Edit: I changed the demo text.
If I understood @zdim correctly, he wrote the substitution without escaping the braces ({ and }) to make it easier to read. Fair enough. I tried @zdim's solution, but it output:
\chapter{The First
Chapter}
It was the worst of times.
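For reference, one direction that might avoid both problems (the match running across chapters, and the line break staying inside the title) is to match each brace group with a negated character class instead of .* and to fold whitespace in the captured title. This is only a sketch under assumptions I have not verified against the whole file: the roman numeral is always braced, the title contains no nested braces, and line breaks occur only inside the title, inside the plain-text copy, or between brace groups. The names $chapter_re and clean_chapters are made up for illustration.

```perl
use strict;
use warnings;

# Assumed shape of a declaration; [^{}]* cannot cross a brace, so one match
# cannot run from one chapter into the next.
my $chapter_re = qr/
    \\chapter \{ \\texorpdfstring \{    # opening of the declaration
    \{ [IVXLCDM]+ \} \s*                # braced roman numeral
    \{ ( [^{}]* ) \}                    # braced title, possibly with a newline
    \s* \}                              # end of first texorpdfstring argument
    \s* \{ [^{}]* \}                    # second argument, the plain-text copy
    \s* \}                              # end of chapter argument
/x;

# Hypothetical helper: rewrites every matching declaration in $text and
# collapses any run of whitespace in the title to a single space.
sub clean_chapters {
    my ($text) = @_;
    $text =~ s/$chapter_re/ "\\chapter{" . join(' ', split ' ', $1) . "}" /ge;
    return $text;
}

my $demo = <<'END';
\chapter{\texorpdfstring{{I} {The First
Chapter}}{I The First Chapter}}
It was the best of times.
\chapter{\texorpdfstring{{II} {The Second
Chapter}}{II The Second Chapter}}
It was the worst of times.
END

print clean_chapters($demo);
```

In the script above, the same call would sit between slurp_utf8 and spew_utf8. Declarations that do not fit the assumed shape are left untouched rather than partially rewritten.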