How can I use regex with sed (or equivalent unix command line tool) to fix title case in LaTeX headings?

Question

regular expression attempt

(\\section\{|\\subsection\{|\\subsubsection\{|\\paragraph[^{]*\{)(\w)\w*([ |\}]*)

search text

\section{intro to installation of apps}
\subsection{another heading for \myformatting{special}}
\subsubsection{good morning, San Francisco}
\paragraph{installation of backend services}

desired output

The first letter of every word is capitalized, except for prepositions, conjunctions, and the other parts of speech that are usually left in lower case in titles.

I suppose I should really narrow this down, so let me borrow from the U.S. Government Printing Office Style Manual:

The articles a, an, and the; the prepositions at, by, for, in, of, on, to, and up; the conjunctions and, as, but, if, or, and nor; and the second element of a compound numeral are not capitalized.

Page 41

\subsection{Installation guide for the server-side app \myapp{webgen}}

changes to

\subsection{Installation Guide for the Server-side App \myapp{Webgen}}

OR

\subsection{Installation Guide for the Server-side App \myapp{webgen}}

How would you name this type of string modification?

Applying REGEX to a string between strings?
Applying REGEX to a part of a string when that part falls between two other strings of characters?
Applying REGEX to a substring that occurs between two other substrings within a string?
<something else>

problem

I match each LaTeX heading command, including the `{`. This means that my expression matches no more than the first word of the actual heading text. I cannot wrap the whole heading pattern in an "OR space" alternative, because then I would match nearly every word in the document. I also have to be careful with formatting commands inside the headings themselves.
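
One way around this is to stop asking a single regex to do everything: select the heading lines first, then work only on the text between the outer braces. Below is a minimal sketch in awk, wrapped in a shell function. The name `heading_words` is made up for illustration, and the sketch assumes one heading per line, starting at the beginning of the line.

```shell
# Hypothetical helper: print each word of a heading argument on its own line
# and ignore every line that is not a heading.
heading_words() {
  awk '
    /^\\(section|subsection|subsubsection|paragraph)[{]/ {
      text = $0
      sub(/^[^{]*[{]/, "", text)   # drop the command and its opening brace
      sub(/[}][^}]*$/, "", text)   # drop the last closing brace and what follows
      n = split(text, words, " ")  # a single space means: split on whitespace
      for (i = 1; i <= n; i++) print words[i]
    }
  '
}

# The body line is skipped; only the words inside the heading are printed.
printf '%s\n' 'Some body text about installation.' \
              '\section{intro to installation of apps}' | heading_words
```

Because the word loop only ever sees the heading text, it cannot touch the rest of the document, which is the part a plain search-and-replace pattern struggles with.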

other helpful related questions

@MarcusMüllerꕺꕺ I appreciate your enthusiasm and quick replies. I do not see how parsing my `.toc` file relates to the question, however. — Jonathan Komar, Aug 09 '15 at 23:07
much much simpler to parse, imho, as a simplification; I'm not convinced that'll help you, on a second read; I read *how to ensure* as *how do I test*, but you probably want *how do I automatically fix*. — Marcus Müller, Aug 10 '15 at 15:53
See the Perl module [`Lingua::EN::Titlecase`](https://metacpan.org/pod/Lingua::EN::Titlecase) — Håkon Hægland, Aug 11 '15 at 21:32
How do you want to handle the second line, `\subsection{another heading for \myformatting{special}}`? Like this: `\subsection{Another Heading for \myformatting{Special}}`? — Håkon Hægland, Aug 11 '15 at 22:11
@HåkonHægland Yes, exactly! It would be nice to be able to toggle that, but I can't ask for too much at once :) (And doing this in Perl would be soooo awesome! Although I am just getting started with Perl, I've read several times that it is like `sed` and `awk` on steroids.) — Jonathan Komar, Aug 12 '15 at 06:03

score 2 · Accepted Answer · answered Aug 10 '15 at 01:20

So it seems to me as if you need to implement pseudo-code like this:

Are we on the first word? If yes, capitalize it and move on.
Is the current word "reserved"? If yes, lower it and move on.
Is the current word a numeral? If yes, lower it and move on.
Otherwise, capitalize the word and move on. (Lines that are not headings are printed verbatim.)

One other helpful rule might be to leave fully upper-case words as they are, just in case they're acronyms.

The following awk script might do what you need.

#!/usr/bin/awk -f

function toformal(subject) {
  return toupper(substr(subject,1,1)) tolower(substr(subject,2))
}

BEGIN {
  # Reserved word list gets split into an array for easy matching.
  reserved="a an the at by for in of on to up and as but if or nor";
  split(reserved,a_reserved," "); for(i in a_reserved) r[a_reserved[i]]=1;
  # Same with the list of compound numerals. If this isn't what you mean, say so.
  numerals="hundred thousand million billion";
  split(numerals,a_numerals," "); for(i in a_numerals) n[a_numerals[i]]=1;
}

# This awk condition matches the lines we're interested in modifying.
/^\\(section|subsection|subsubsection|paragraph)[{]/ {

  # Separate the section name and the text up to the last closing brace, then split text to an array.
  section=$0; sub(/\\/,"",section); sub(/[{].*/,"",section);
  text=$0; sub(/^[^{]*[{]/,"",text); sub(/[}][^}]*$/,"",text);
  size=split(text,atext,/[[:space:]]/);

  # First word...
  newtext=toformal(atext[1]);

  for(i=2; i<=size; i++) {
    # Reserved word...
    if (r[tolower(atext[i])]) { newtext=newtext " " tolower(atext[i]); continue; }
    # Compound numerals...
    if (n[tolower(atext[i])]) { newtext=newtext " " tolower(atext[i]); continue; }
#    # Acronyms maybe...
#    if (atext[i] == toupper(atext[i])) { newtext=newtext " " atext[i]; continue; }
    # Everything else...
    newtext=newtext " " toformal(atext[i]);
  }

  print "\\" section "{" newtext "}";  # put the command and braces back around the new text
  next;

}

# Print the line if we get this far. This is a non-condition with
# a print-only statement.
1
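
The commented-out acronym check in the script treats any word that equals its own upper-case form as an acronym, which would also be true of numbers and bare punctuation. Here is a sketch of a slightly stricter test, kept separate from the script above. The function name `cap_words` is hypothetical, and the sketch assumes an acronym is a word that contains at least one capital letter and no lower-case ones. It deliberately ignores the reserved-word list.

```shell
# Hypothetical helper: capitalize every word, but leave acronyms alone.
cap_words() {
  awk '
    {
      out = ""
      for (i = 1; i <= NF; i++) {
        w = $i
        # Assumption: has a capital letter and is unchanged by toupper => acronym.
        is_acronym = (w ~ /[A-Z]/ && w == toupper(w))
        if (!is_acronym) w = toupper(substr(w, 1, 1)) tolower(substr(w, 2))
        sep = (i > 1) ? " " : ""
        out = out sep w
      }
      print out
    }
  '
}

printf '%s\n' 'installing the HTTP server' | cap_words
# prints: Installing The HTTP Server
```

The same two-part condition could replace the commented-out line in the script if this definition of an acronym suits your headings.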

Wow, thanks for the time and effort. It will take me some time to understand all of the steps. I am not sure I understand line 45 (currently the `1` as of writing this comment) or the `BEGIN {` function. — Jonathan Komar, Aug 10 '15 at 06:58
No worries, I had most of this written for something else I was doing, I just needed to adapt it to your word list. :) As for your question, Awk uses constructs of `condition { statement; ... }`. If no condition is provided, `true` is assumed. If no statement is provided, `{ print; }` is assumed. A `1` evaluates as `true` for awk, so a 1 by itself on the line basically means "print every line". Of course, it only gets run if the *previous* condition failed to match, as its last statement is `next;`. — ghoti, Aug 10 '15 at 07:18
And the BEGIN block contains statements that are run prior to any input lines being matched. It gets used to set up variables that will be needed throughout the rest of the script. You can `man awk` (or ask Google) for additional documentation about how awk works. — ghoti, Aug 10 '15 at 07:20
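
The three pieces described in these comments (a `BEGIN` block, a condition that ends in `next`, and a bare `1`) fit into a very small program. This is a sketch with made-up input, and the function name `mark_headings` is only for illustration:

```shell
# Hypothetical demo of the awk program structure discussed above.
mark_headings() {
  awk '
    BEGIN { count = 0 }               # runs once, before any input is read
    /^\\section[{]/ {                 # condition: only lines starting a section
      count++
      print "heading " count ": " $0
      next                            # skip the remaining rules for this line
    }
    1                                 # always true, no action: print the line
  '
}

printf '%s\n' 'keep me' '\section{change me}' 'keep me too' | mark_headings
# prints:
#   keep me
#   heading 1: \section{change me}
#   keep me too
```

Removing the `next` would print the heading line twice: once from the explicit `print` and once from the `1` rule.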

score 1 · Answer 2 · answered Aug 12 '15 at 09:11

Here is an example of how you could do it in Perl using the module `Lingua::EN::Titlecase` and recursive regular expressions:

use strict;
use warnings;

use Lingua::EN::Titlecase;

my $tc = Lingua::EN::Titlecase->new();
my $data = do {local $/; <> };

my ($kw_regex) = map { qr/$_/ }
  join '|', qw(section subsection subsubsection paragraph);
$data =~ s/(\\(?: $kw_regex))(\{(?:[^{}]++|(?2))*\})/title_case($tc,$1,$2)/gex;
print $data;

sub title_case {
    my ($tc, $p1, $p2) = @_;

    $p2 =~ s/^\{//;
    $p2 =~ s/\}$//;
    if ($p2 =~ /\\/ ) {
        while ($p2 =~ /\G(.*?)(\\.*?)(\{(?:[^{}]++|(?3))*\})/ ) {
            my $next_pos = $+[0];
            substr($p2, $-[1], $+[1] -$-[1], $tc->title($1));
            substr($p2, $-[3], $+[3] -$-[3], title_case($tc,'',$3));
            pos($p2) = $next_pos;
        }
        $p2 =~ s/\G(.+)$/$tc->title($1)/e;
    }
    else {
        $p2 = $tc->title($p2);
    }
    return $p1 . '{' . $p2 . '}';
}
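
For comparison, the balanced-brace matching that the recursive pattern `(\{(?:[^{}]++|(?2))*\})` performs can be approximated in awk, which has no recursive patterns, with a depth counter. This is a different technique from the Perl code above, shown only as a sketch. The function name `outer_arg` is hypothetical, and it assumes the whole heading sits on one line.

```shell
# Hypothetical helper: print the contents of the first balanced {...} group.
outer_arg() {
  awk '
    {
      start = index($0, "{")
      if (start == 0) next            # no brace on this line
      depth = 0
      for (i = start; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c == "{") depth++
        else if (c == "}" && --depth == 0) {
          # i is the brace that closes the group opened at start
          print substr($0, start + 1, i - start - 1)
          next
        }
      }
    }
  '
}

printf '%s\n' '\subsection{another heading for \myformatting{special}} % note' | outer_arg
# prints: another heading for \myformatting{special}
```

Lines with unbalanced braces print nothing, so a real script would need to decide whether to pass such lines through unchanged.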