Regular Expression for Titles

Question

I'm new to regular expressions and intrigued by their power. I'd like to come up with a regular expression that applies title case conventions. In general, each word in English titles of books, films, and other works takes an initial capital, except for articles ("a", "an", "the"), the word "to" as part of an infinitive, and prepositions and coordinating conjunctions shorter than five letters ("in", "on", "from", "and", "with"), unless they begin or end a title or subtitle.

Having said that, what I want to do is essentially capitalize the first letter of every word in a string (title) except for the words:

a
an
the
to
in
on
from
and
with

These words would have the first letter capitalized when they are the first or last word; otherwise they would be all lower case.

Examples:
   A New Kind of Science     (uppercase A - first word)
   Once in a While           (lowercase a - not first/last word)
   The World We Live In      (uppercase The - first word)
   Ghost in the Shell        (lowercase the - not first/last word)
   To Be or Not to Be        (uppercase and lowercase To, to)

Ideally, Roman numerals (1-5000) would be fully capitalized:

I, II, III, ... (ones)
IV, V, VI, ...  (fives)
IX, X, XI, ...  (tens)
XL, L, LX, ...  (fifties)
XC, C, CX, ...  (hundreds)
CD, D, DC, ...  (five hundreds)
CM, M, MC, ...  (thousands)

For all permutations see: Roman Numerals

Any suggestions on where to start?

I don't think a pure regex solution is the way to go here. Regex is powerful, but it is not the answer for everything. — Tim Biegeleisen, Mar 21 '16 at 01:04
This would involve some amount of programming too. I would suggest choosing Python for the programming, because its regex flavor has syntax for converting to upper case. — , Mar 21 '16 at 01:11
In Notepad++ you could do a search for `\b((?<!^)(?:a|an|the|to|in|on|from|and|with))\b|\b(\w)(\w*)\b` and replace with `\L$1\U$2\L$3`. This doesn't involve roman numerals yet, but I have to go to sleep now - but maybe someone could work further on that. — Sebastian Proske, Mar 21 '16 at 01:17
Regex doesn't "do" anything. It matches strings. It can't change letter case for example. It can't *change* anything. — Bohemian, Mar 21 '16 at 09:27
@Bohemian if you count replacement patterns as part of regex, it is capable of making a lot of changes, such as adding characters or leaving out parts of the match. In some flavours it is even possible to do character case conversions. — Sebastian Proske, Mar 21 '16 at 23:55
@seb you can *move* chars around, you can insert or delete (by replacing with nothing), but you can't *change* text (with a regex replace operation). There are some tools (like fancy text editors) that have their own custom extensions that can change case etc, but most app languages don't have these. — Bohemian, Mar 22 '16 at 00:09

score 1 · Answer 1 · answered Mar 21 '16 at 01:18

Regex is powerful, true. But in this case you would end up with a page of regex to define all these rules, which is not practical.

But I have another idea, if you don't mind.

Define an array for stopwords (a, an, the, to etc)
Define an array for Roman numerals.
Split each title into words.
For each title, iterate over the words and check whether each word is a stopword or a Roman numeral.
If it is a stopword (and not the first or last word), lowercase it; if it is a numeral, uppercase it; otherwise uppercase the first letter and lowercase the rest.
Concatenate processed words to get the finalized title.

Less than 50 lines of Java code would do the job.
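
The steps above can be sketched as follows, in Python rather than Java for brevity. This is only an illustration under a few assumptions: the names `STOPWORDS`, `to_roman`, `ROMAN_NUMERALS` and `title_case` are made up for this example, the first/last-word exception is taken from the question, and numerals are limited to 1-4999, since standard notation has no symbol for 5000.

```python
STOPWORDS = {"a", "an", "the", "to", "in", "on", "from", "and", "with"}

# Value/symbol pairs in descending order, including the subtractive forms.
_PAIRS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(n):
    """Convert an integer in 1..4999 to an uppercase Roman numeral."""
    parts = []
    for value, symbol in _PAIRS:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


# The "array of Roman numerals" from the steps above, built once as a set.
ROMAN_NUMERALS = {to_roman(n) for n in range(1, 5000)}


def title_case(title):
    """Apply the stopword / Roman numeral / capitalize rules word by word."""
    words = title.split()
    last = len(words) - 1
    result = []
    for i, word in enumerate(words):
        lower = word.lower()
        if lower in STOPWORDS and 0 < i < last:
            # Stopword in the middle of the title: all lowercase.
            result.append(lower)
        elif word.upper() in ROMAN_NUMERALS:
            result.append(word.upper())
        else:
            result.append(lower[:1].upper() + lower[1:])
    return " ".join(result)
```

One caveat of any dictionary-free approach: ordinary words that happen to be valid numerals, such as "mix" (MIX = 1009), are uppercased as well.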

score 0 · Accepted Answer · edited May 23 '17 at 11:52

Well, first, there is no single "regex": every tool and language has its own implementation of regex. You won't be able to accomplish your task in most regex flavours, as they don't support manipulating the match (converting uppercase to lowercase and vice versa).

However, the Boost regex engine, which is used in Notepad++ (where I tested it) and is available for C++, can do this kind of thing.

So let's start with the matching part:

\b(?<!^)(a(?:nd?)?|the|to|[io]n|from|with)(?!$)\b|\b(?=[ivxlcdm]+\b)(m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3}))(?<=\w)\b|\b(\w)(\w*)\b

You can use this for matching in most regex flavours, provided they support lookahead and lookbehind (JavaScript, at the time of writing, doesn't support lookbehind). In some you have to double the backslashes inside string literals (e.g. Java). You also need to include modifiers for multiline matching (so the anchors ^ and $ match the beginning/end of every line) and case-insensitive matching. Notepad++ enables multiline automatically and has a checkbox for case insensitivity.

I use \b quite often here, as it matches at the start/end of a word, so we only get complete words in our match.

Basically I'm checking for 3 different cases:

One of your keywords that should be lowercase, but not at the start or end of a line (note: checking for "to" only as part of an infinitive is not possible, as a regex cannot interpret the language)
A Roman Numeral
Any other word

So \b(?<!^)(a(?:nd?)?|the|to|[io]n|from|with)(?!$)\b matches one of your keywords, if it's not at the start ((?<!^)) and the end ((?!$)), making use of negative lookahead and lookbehind as well as anchors.
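
To see the two lookarounds at work in isolation, here is a small Python sketch that applies only this first alternative. Python is my choice for illustration, not something the pattern was tested with in the answer, and `KEYWORD` and `inner_keywords` are hypothetical names. The two flags correspond to the multiline and case-insensitive modifiers mentioned above.

```python
import re

# First alternative only: a keyword that is neither at the start
# nor at the end of a line.
KEYWORD = re.compile(
    r"\b(?<!^)(a(?:nd?)?|the|to|[io]n|from|with)(?!$)\b",
    re.IGNORECASE | re.MULTILINE,
)


def inner_keywords(title):
    """Return the keywords that would be lowercased in this title."""
    return KEYWORD.findall(title)


print(inner_keywords("Ghost in the Shell"))    # ['in', 'the']
print(inner_keywords("the world we live in"))  # []
print(inner_keywords("To Be Or Not To Be"))    # ['To']
```

The leading "the" and trailing "in" of the second title are rejected by the lookbehind and the lookahead respectively, so they fall through to the later alternatives.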

\b(?=[ivxlcdm]+\b)(m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3}))(?<=\w)\b matches a Roman numeral. The actual check ((m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3}))) is taken from this answer, so all credit goes to its author. I added a lookahead ((?=[ivxlcdm]+\b)) at the start, to ensure that only letters that can form a Roman numeral follow (this is purely a speed optimization), and a lookbehind ((?<=\w)) at the end, to make sure we don't match an empty string (relevant for words like "ill", which contain only valid letters but aren't actually Roman numerals).

\b(\w)(\w*)\b matches every word that hasn't matched before, putting the first letter in one capturing group and the remaining letters in a second. The split into these groups is needed to convert the first letter to uppercase and the rest to lowercase.
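
As a rough illustration of the numeral check on its own, the core pattern can be tried against single words in Python. This is a sketch with assumed names (`ROMAN`, `is_roman`); `fullmatch` plus an explicit empty-string test stands in for the \b anchors and the lookbehind of the full expression.

```python
import re

# Thousands, hundreds, tens and ones, each with its subtractive forms.
ROMAN = re.compile(
    r"m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})",
    re.IGNORECASE,
)


def is_roman(word):
    """True if the whole word is a Roman numeral in the range 1..4999."""
    # Every part of the pattern is optional, so the empty string would
    # match; reject it explicitly.
    return bool(word) and ROMAN.fullmatch(word) is not None


print(is_roman("xiv"))    # True
print(is_roman("MMXII"))  # True
print(is_roman("ill"))    # False
print(is_roman("iiii"))   # False
```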

The replace is rather simple: \L$1\U$2\U$3\L$4. It makes use of \L and \U, which in Boost regex force the following letters to lowercase or uppercase. $1 is a backreference to the first capturing group, and so on.

So if we have a sample text like:

a NEw kinD of ScIENce
ONCE IN A WHILE
the world we live in
GHOST in the Shell
To Be Or Not To Be
Louis xiv and Edward IV
In Year mmXII we will all die
ILL till we die

We will convert it into

A New Kind Of Science
Once in a While
The World We Live In
Ghost in the Shell
To Be Or Not to Be
Louis XIV and Edward IV
In Year MMXII We Will All Die
Ill Till We Die
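
For engines whose replacement strings have no \L and \U, the case conversion can be done in code instead. The following is a minimal Python sketch, not part of the tested Notepad++ solution: it reuses the pattern unchanged and passes a callback to `re.sub`. The names `TITLE_RE`, `_fix_case` and `title_case` are assumptions made for this example.

```python
import re

TITLE_RE = re.compile(
    r"\b(?<!^)(a(?:nd?)?|the|to|[io]n|from|with)(?!$)\b"
    r"|\b(?=[ivxlcdm]+\b)(m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})"
    r"(?:ix|iv|v?i{0,3}))(?<=\w)\b"
    r"|\b(\w)(\w*)\b",
    re.IGNORECASE | re.MULTILINE,
)


def _fix_case(match):
    """Play the role of the replacement string \\L$1\\U$2\\U$3\\L$4."""
    keyword, numeral, first, rest = match.groups()
    if keyword is not None:
        return keyword.lower()   # group 1: keyword inside a line
    if numeral is not None:
        return numeral.upper()   # group 2: Roman numeral
    return first.upper() + rest.lower()  # groups 3 and 4: any other word


def title_case(text):
    """Title-case every line of text according to the rules above."""
    return TITLE_RE.sub(_fix_case, text)


sample = "\n".join([
    "a NEw kinD of ScIENce",
    "the world we live in",
    "Louis xiv and Edward IV",
    "ILL till we die",
])
print(title_case(sample))
```

Only one alternative can match at a time, so the groups of the other alternatives are `None`, which is what the callback uses to decide which conversion to apply.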

Sebastian - you're a genius. This works perfectly but better than that you explained it. From your explanation I was able to make one other esoteric tweak as well. Thanks for the solution, explanation and insight! — GeekDad66, Mar 21 '16 at 23:47
@GeekDad66 I'm glad it helped. Please note that this solution is very specific to the Boost regex engine and needs actual programming for most others. — Sebastian Proske, Mar 21 '16 at 23:57
Seems like it works for other regex engines. I used 'The Regulator' to test. I did discover a hiccup, though: the text `Isn't that nice` gets garbled due to the `'`. Any thoughts? — GeekDad66, Mar 22 '16 at 00:41
Still getting the hang of regular expressions. Apparently my little tweak didn't work entirely either. I was trying to catch a cute Roman numeral expression, `2.v`, as in two and a half. I can grab the `.v` but lose the `2`. And I still haven't figured out the `'T` garble. — GeekDad66, Mar 22 '16 at 01:06
Finally figured out the `2.v` issue. I appended `|\b([1-9][0-9]{0,3}\.)(?=[ivxlcdm]+\b)(m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3}))\b` but forgot to include the parentheses around the digit portion, thus adding yet another field. Still struggling with the `'T` issue though - I know it has to do with the ending `(\w*)`. — GeekDad66, Mar 22 '16 at 01:17
@GeekDad66 I don't know what you are trying with the `2.v` example; this should already be covered (it would convert to `2.V`). For the `'`, you could use `\b(\w)([\w']*)\b` as the last part, matching `'` into the second (overall fourth) group. Btw, you should use the markdown formatting to format your strings. — Sebastian Proske, Mar 22 '16 at 17:25
You are right once again - works great. BTW the 2.v wasn't caught - I had to use the `|\b([1-9][0-9]{0,3}\.)(?=[ivxlcdm]+\b)(m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3}))\b` — GeekDad66, Mar 23 '16 at 00:41
Argh - misses the 'an ' condition. I know your `(a(?:nd?)?` is the issue. I could simply change it to `a|an|and|`, but to keep your code's elegance, what should it be? — GeekDad66, Mar 23 '16 at 01:03
I changed the `(a(?:nd?)?` to `a|an|and|` and that portion now works, although it is not as elegant/efficient as your code. I mentioned this to a friend and he would like to use it for MP3 filenames, which would be in the format `## Title`. I tried a few iterations, such as prefixing with `\b[0-9][1-9]{0,3}\b`, without success. Any suggestions? — GeekDad66, Mar 31 '16 at 23:41
