Well, first of all, there is no single "regex": every tool and language has its own implementation. You won't be able to fulfill your task in most regex flavours, as they don't support manipulating the match (converting uppercase to lowercase and vice versa) in the replacement.
However, the Boost regex engine, which is used in Notepad++ (where I tested it) and is available as a C++ library, can do this kind of thing.
So let's start with the matching part:
\b(?<!^)(a(?:nd?)?|the|to|[io]n|from|with)(?!$)\b|\b(?=[ivxlcdm]+\b)(m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3}))(?<=\w)\b|\b(\w)(\w*)\b
You can use this for matching in most regex flavours, provided they support lookahead and lookbehind (JavaScript engines that predate ES2018 don't support lookbehind). In some languages you have to double the backslashes inside string literals (e.g. Java). You also need to enable the modifiers for multiline matching (so that the anchors ^ and $ match at the beginning/end of every line) and for case-insensitive matching. Notepad++ applies multiline matching automatically and has a checkbox for case insensitivity.
I use \b quite often in here, as it asserts the start/end of a word, so we only get complete words into our match.
Basically I'm checking for 3 different cases:
- One of your keywords that shall be lowercase, but not at the start or end of a line (note: checking for "to" only as part of an infinitive is not possible, as a regex cannot interpret the language)
- A Roman Numeral
- Any other word
So \b(?<!^)(a(?:nd?)?|the|to|[io]n|from|with)(?!$)\b
matches one of your keywords, if it is neither at the start ((?<!^)) nor at the end ((?!$)) of a line, making use of negative lookbehind and lookahead as well as anchors.
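To see the keyword part in isolation, here is a minimal sketch in Python. It assumes Python's built-in re module accepts this sub-pattern unchanged with the case-insensitive and multiline flags switched on; the names KEYWORD and keywords_in are made up for this illustration and are not part of the Notepad++ solution.

```python
import re

# Keyword sub-pattern from above, unchanged.
# IGNORECASE: "IN" and "in" both match.
# MULTILINE: ^ and $ apply to every line, not just the whole text.
KEYWORD = re.compile(
    r"\b(?<!^)(a(?:nd?)?|the|to|[io]n|from|with)(?!$)\b",
    re.IGNORECASE | re.MULTILINE,
)


def keywords_in(text):
    """Return the keywords that would be forced to lowercase."""
    return [m.group(1) for m in KEYWORD.finditer(text)]


print(keywords_in("ONCE IN A WHILE"))
print(keywords_in("the world we live in"))
```

The second call should come back empty, because "the" sits at the start of the line and "in" at its end, so both lookarounds reject them.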
\b(?=[ivxlcdm]+\b)(m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3}))(?<=\w)\b
matches a Roman numeral. The actual check ((m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3}))) is taken from this answer, so all credit goes to its author. I added a lookahead ((?=[ivxlcdm]+\b)) at the start to ensure that the word consists only of letters that can build a Roman numeral (this is purely a speed optimization), and a lookbehind ((?<=\w)) at the end to make sure we don't match an empty string (for words like "ill", which contain only valid letters but aren't actually a Roman numeral).
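The Roman numeral part can be tried on its own as well. Again this is only a sketch in Python, assuming the built-in re module handles the sub-pattern as written; ROMAN and roman_numerals_in are hypothetical names used just for this example.

```python
import re

# Roman numeral sub-pattern from above, split over three string
# literals for readability; the regex itself is unchanged.
ROMAN = re.compile(
    r"\b(?=[ivxlcdm]+\b)"  # the whole word uses only numeral letters
    r"(m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3}))"
    r"(?<=\w)\b",          # reject the empty match at a word start
    re.IGNORECASE,
)


def roman_numerals_in(text):
    """Return the words that would be forced to uppercase."""
    return [m.group(1) for m in ROMAN.finditer(text)]


print(roman_numerals_in("Louis xiv and Edward IV"))
print(roman_numerals_in("ILL till we die"))
```

Note that a word which happens to be a well-formed numeral, such as "mix" (MIX), still matches; the pattern cannot tell the two readings apart.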
\b(\w)(\w*)\b
matches every word that hasn't been matched by the previous two alternatives, putting the first letter in one capturing group and the remaining letters in a second. The split into these groups is needed to convert the first letter to uppercase and the rest to lowercase.
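The replace is rather simple: \L$1\U$2\U$3\L$4
It makes use of \L and \U, which in Boost regex force the following letters to be lowercase or uppercase. $1 is a backreference to the first capturing group, and so on: $1 is the keyword (lowercased), $2 the Roman numeral (uppercased), $3 the first letter of any other word (uppercased) and $4 the rest of that word (lowercased). Only one alternative matches at a time, so the other groups are empty.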
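Flavours without \L and \U can often get the same effect with a replacement callback instead of a replacement string. The following is a minimal sketch in Python, assuming its built-in re module accepts the full pattern unchanged; the callback is a swapped-in technique standing in for the Boost replacement string, and the names PATTERN, fix_case and title_case are made up for this illustration.

```python
import re

# Full pattern from above: keyword | Roman numeral | any other word.
PATTERN = re.compile(
    r"\b(?<!^)(a(?:nd?)?|the|to|[io]n|from|with)(?!$)\b"
    r"|\b(?=[ivxlcdm]+\b)"
    r"(m{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3}))"
    r"(?<=\w)\b"
    r"|\b(\w)(\w*)\b",
    re.IGNORECASE | re.MULTILINE,
)


def fix_case(match):
    """Callback that mimics the replacement \\L$1\\U$2\\U$3\\L$4."""
    keyword, roman, first, rest = match.groups()
    if keyword is not None:   # group 1: keyword -> lowercase
        return keyword.lower()
    if roman is not None:     # group 2: Roman numeral -> uppercase
        return roman.upper()
    # groups 3 and 4: first letter uppercase, rest lowercase
    return first.upper() + rest.lower()


def title_case(text):
    return PATTERN.sub(fix_case, text)


print(title_case("a NEw kinD of ScIENce"))
print(title_case("Louis xiv and Edward IV"))
```

Groups that did not take part in the match come back as None in Python, which is why the callback checks them in order rather than concatenating all four.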
So if we have a sample text like:
a NEw kinD of ScIENce
ONCE IN A WHILE
the world we live in
GHOST in the Shell
To Be Or Not To Be
Louis xiv and Edward IV
In Year mmXII we will all die
ILL till we die
it will be converted into:
A New Kind Of Science
Once in a While
The World We Live In
Ghost in the Shell
To Be Or Not to Be
Louis XIV and Edward IV
In Year MMXII We Will All Die
Ill Till We Die