Use case: I have a large bulk of text I need to match, line by line, against a large list of terms to audit for consistency etc.
What I do: Take the term list, order it by length (desc) to ensure longer matches precede potential substring matches, and join the terms into a monster \b(capture|these|words|list)\b. This is all working basically fine. Generating various case variants for each term would bloat the regex beyond ridiculous, so terms are matched case-insensitively with the i flag.
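For illustration, a minimal sketch of that construction step. The $terms array is a hypothetical stand-in for the real list, and the ~ delimiter is assumed to match the sample code further down:

```php
<?php
// Hypothetical term list; the real one is much larger.
$terms = ['IT', 'iterations', 'coffee', 'random', 'Will'];

// Longest first, so a longer term is tried before any shorter term it starts with.
usort($terms, function ($a, $b) {
    return strlen($b) <=> strlen($a);
});

// Escape regex metacharacters (and the "~" delimiter) in every term.
$quoted = array_map(function ($term) {
    return preg_quote($term, '~');
}, $terms);

// One big alternation between word boundaries, case-insensitive via the i flag.
$rx = '~\b(' . implode('|', $quoted) . ')\b~i';

echo $rx, PHP_EOL;
```

The preg_quote() call only matters if terms can contain characters such as dots or parentheses, but it is cheap insurance for a generated pattern.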
Problem: There are some terms that should only ever be matched in uppercase or title case, e.g. IT or WHO, Will or Sandy. If every it and will is also matched, it generates a ridiculous volume of noise matches to review. Having "Name" terms confused at the start of a sentence is not really an issue here; and there's no technical way to tell them apart anyway.
Required: I need to come up with a case-insensitive regex with a large capture group, inside which indicated options are handled as case-sensitive instead. As far as I can tell, there are no such modifiers that could be attached to individual capture groups or parts thereof. What other approaches do we have in the toolbox?
Sample Code:
$rx = '~\b(iterations|coffee|random|Will|IT)\b~i';
$texts = [
    'Complex iterations, coffee and IT infrastructure', // should match "iterations", "coffee", "IT"
    'We will call them Iterations later on', // should match "Iterations"
    'On matters of IT, we have pondered much', // should match "IT"
    'No matched bits and it is for the great good.' // should not match
];
// Collect all matches for each text
$result = array_map(function($text) use ($rx) {
    preg_match_all($rx, $text, $match);
    return $match;
}, $texts);
var_dump($result);
Any solutions, aside from a second round of matching with a case-sensitive regex to filter out unwanted cases? That would mean checking for matches that contain an "unwanted" case of certain terms but no other "wanted" terms. Doable, but I would prefer to keep the business logic in a single regex, to keep this more portable.
P.S. I've read Combine case sensitive and insensitive regex... and it doesn't cover this.
Update: It is possible to use inline modifiers inside groups! Thanks, Wiktor. For my case, (?-i:Will|IT) would do it. The global i flag is canceled with -i for that specific group only. There's an excellent answer on Can you make just part of a regex case-insensitive? with more details and implementation in various languages.
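Applied to the sample code above, a sketch of the adjusted pattern could look like this. Only the case-sensitive terms move into the (?-i:...) group; returning just $match[1] is my own simplification for readability:

```php
<?php
// Case-sensitive terms are wrapped in (?-i:...), which switches the
// global i flag off for that group only.
$rx = '~\b(iterations|coffee|random|(?-i:Will|IT))\b~i';

$texts = [
    'Complex iterations, coffee and IT infrastructure',
    'We will call them Iterations later on',
    'On matters of IT, we have pondered much',
    'No matched bits and it is for the great good.',
];

// Keep only the captured terms (group 1) for each text.
$result = array_map(function ($text) use ($rx) {
    preg_match_all($rx, $text, $match);
    return $match[1];
}, $texts);

var_dump($result);
// Text 1: "iterations", "coffee", "IT"
// Text 2: "Iterations" (lowercase "will" is skipped)
// Text 3: "IT"
// Text 4: no matches (lowercase "it" is skipped)
```

When generating the pattern from a term list, this would mean keeping two lists (case-insensitive and case-sensitive terms) and joining the second one inside the (?-i:...) group.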