Use case: I have a large body of text that I need to match, line by line, against a large list of terms, to audit for consistency etc.

What I do: Take the term list, order by length (desc) to ensure longer matches precede potential substring matches, and join them into a monster \b(capture|these|words|list)\b. This is all working basically fine. Generating various case variants for each term would bloat the regex beyond ridiculous, so terms are matched case-insensitively with the i flag.
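
A minimal sketch of that build step, assuming a hypothetical $terms array (the real list is far larger) and the same ~ delimiter as the sample code below:

```php
<?php
// Hypothetical term list, for illustration only.
$terms = ['IT', 'coffee', 'iterations', 'random', 'Will'];

// Longest first, so a longer term is tried before a shorter one.
usort($terms, fn($a, $b) => strlen($b) <=> strlen($a));

// Escape each term so regex metacharacters and the '~' delimiter stay literal.
$escaped = array_map(fn($t) => preg_quote($t, '~'), $terms);

// One case-insensitive alternation wrapped in word boundaries.
$rx = '~\b(' . implode('|', $escaped) . ')\b~i';

echo $rx, PHP_EOL;
```

With the i flag applied globally, this pattern still has the noise problem described next: it matches "it" and "will" in any case.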

Problem: There are some terms that should only ever be matched in uppercase or title case, e.g. IT or WHO, Will or Sandy. If every it and will is also matched, it generates a ridiculous volume of noise matches to review. Having "Name" terms confused at the start of a sentence is not really an issue here; and there's no technical way to tell them apart anyway.

Required: I need to come up with a case-insensitive regex with a large capture group, inside which indicated options are handled as case-sensitive instead. As far as I can tell, there are no such modifiers that could be attached to individual capture groups or parts thereof. What other approaches do we have in the toolbox?

Sample Code:

$rx = '~\b(iterations|coffee|random|Will|IT)\b~i';

$texts = [
    'Complex iterations, coffee and IT infrastructure', // should match "iterations", "coffee", "IT"
    'We will call them Iterations later on', // should match "Iterations"
    'On matters of IT, we have pondered much', // should match "IT"
    'No matched bits and it is for the great good.' // should not match
];

$result = array_map(function($text) use ($rx) {
    preg_match_all($rx, $text, $match);
    return $match;
}, $texts);

var_dump($result);

Any solutions, aside from a second round of matching to filter out unwanted cases with a case-sensitive regex? That would mean checking for lines that match certain terms in an "unwanted" case but contain no other "wanted" terms. Doable, but I would prefer if the business logic here could be kept to a single regex, to keep this more portable.


P.S. I've read Combine case sensitive and insensitive regex... and it doesn't cover this.

Update: It is possible to use inline modifiers on groups! Thanks, Wiktor. For my case, (?-i:Will|IT) would do it. The global i flag is canceled with -i for that specific (non-capturing) group. There's an excellent answer on Can you make just part of a regex case-insensitive? with more details and implementation in various languages.
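
Here is a rough sketch of how the sample code could look with that group in place. Splitting the terms into $insensitive and $sensitive arrays is my own assumption for illustration; the pattern it produces is the one from Wiktor's comment.

```php
<?php
// Hypothetical split of the term list; which terms count as
// case-sensitive is an assumption made for this example.
$insensitive = ['iterations', 'coffee', 'random'];
$sensitive   = ['Will', 'IT'];

// Escape and join a list of terms into an alternation.
$join = fn(array $terms) => implode('|', array_map(fn($t) => preg_quote($t, '~'), $terms));

// (?-i:...) turns the global i flag off for the alternatives inside it only.
$rx = '~\b(' . $join($insensitive) . '|(?-i:' . $join($sensitive) . '))\b~i';

$texts = [
    'Complex iterations, coffee and IT infrastructure',
    'We will call them Iterations later on',
    'On matters of IT, we have pondered much',
    'No matched bits and it is for the great good.',
];

$results = [];
foreach ($texts as $text) {
    preg_match_all($rx, $text, $m);
    $results[] = $m[1]; // contents of the single capture group
    echo json_encode($m[1]), PHP_EOL;
}
```

With these inputs, the lowercase "will" and "it" are skipped, while "Iterations" still matches in any case.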

Markus AO
  • `$rx = '~\b(iterations|coffee|random|(?-i:Will|IT))\b~i';` – Wiktor Stribiżew Jan 12 '22 at 12:06
  • OMG it _is_ possible to use inline modifiers. Thank you @WiktorStribiżew! Spent a fair while looking, how come that didn't show up. I may need a searching refresher. I have closed this question, and whoever has the powers, could you please add a "duplicate" link at the top with this as the answer: https://stackoverflow.com/questions/43632/can-you-make-just-part-of-a-regex-case-insensitive/58818125 – Markus AO Jan 12 '22 at 12:13
  • Flagged for moderator attention. FYI: whoever closed this logged a duplicate that was _explicitly stated_ in the OP as _not addressing the issue_. There is an actual answer that addresses the question linked in my comment above. *No need to reopen, just please update the dupe* so this question retains some referential value. – Markus AO Jan 12 '22 at 14:01
