Logic
I suggest the following approach that combines a regex and common string processing logic:
- use preg_match with the appropriate regex to match the first occurrence of the whole chunk of text starting with the keyword from the $terms array till the last number (+ optional section letter) related to the term
- once the match is obtained, create an array that includes the input string, the match value, and the post-processed match
- post-processing removes the spaces around hyphens in hyphenated numbers and rebuilds numeric ranges when numbers are joined with +, & or , characters. This is a multi-step operation: 1) match the hyphen-separated substrings in the overall match and trim off unnecessary zeros and whitespace, 2) split the remaining number chunks into separate items and pass them to a separate function that generates the number ranges
- the buildNumChain($arr) function creates the number ranges and, if a letter follows a standalone number, converts it to a part X suffix (A becomes part 1, B becomes part 2, and so on).
Solution
You may use
$strs = ['c0', 'c0-3', 'c0+3', 'c0 & 9', 'c0001, 2, 03', 'c01-03', 'c1.0 - 2.0', 'chapter 2A Hello', 'chapter 2AHello', 'chapter 10.4c', 'chapter 2B', 'episode 23.000 & 00024', 'episode 23 & 24', 'e23 & 24', 'text c25.6 text', '001 & 2 & 5 & 8-20 & 100 text chapter 25.6 text 98', 'hello 23 & 24', 'ep 1 - 2', 'chapter 1 - chapter 2', 'text chapter 25.6 text', 'text chapters 23, 24, 25 text','text chapter 23, 25 text', 'text chapter 23 & 24 & 25 text','text c25.5-30 text', 'text c99-c102 text', 'text chapter 1 - 3 text', '33 text chapter 1, 2 text 3','text chapters 23, 24, 25, 29, 31, 32 text', 'c19 & c20', 'chapter 25.6 & chapter 29', 'chapter 25+c26', 'chapter 25 + 26 + 27'];
$terms = ['episode', 'chapter', 'ch', 'ep', 'c', 'e', ''];
// Sort the terms by length, longest first.
usort($terms, function ($a, $b) {
    return strlen($b) - strlen($a);
});

// Main term pattern: a branch reset group that captures the first letter of the
// term in Group 1 and the rest (plus an optional plural "s") in Group 2.
$chapter_main_rx = "\b(?|" . implode("|", array_map(function ($term) {
    return strlen($term) > 0 ? "(" . substr($term, 0, 1) . ")(" . substr($term, 1) . "s?)" : "()()";
}, $terms)) . ")\s*";

// Auxiliary term pattern: the same alternatives without capturing groups.
$chapter_aux_rx = "\b(?:" . implode("|", array_map(function ($term) {
    return strlen($term) > 0 ? substr($term, 0, 1) . "(?:" . substr($term, 1) . "s?)" : "";
}, $terms)) . ")\s*";

// Group 3 is the whole number sequence, Group 4 is a single number with an optional letter.
$reg = "~$chapter_main_rx((\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+-]|and)\s*(?:$chapter_aux_rx)?(?4))*)~ui";

foreach ($strs as $s) {
    if (preg_match($reg, $s, $m)) {
        // First alternative: a number followed by a hyphen (normalized to "N-").
        // Second alternative: numbers joined with , & + or "and" (passed to buildNumChain).
        $p3 = preg_replace_callback(
            "~(\d*(?:\.\d+)?)([A-Z]?)\s*-\s*(?:$chapter_aux_rx)?|(\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?(?1))*~ui",
            function ($x) use ($chapter_aux_rx) {
                return (isset($x[3]) && strlen($x[3]))
                    ? buildNumChain(preg_split("~\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?~ui", $x[0]))
                    : ((isset($x[1]) && strlen($x[1])) ? ($x[1] + 0) : "") . ((isset($x[2]) && strlen($x[2])) ? ord(strtolower($x[2])) - 96 : "") . "-";
            },
            $m[3]
        );
        print_r(["original" => $s, "found_match" => trim($m[0]), "converted" => $m[1] . $p3]);
        echo "\n";
    } else {
        echo "No match for '$s'!\n";
    }
}
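function buildNumChain($arr) {
    $ret = "";
    $rngnum = "";   // non-empty while inside a run of consecutive numbers
    for ($i = 0; $i < count($arr); $i++) {
        $val = $arr[$i];
        $part = "";
        // Separate the number from an optional trailing letter (A => " part 1", B => " part 2", ...).
        if (preg_match('~^(\d+(?:\.\d+)?)([A-Z]?)$~i', $val, $ms)) {
            $val = $ms[1];
            if (!empty($ms[2])) {
                $part = ' part ' . (ord(strtolower($ms[2])) - 96);
            }
        }
        $val = $val + 0;   // numeric cast, drops leading zeros and a trailing ".0"
        if (($i < count($arr) - 1) && $val == ($arr[$i+1] + 0) - 1) {
            // The next number is consecutive: open a range if none is open yet.
            // A strict comparison is used because empty() would treat a range starting at 0 as "no range".
            if ($rngnum === "") {
                $ret .= ($i == 0 ? "" : " & ") . $val;
            }
            $rngnum = $val;
        } else if ($rngnum !== "") {
            // End of a run: close the range.
            $ret .= '-' . $val;
            $rngnum = "";
        } else {
            $ret .= ($i == 0 ? "" : " & ") . $val . $part;
        }
    }
    return $ret;
}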
See the PHP demo.
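The solution sorts $terms by descending length before building the alternations. The answer does not say why; a likely reason (an assumption) is that regex alternation is ordered, so a short term listed first can win over a longer one. The sketch below illustrates that with a hypothetical buildAlternation helper that is not part of the solution:

```php
<?php
// Hypothetical helper (not part of the answer): builds a plain alternation,
// longest term first, mirroring the usort() call in the solution.
function buildAlternation(array $terms): string
{
    usort($terms, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    return implode('|', array_map(function ($t) {
        return preg_quote($t, '~');
    }, $terms));
}

$unsorted = 'c|ch|chapter';
$sorted = buildAlternation(['c', 'ch', 'chapter']); // chapter|ch|c

preg_match("~(?:$unsorted)~", 'chapter 5', $m1);
preg_match("~(?:$sorted)~", 'chapter 5', $m2);

echo $m1[0], "\n"; // c
echo $m2[0], "\n"; // chapter
```

In the full solution the term is followed by \s* and a digit, so backtracking would often recover the longer term anyway; sorting simply makes the intended priority explicit.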
Main points
- Match c or chapter/chapters (or any other term) with the numbers that follow them, capturing just the first letter (e.g. c) and the numbers
- After a match is found, process Group 3, which contains the number sequence
- All <number>-c?<number> substrings should be stripped of whitespace and of the c before/in between the numbers
- All ,/&-separated numbers should be post-processed with buildNumChain, which generates ranges out of consecutive numbers (whole numbers are assumed).
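The range-building idea can be sketched in isolation. The collapseRanges function below is a hypothetical, simplified stand-in for buildNumChain: it assumes whole numbers only and ignores decimals and letter suffixes.

```php
<?php
// Hypothetical helper, simplified from the idea behind buildNumChain():
// collapses runs of consecutive whole numbers into "start-end" ranges.
function collapseRanges(array $nums): string
{
    $parts = [];
    $count = count($nums);
    for ($i = 0; $i < $count; $i++) {
        $start = $end = (int) $nums[$i];
        // Extend the run while the next item is exactly one greater.
        while ($i + 1 < $count && (int) $nums[$i + 1] === $end + 1) {
            $end = (int) $nums[++$i];
        }
        $parts[] = $start === $end ? (string) $start : "$start-$end";
    }
    return implode(' & ', $parts);
}

echo collapseRanges(['23', '24', '25', '29', '31', '32']), "\n"; // 23-25 & 29 & 31-32
echo collapseRanges(['001', '2', '5']), "\n";                    // 1-2 & 5
```

The (int) casts also drop leading zeros, which is the same effect the solution gets from adding 0 to each value.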
With $terms = ['episode', 'chapter', 'ch', 'ep', 'c', 'e', ''], the main regex built by the code will look like this:
'~\b(?|(e)(pisodes?)|(c)(hapters?)|(c)(hs?)|(e)(ps?)|(c)(s?)|(e)(s?)|()())\s*((\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+-]|and)\s*(?:\b(?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|)\s*)?(?4))*)~ui'
See the regex demo.
Pattern details
\b - a word boundary
(?|(e)(pisodes?)|(c)(hapters?)|(c)(hs?)|(e)(ps?)|(c)(s?)|(e)(s?)|()()) - a branch reset group that captures the first letter of the search term into Group 1 and the rest of the term into an obligatory Group 2. If there is an empty term, ()() is added to make sure all branches in the group contain the same number of groups
\s* - 0+ whitespaces
((\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+-]|and)\s*(?:\b(?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|)\s*)?(?4))*) - Group 3, the whole number sequence:
  (\d+(?:\.\d+)?(?:[A-Z]\b)?) - Group 4: 1+ digits, followed with an optional sequence of . and 1+ digits, and then an optional ASCII letter that must be followed with a non-word char or end of string (note that the case insensitive modifier makes [A-Z] also match lowercase ASCII letters)
  (?:\s*(?:[,&+-]|and)\s*(?:\b(?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|)\s*)?(?4))* - zero or more sequences of
    \s*(?:[,&+-]|and)\s* - a ,, &, +, - or and enclosed with optional 0+ whitespaces
    (?:\b(?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|)\s*)? - an optional term (any of the terms with an optional plural ending s) followed with 0+ whitespaces
    (?4) - the Group 4 pattern recursed / repeated
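Two PCRE features carry most of this pattern: the branch reset group and the subroutine call. The sketch below shows each on a deliberately reduced pattern (two terms only, a simplified separator); the patterns are illustrative assumptions, not the ones from the solution.

```php
<?php
// Branch reset: every alternative fills Groups 1 and 2,
// so the number is always Group 3 regardless of which term matched.
$rx = '~\b(?|(e)(pisodes?)|(c)(hapters?))\s*(\d+)~i';
preg_match($rx, 'episode 7', $a);
preg_match($rx, 'Chapters 12', $b);
echo implode(' | ', [$a[1], $a[2], $a[3]]), "\n"; // e | pisode | 7
echo implode(' | ', [$b[1], $b[2], $b[3]]), "\n"; // C | hapters | 12

// Subroutine call: (?1) re-runs the Group 1 pattern without creating a new group,
// so the number pattern is written once and reused for every following item.
preg_match('~(\d+(?:\.\d+)?)(?:\s*[,&]\s*(?1))*~', 'see 23, 24 & 25.5 now', $c);
echo $c[0], "\n"; // 23, 24 & 25.5
echo $c[1], "\n"; // 23
```

Group 1 still holds 23 after the match because values captured during a subroutine call are not kept once the call returns.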
When the regex matches, the Group 1 value is the first letter of the term (e.g. c), so it will be the first part of the result. Then,
"~(\d*(?:\.\d+)?)([A-Z]?)\s*-\s*(?:$chapter_aux_rx)?|(\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?(?1))*~ui"
is used inside preg_replace_callback to remove the whitespace around - (if any) and any term (followed with 0+ whitespace chars) that comes after it. If Group 3 matches, the match is split with the
"~\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?~ui"
regex (it matches &, ,, + or and enclosed with optional 0+ whitespaces, followed with an optional term and 0+ whitespaces), and the resulting array is passed to the buildNumChain function that builds the resulting string.
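Both post-processing steps can be tried separately on small inputs. This is a minimal sketch, assuming a reduced $aux pattern (only c / chapter / chapters) in place of $chapter_aux_rx and a simplified hyphen regex that handles both numbers of a pair at once; $tidy is a hypothetical name.

```php
<?php
// Simplified stand-in for $chapter_aux_rx: an optional "c" / "chapter(s)" prefix.
$aux = '\bc(?:hapters?)?\s*';

// 1) Tighten hyphenated pairs and normalize both numbers with "+ 0".
$tidy = function (string $s) use ($aux): string {
    return preg_replace_callback(
        "~(\d+(?:\.\d+)?)\s*-\s*(?:$aux)?(\d+(?:\.\d+)?)~i",
        function ($x) {
            // "+ 0" converts the numeric string to a number: "01" => 1, "1.0" => 1
            return ($x[1] + 0) . '-' . ($x[2] + 0);
        },
        $s
    );
};
echo $tidy('01 - chapter 03'), "\n"; // 1-3
echo $tidy('1.0 - 2.0'), "\n";       // 1-2

// 2) Split a list on , & + or "and", swallowing a term that follows the separator.
$items = preg_split("~\s*(?:[,&+]|and)\s*(?:$aux)?~i", '23, 24 and chapter 25 + c26');
echo implode('|', $items), "\n"; // 23|24|25|26
```

The items produced by the split are what a range builder such as buildNumChain would then receive.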