Determining chapter number in different types of text

Question

I'm pulling titles from novel-related posts. The aim is to use regex to determine which chapter(s) each post is about. Each site identifies chapters differently. Here are the most common cases:

$title = 'text chapter 25.6 text'; // c25.6
$title = 'text chapters 23, 24, 25 text'; // c23-25
$title = 'text chapters 23+24+25 text'; // c23-25
$title = 'text chapter 23, 25 text'; // c23 & 25
$title = 'text chapter 23 & 24 & 25 text'; // c23-25
$title = 'text c25.5-30 text'; // c25.5-30
$title = 'text c99-c102 text'; // c99-102
$title = 'text chapter 99 - chapter 102 text'; // c99-102
$title = 'text chapter 1 - 3 text'; // c1-3
$title = '33 text chapter 1, 2 text 3'; // c1-2
$title = 'text v2c5-10 text'; // c5-10
$title = 'text chapters 23, 24, 25, 29, 31, 32 text'; // c23-25 & 29 & 31-32

The chapter numbers are always listed in the title, just in different variations as displayed above.

What I have so far

So far, I have a regex to determine single cases of chapters, like:

$title = '9 text chapter 25.6 text'; // c25.6

Using this code (try ideone):

function get_chapter($text, $terms) {

    if (empty($text)) return;
    if (empty($terms) || !is_array($terms)) return;

    $values = false;

    $terms_quoted = array();
    foreach ($terms as $term)
        $terms_quoted[] = preg_quote($term, '/');

    // search for matches in $text
    // matches with lowercase, and ignores white spaces...
    if (preg_match('/('.implode('|', $terms_quoted).')\s*(\d+(\.\d+)?)/i', $text, $matches)) {
        if (!empty($matches[2]) && is_numeric($matches[2])) {
            $values = array(
                'term' => $matches[1],
                'value' => $matches[2]
            );
        }
    }

    return $values;
}

$text = '9 text chapter 25.6 text'; // c25.6
$terms = array('chapter', 'chapters');
$chapter = get_chapter($text, $terms);

print_r($chapter);

if ($chapter) {
    echo 'Chapter is: c'. $chapter['value'];
}

How do I make this work with the other examples listed above? Given the complexity of this question, I will bounty it 200 points when eligible.

Nice! Every question should come with a *what I've tried so far* heading. +1 — Gary Woods, Jul 16 '18 at 13:26
Given the complexity of this question, I will bounty it 200 points when eligible. — Henrik Petterson, Jul 16 '18 at 13:27
If you don't have a firm set of rules defining your possible input, there's really no way to come up with a one-size-fits-all solution. — Patrick Q, Jul 16 '18 at 13:28
@PatrickQ I do have the rules set. See the examples listed above. Those are all the variations I have seen when processing thousands of posts. =) — Henrik Petterson, Jul 16 '18 at 13:29
This is not a problem to be solved with regex only, specially the cases where you have continuous chapters eg 23,24,25 and must convert it to 23-25. The thing that comes to mind now is to have an array with all possible regex rules (observing of course if there are colisions) do the appropriated treatment for the matched rule. somekind of `{ pattern: ClassToProcess }` that, obviously will take a lot of time depending on how big is your text to process. — Jorge Campos, Jul 16 '18 at 13:36
And also, there will be of course doubts, like in which rule should this situation `'text chapters 23, 24, 25, 29, 31, 32 text'` fit ? — Jorge Campos, Jul 16 '18 at 13:40
@JorgeCampos Your example would convert to: `c23-25 & 29 & 31-32` — Henrik Petterson, Jul 16 '18 at 13:44
So, a new rule... Like @PatrickQ said. You need to define all possible rules otherwise you wouldn't have a solution — Jorge Campos, Jul 16 '18 at 13:46
@JorgeCampos As I highlighted, these are all the possible rules. I processed thousands of titles to determine these rules. The only thing that could differ is that `chapter` is called something else... but as you can see in my code, I have already a solution for that. I totally appreciate that there may be one out of thousand that would be missed, but that's a good ratio for this type of script. — Henrik Petterson, Jul 16 '18 at 13:48
How did you processed it all? You just edited your question to add a new rule, the one I mentioned. And, by your samples, I can think of at least three or four more... Take a good look at it, there are small variations within your, now, rules! Specially the combinations of it. — Jorge Campos, Jul 16 '18 at 13:52
Super-interesting question. One approach is to convert all the values first, like `24, 25, 26` becomes `24-26` and *then* run a (modified) regex. Just a suggestion. I will be watching this Q&A! — Gary Woods, Jul 16 '18 at 13:56
@revo Yes. In this example: `$title = 'text c99-c102 text';` or `$title = 'text chapter 99 - chapter 102 text';` — Henrik Petterson, Jul 16 '18 at 15:07
C'mon man. You keep adding new input possibilities. You can't expect anyone to hit a moving target. — Patrick Q, Jul 16 '18 at 15:13
That's not true. `$title = 'text c99-c102 text';` already existed. And I suppose you can simply do a `str_replace()`of the other variations of the term if needed. =) — Gary Woods, Jul 16 '18 at 15:17
If you don't put the exact requirements into the question (what about `episodes` converted to `e`, `ch`... `part`, `2-3`...) others who don't read each comment of the existing answers will have it difficult to participate. If the requirements change or expand, an update of the question would be desireable. — bobble bubble, Jul 22 '18 at 11:41
@bobblebubble This is why I marked the bounty as rewarding *existing* answer. Although I understand your point. :) — Henrik Petterson, Jul 22 '18 at 18:32
HenrikPetterson, I agree with @bobblebubble that you should add all the possible cases that you have found after you formulated the question (as a new edition at the end of the question if you want). That question is not there only to solve your specific issue, it will be helpful for future readers with your same/a similar problem, so, those users might not understand why you formulated the question and some of the solutions are trying to address complicated cases that are not contained in it. And trying to read all the comments to understand that doesn't make much sense. — ElChiniNet, Jul 23 '18 at 10:50
... _Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems._ ... You absolutely sure you wont need to tweak the rules later on? Or maybe your successor? Will they be able to figure out the regex block and where to make changes in that? — inquisitive, Jul 23 '18 at 19:36
I'll never understand why people poo-poo regex just for the sake of poo-poo'ing it. — mickmackusa, Jul 25 '18 at 11:46
@HenrikPetterson There's an unfortunate disconnect between what you are asking for (and the input samples that you are offering) and the answer that I assume you will accept. Please ensure a level playing field for all volunteers by providing a master "battery" of strings that everyone should be testing with. — mickmackusa, Jul 25 '18 at 11:48

score 13 · Accepted Answer · edited Jun 20 '20 at 09:12

Logic

I suggest the following approach that combines a regex and common string processing logic:

- use preg_match with the appropriate regex to match the first occurrence of the whole chunk of text, starting with a keyword from the $terms array and ending with the last number (plus an optional section letter) related to the term
- once the match is obtained, create an array that includes the input string, the match value, and the post-processed match
- post-processing can be done by removing spaces in between hyphenated numbers and rebuilding numeric ranges in case of numbers joined with +, & or , chars. This requires a multi-step operation: 1) match the hyphen-separated substrings in the previous overall match and trim off unnecessary zeros and whitespace, 2) split the number chunks into separate items and pass them to a separate function that will generate the number ranges
- the buildNumChain($arr) function will create the number ranges and, if a letter follows a number, will convert it to a part X suffix.
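
The range-building step in the list above can be sketched in isolation. This is a minimal sketch and not the buildNumChain function from this answer: the name collapseRanges is hypothetical, and it assumes the input is already a list of whole numbers in the order they appeared in the title.

```php
<?php
// Hypothetical helper (not part of the answer): collapse a list of whole
// chapter numbers into ranges, e.g. [23, 24, 25, 29] -> "23-25 & 29".
function collapseRanges(array $nums): string
{
    $parts = [];
    $count = count($nums);
    for ($i = 0; $i < $count; $i++) {
        $start = $end = $nums[$i];
        // Extend the current run while the next value is exactly one higher.
        while ($i + 1 < $count && $nums[$i + 1] == $end + 1) {
            $end = $nums[++$i];
        }
        $parts[] = ($start == $end) ? (string) $start : "$start-$end";
    }
    return implode(' & ', $parts);
}

echo 'c' . collapseRanges([23, 24, 25, 29, 31, 32]), "\n"; // c23-25 & 29 & 31-32
```

Comparing with `==` and building strings only at the end means a chapter 0 is treated like any other number, which avoids relying on empty() for the "is a range open" check.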

Solution

You may use

$strs = ['c0', 'c0-3', 'c0+3', 'c0 & 9', 'c0001, 2, 03', 'c01-03', 'c1.0 - 2.0', 'chapter 2A Hello', 'chapter 2AHello', 'chapter 10.4c', 'chapter 2B', 'episode 23.000 & 00024', 'episode 23 & 24', 'e23 & 24', 'text c25.6 text', '001 & 2 & 5 & 8-20 & 100 text chapter 25.6 text 98', 'hello 23 & 24', 'ep 1 - 2', 'chapter 1 - chapter 2', 'text chapter 25.6 text', 'text chapters 23, 24, 25 text','text chapter 23, 25 text', 'text chapter 23 & 24 & 25 text','text c25.5-30 text', 'text c99-c102 text', 'text chapter 1 - 3 text', '33 text chapter 1, 2 text 3','text chapters 23, 24, 25, 29, 31, 32 text', 'c19 & c20', 'chapter 25.6 & chapter 29', 'chapter 25+c26', 'chapter 25 + 26 + 27'];
$terms = ['episode', 'chapter', 'ch', 'ep', 'c', 'e', ''];

usort($terms, function($a, $b) {
    return strlen($b) - strlen($a);
});
 
$chapter_main_rx = "\b(?|" . implode("|", array_map(function ($term) {
    return strlen($term) > 0 ? "(" . substr($term, 0, 1) . ")(" . substr($term, 1) . "s?)": "()()" ;},
  $terms)) . ")\s*";
$chapter_aux_rx = "\b(?:" . implode("|", array_map(function ($term) {
    return strlen($term) > 0 ? substr($term, 0, 1) . "(?:" . substr($term, 1) . "s?)": "" ;},
  $terms)) . ")\s*";

$reg = "~$chapter_main_rx((\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+-]|and)\s*(?:$chapter_aux_rx)?(?4))*)~ui";

foreach ($strs as $s) {
    if (preg_match($reg, $s, $m)) {
        $p3 = preg_replace_callback(
            "~(\d*(?:\.\d+)?)([A-Z]?)\s*-\s*(?:$chapter_aux_rx)?|(\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?(?1))*~ui", function($x) use ($chapter_aux_rx) {
                return (isset($x[3]) && strlen($x[3])) ? buildNumChain(preg_split("~\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?~ui", $x[0])) 
                : ((isset($x[1]) && strlen($x[1])) ? ($x[1] + 0) : "") . ((isset($x[2]) && strlen($x[2])) ? ord(strtolower($x[2])) - 96 : "") . "-";
            }, $m[3]);
        print_r(["original" => $s, "found_match" => trim($m[0]), "converted" => $m[1] . $p3]);
        echo "\n";
    } else {
        echo "No match for '$s'!\n";
    
    }
}

function buildNumChain($arr) {
    $ret = "";
    $rngnum = "";
    for ($i=0; $i < count($arr); $i++) {
        $val = $arr[$i];
        $part = "";
        if (preg_match('~^(\d+(?:\.\d+)?)([A-Z]?)$~i', $val, $ms)) {
            $val = $ms[1];
            if (!empty($ms[2])) {
                $part = ' part ' . (ord(strtolower($ms[2])) - 96);
            }
        }
        $val = $val + 0;
        if (($i < count($arr) - 1) && $val == ($arr[$i+1] + 0) - 1) {
            // strict comparison: empty() would treat chapter 0 as "no range open"
            if ($rngnum === "")  {
                $ret .= ($i == 0 ? "" : " & ") . $val;
            }
            $rngnum = $val;
        } else if ($rngnum !== "" || $i == count($arr)) {
            $ret .= '-' . $val;
            $rngnum = "";
        } else {
            $ret .= ($i == 0 ? "" : " & ") . $val . $part;
        }
    }
    return $ret;
}

See the PHP demo.

Main points

- Match a term such as c or chapter/chapters together with the numbers that follow it, capturing just the first letter of the term and the numbers
- After a match is found, process Group 3, which contains the number sequence
- All <number>-c?<number> substrings are stripped of whitespace and of any term before or in between the numbers
- All ,/&/+-separated numbers are post-processed with buildNumChain, which generates ranges out of consecutive numbers (whole numbers are assumed).
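
The hyphen clean-up from the third point can be tried on its own. The following is only a sketch under assumptions: tightenRange is a hypothetical name, and the term pattern is hard-coded to c, ch and chapter(s) instead of being built from $terms as the answer does.

```php
<?php
// Assumed term pattern, for illustration only: "chapter(s)", "ch" or "c".
$term = '(?:chapters?|ch|c)';

// Hypothetical helper: turn "99 - chapter 102" or "99-c102" into "99-102"
// by dropping whitespace around the hyphen and a repeated term after it.
function tightenRange(string $s, string $term): string
{
    return preg_replace("~(\d)\s*-\s*(?:\b$term\s*)?(?=\d)~i", '$1-', $s);
}

echo tightenRange('99 - chapter 102', $term), "\n"; // 99-102
echo tightenRange('c99-c102', $term), "\n";         // c99-102
```

The lookahead `(?=\d)` keeps the replacement from firing on a hyphen that is not followed by a number.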

With $terms = ['episode', 'chapter', 'ch', 'ep', 'c', 'e', ''], the main regex looks like this (shown without the leading \b word boundaries that the code above adds):

'~(?|(e)(pisodes?)|(c)(hapters?)|(c)(hs?)|(e)(ps?)|(c)(s?)|(e)(s?)|()())\s*((\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+-]|and)\s*(?:(?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|)\s*)?(?4))*)~ui'

See the regex demo.

Pattern details

(?|(e)(pisodes?)|(c)(hapters?)|(c)(hs?)|(e)(ps?)|(c)(s?)|(e)(s?)|()()) - a branch reset group that captures the first letter of the search term into Group 1 and the rest of the term into an obligatory Group 2. If there is an empty term, the ()() are added to make sure the branches in the group contain the same number of groups
\s* - 0+ whitespaces
((\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+-]|and)\s*(?:(?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|)\s*)?(?4))*) - Group 3:
- (\d+(?:\.\d+)?(?:[A-Z]\b)?) - Group 4: 1+ digits, followed with an optional sequence of ., 1+ digits and then an optional ASCII letter that should be followed with a non-word char or end of string (note the case insensitive modifier will make [A-Z] also match lowercase ASCII letters)
- (?:\s*(?:[,&+-]|and)\s*(?:(?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|)\s*)?(?4))* - zero or more sequences of
  - \s*(?:[,&+-]|and)\s* - a ,, &, +, - or and surrounded by optional 0+ whitespaces
  - (?:e(?:pisodes?)|c(?:hapters?)|c(?:hs?)|e(?:ps?)|c(?:s?)|e(?:s?)|) - any of the terms with an optional plural ending s
  - (?4) - Group 4 pattern recursed / repeated
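
To see what the branch reset group does to group numbering, here is a reduced sketch. The pattern is a simplified stand-in with only two terms, not the full regex from this answer.

```php
<?php
// Inside (?|...), every alternative writes to the same group numbers, so the
// first letter is always Group 1, the rest of the term Group 2, and the
// number Group 3, whichever alternative matched.
$rx = '~\b(?|(c)(hapters?)|(e)(pisodes?)|(c)()|(e)())\s*(\d+)~i';

preg_match($rx, 'text episode 23 text', $m);
print_r($m);  // [0] => episode 23, [1] => e, [2] => pisode, [3] => 23

preg_match($rx, 'text c25 text', $m2);
print_r($m2); // [0] => c25, [1] => c, [2] => (empty), [3] => 25
```

Without the branch reset, each alternative would get its own pair of group numbers and the code would have to check which pair was filled.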

When the regex matches, the Group 1 value is the first letter of the matched term (for example, c), so it will be the first part of the result. Then,

 "~(\d*(?:\.\d+)?)([A-Z]?)\s*-\s*(?:$chapter_aux_rx)?|(\d+(?:\.\d+)?(?:[A-Z]\b)?)(?:\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?(?1))*~ui"

is used inside preg_replace_callback to remove whitespace around - (if any) and any term (with its trailing whitespace) that follows the hyphen; if Group 3 matches instead, the match is split with

"~\s*(?:[,&+]|and)\s*(?:$chapter_aux_rx)?~ui"

regex (it matches &, ,, + or and surrounded by optional whitespace, then an optional term followed by optional whitespace), and the resulting array is passed to the buildNumChain function, which builds the resulting string.
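
The splitting step can be checked separately. This sketch assumes a hard-coded optional term ($aux is a simplified stand-in for $chapter_aux_rx) and a made-up input string.

```php
<?php
// Assumed optional repeated term after a separator: "chapter(s)", "ch" or "c".
$aux = '(?:\b(?:chapters?|ch|c)\s*)?';

// Split on ",", "&", "+" or "and", swallowing surrounding whitespace and
// any term that follows the separator.
$items = preg_split("~\s*(?:[,&+]|and)\s*$aux~i", '23, 24 & chapter 25 and 29+c31');

print_r($items); // 23, 24, 25, 29, 31
```

Only the bare numbers remain, which is the shape of array that a range-building function expects.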

answered Jul 16 '18 at 15:10

Wiktor Stribiżew

@Wiktor I feel `ltrim($float, '0') floatval($float)` should be within the `buildNumChain()` because the user can't remove zeroes with the final output, like `floatval('12.00 & 001-15.0')`. The unnecessary zeros should be removed before being pieced together like that. – Gary Woods Jul 19 '18 at 07:58
@GaryWoods Yes, that should be there, or the regex might turn out to be [quite long](https://stackoverflow.com/a/35351336/3832970). – Wiktor Stribiżew Jul 19 '18 at 08:02
Pulling together an extensive list of test titles. The only instance I noticed I missed in my question above is `chapter 2B` which should results with `c2 part 2`. In the current workflow you have here, I am not sure if this adjustment is doable? – Henrik Petterson Jul 19 '18 at 16:07
@HenrikPetterson Does it mean `chapter 2A` should be `c2 part 1` and so on? Is it Excel like column numbering? Can it appear in between other numbers, like `chapter 2, 3, 4B, 5`? Yes, it is possible to handle, I just do not know your requirements. – Wiktor Stribiżew Jul 19 '18 at 16:19
Yes, excel column numbering. `chapter 2, 3, 4B, 5` should not convert. Only `chapter 10.4C` should be `c10.4 part 3`. – Henrik Petterson Jul 19 '18 at 18:04
I will have compiled a massive list of tests by the end of this weekend. In addition to the `chapter 2B` example, we have `chapter 2-1` which should convert to `c2 part 1`. I think the logic we can go with is if the connecting value `2-1` is less or equal, then it is a `part`. So `chapter 2-2` is `c2 part 2`, but `chapter 2-3` is `c2-3`... – Henrik Petterson Jul 20 '18 at 12:58
After further thought, adding a `part` case for `2-1` may simply be inaccurate at some times, because, how would they in this format show part 3 or chapter 2... `2-3` would be chapters 2 to 3... So I can't think of a reasonable logic for this so maybe better to simply display it as `2-1` (as it currently is). However, `chapter 2A` is a pretty common case. – Henrik Petterson Jul 20 '18 at 18:31
@HenrikPetterson Ok, adding `chapter 10.4C` (=> `c10.4 part 3`), `chapter 2B` (=> `c2 part 2`) to test cases. But let's limit to `A-Z`. – Wiktor Stribiżew Jul 20 '18 at 18:39
Agreed. And case insensitive, meaning `chapter 2a` is a match too. – Henrik Petterson Jul 20 '18 at 19:54
@WiktorStribiżew Thank you for the update, but it should only count in the `A-Z` if it is a single letter attached with the number. So `chapter 2A Hello` should be `c2 part 1`, but `chapter 2AHello` should be `c2`. Does this make sense? – Henrik Petterson Jul 21 '18 at 16:01
@HenrikPetterson Oh, add a word boundary `\b`, or `(?!\p{L})` after that - does that fix it? See https://regex101.com/r/iWteaX/7 and https://ideone.com/quod1c – Wiktor Stribiżew Jul 21 '18 at 16:12
Lovely, that looks like it is working. Can you please update the answer so we have the final version on-site. Thanks! – Henrik Petterson Jul 21 '18 at 16:39
@HenrikPetterson I updated. Once you confirm it is working as expected, I will clean up a bit and add / tidy up explanations. – Wiktor Stribiżew Jul 21 '18 at 16:48
@WiktorStribiżew Leading zeros and unnecessary decimals are not trimmed (properly) in [these scenario](https://ideone.com/IaQAxa). – Henrik Petterson Jul 23 '18 at 14:55
@HenrikPetterson Please see https://ideone.com/KA3Fph, I have added another test case, and it seems to be fixed now. – Wiktor Stribiżew Jul 23 '18 at 20:46
Looks like it is working! Running this through tests as we speak. Thanks! And if you get a chance, please add this change to the answer so others can see it clearly as well. – Henrik Petterson Jul 24 '18 at 11:50
I am doing final tests tomorrow but it is looking very good. You told me to let you know when it is working as expected... I believe it is now! – Henrik Petterson Jul 25 '18 at 17:05
Looks like I am too sleepy, sorry, in the morning. – Wiktor Stribiżew Jul 25 '18 at 20:58
Eternal-gratitude. <3 – Henrik Petterson Jul 26 '18 at 11:42
@WiktorStribiżew We discovered one recurring issue. If the string is `c0`, it converts to `c-`, [see this](https://ideone.com/Ltb8lM). Do you have a fix for this issue? It should convert to `c0`. EDIT: It could be the way you trimmed zeroes out of the matching float...? – Henrik Petterson Jul 30 '18 at 13:13
@WiktorStribiżew No worries, thank you. Here is a new [ideone](https://ideone.com/FmPRdF) with further test cases demonstrating this issue. – Henrik Petterson Jul 30 '18 at 13:56
@WiktorStribiżew Did you ever have a chance to look through the issue noted above? Thanks! – Henrik Petterson Aug 02 '18 at 12:04
@HenrikPetterson Yes, I have, but I was puzzled by the fact my captures were evaluated as empty when the value was `0`. I changed `!empty` to `strlen... > 0` and that seems to work now. See [this update](https://ideone.com/gzaRA9) (I also changed the handling of `$x[2]` but I doubt it is critical). – Wiktor Stribiżew Aug 02 '18 at 13:27
@WiktorStribiżew Ah yes, `empty(0)` equals true which is super confusing. Your update is throwing a `Undefined offset on line 19`. Can you please update your answer with this fix? Thank you for taking the time to do this! – Henrik Petterson Aug 02 '18 at 17:37
@HenrikPetterson Fixed and updated. [This post](https://stackoverflow.com/a/25101006/3832970) turned out helpful, we need `isset($x[N]) && strlen($x[N])` to check if a group matched. – Wiktor Stribiżew Aug 02 '18 at 21:20
Great! [Added](https://ideone.com/vwNOKH) the same approach on the `buildNumChain()` function so we cater for `c0 & 1` converting to `c0-1`. – Henrik Petterson Aug 03 '18 at 10:48
@WiktorStribiżew We've come across a reoccurring bug. See [this](https://ideone.com/q7KGxV). `ch357 - ch360` converts to `c357-h360` when it should convert to `c357-360`. Do you know why this is the case? Thanks in advance. – Henrik Petterson Aug 28 '18 at 13:48
@HenrikPetterson You may fix it by fixing the term array: all shorter terms with the same "prefix" must be placed closer to the end. So, the correct term declaration is `["ch", "c"]`. See [this demo](https://ideone.com/EVMEKx). You may also add a word boundary to fix another potential issue. – Wiktor Stribiżew Aug 28 '18 at 14:05
@WiktorStribiżew Could you please elaborate on what you mean with adding a word boundary? Is there no appropriate code fix to this rather than adjusting the term declaration? Thanks again. – Henrik Petterson Aug 28 '18 at 14:18
@HenrikPetterson There is a way, you need to sort all `$terms` by length in a descending order, [`usort($terms, function($a, $b) { return strlen($b) - strlen($a); });`](https://ideone.com/y0diBy). `\b` is added at the start of `$chapter_main_rx` and `$chapter_aux_rx` patterns. – Wiktor Stribiżew Aug 28 '18 at 14:27
Yes this solved it. I edited your answer with this update! – Henrik Petterson Aug 29 '18 at 07:53
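
The term-ordering problem from the last few comments can be reproduced in isolation. This is a sketch only: stripTermAfterHyphen is a hypothetical helper, reduced to the part of the callback regex where nothing after the optional term forces the regex engine to backtrack.

```php
<?php
// Hypothetical helper: remove whitespace and an optional term after a hyphen.
function stripTermAfterHyphen(array $terms, string $s): string
{
    $alt = implode('|', array_map('preg_quote', $terms));
    return preg_replace("~\s*-\s*(?:$alt)?\s*~i", '-', $s);
}

$terms = ['c', 'ch'];
// "c" is tried first and succeeds, so the "h" of "ch" is left behind.
echo stripTermAfterHyphen($terms, '357 - ch360'), "\n"; // 357-h360

// Longest terms first, as suggested in the comments above.
usort($terms, function ($a, $b) { return strlen($b) - strlen($a); });
echo stripTermAfterHyphen($terms, '357 - ch360'), "\n"; // 357-360
```

Sorting by descending length means a short term can only match where no longer term with the same prefix does.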

ElChiniNet · Answer 2 · 2019-10-08T23:20:25.937

I think that it is very complex to build something like this without producing some false positives, because some of the patterns might appear in a title by coincidence, and in those cases they will be detected by the code.

Anyway, I'll present one solution that might be interesting to you; experiment with it when you have some time. I have not tested it deeply, so if you find any problem with this implementation, let me know and I'll try to find a solution to it.

Looking at your patterns, all of them can be separated into two big groups:

- from one number to another number (G1)
- one or multiple numbers separated by commas, plus signs, or ampersands (G2)

So, if we can separate these two groups, we can treat them differently. From the following titles, I'll try to extract the chapter numbers in this way:

+-------------------------------------------+-------+------------------------+
| TITLE                                     | GROUP | EXTRACT                |
+-------------------------------------------+-------+------------------------+
| text chapter 25.6 text                    |  G2   | 25.6                   |
| text chapters 23, 24, 25 text             |  G2   | 23, 24, 25             |
| text chapters 23+24+25 text               |  G2   | 23, 24, 25             |
| text chapter 23, 25 text                  |  G2   | 23, 25                 |
| text chapter 23 & 24 & 25 text            |  G2   | 23, 24, 25             |
| text c25.5-30 text                        |  G1   | 25.5 - 30              |
| text c99-c102 text                        |  G1   | 99 - 102               |
| text chapter 99 - chapter 102 text        |  G1   | 99 - 102               |
| text chapter 1 - 3 text                   |  G1   | 1 - 3                  |
| 33 text chapter 1, 2 text 3               |  G2   | 1, 2                   |
| text v2c5-10 text                         |  G1   | 5 - 10                 |
| text chapters 23, 24, 25, 29, 31, 32 text |  G2   | 23, 24, 25, 29, 31, 32 |
| text chapters 23 and 24 and 25 text       |  G2   | 23, 24, 25             | 
| text chapters 23 and chapter 30 text      |  G2   | 23, 30                 | 
+-------------------------------------------+-------+------------------------+

To extract just the number of the chapters and differentiate them, one solution could be building a regular expression that captures two groups for the chapter ranges (G1) and one single group for the numbers separated by characters (G2). After the chapter numbers extraction, we can process the result to show the chapters correctly formatted.

I've seen that you are still adding more cases in the comments that are not contained in the question. To add a new case, create a new matching pattern and add it to the final regexp, following the rule of two capturing groups for the ranges and a single capturing group for the numbers separated by characters. Also, take into account that the most verbose patterns should be located before the less verbose ones. For example, ccc N - ccc N should be located before cc N - cc N, and this last one before c N - c N.

Here is the code:

$model = ['chapters?', 'chap', 'c']; // different type of chapter names
$c = '(?:' . implode('|', $model) . ')'; // non-capturing group for chapter names
$n = '\d+\.?\d*'; // chapter number
$s = '(?:[\&\+,]|and)'; // non-capturing group of valid separators
$e = '(?: |$)'; // end of a match (a space or the end of the string); inside [...] the $ would be a literal dollar sign

// Different patterns to match each case
$g1 = "$c *($n) *\- *$c *($n)$e"; // match chapter number - chapter number in all its variants (G1)
$g2 = "$c *($n) *\- *($n)$e"; // match chapter number - number in all its variants (G1)
$g3 = "$c *((?:(?:$n) *$s *)+(?:$n))$e"; // match chapter numbers separated by something in all its variants (G2) 
$g4 = "((?:$c *$n *$s *)+$c *$n)$e"; // match chapter number and chapter number ... and chapter number in all its variants (G2)
$g5 = "$c *($n)$e"; // match chapter number in all its variants (G2)

// Build a big non-capturing group with all the patterns
$reg = "/(?:$g1|$g2|$g3|$g4|$g5)/";

// Function to process each title
function getChapters ($title) {

    global $n, $reg;
    // Store the matches in one flatten array
    // arrays with three indexes correspond to G1
    // arrays with two indexes correspond to G2
    if (!preg_match($reg, $title, $matches)) return '';
    $numbers = array_values(array_filter($matches));

    // Show the formatted chapters for G1
    if (count($numbers) == 3) return "c{$numbers[1]}-{$numbers[2]}";

    // Show the formatted chapters for G2        
    if(!preg_match_all("/$n/", $numbers[1], $nmatches, PREG_PATTERN_ORDER)) return '';
    $m = $nmatches[0];
    $t = count($m);
    $str = "c{$m[0]}";
    foreach($m as $i => $mn) {
        if ($i == 0) continue;
        if ($mn == $m[$i - 1] + 1) {
            if (substr($str, -1) != '-') $str .= '-';
            if ($i == $t - 1 || $mn != $m[$i + 1] - 1) $str .= $mn;
        } else {
            if ($i < $t) $str .= ' & ';
            $str .= $mn;
        }
    }
    // return only after every number has been processed
    return $str;

}

You can check the code working on Ideone.
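
The way this approach tells G1 from G2 (counting the non-empty captures) can be shown with a reduced pattern. This is a sketch under assumptions: the regex below is a simplified stand-in with one range alternative and one list alternative, whole numbers only, and classify is a hypothetical name.

```php
<?php
// Simplified stand-in: the range alternative has two capturing groups,
// the list alternative has one.
$rx = '/c(?:hapters?)? *(?:(\d+) *- *(\d+)|((?:\d+ *[,&+] *)*\d+))/';

function classify(string $title, string $rx): string
{
    if (!preg_match($rx, $title, $m)) return 'none';
    // Drop the groups that did not take part in the match. Using strlen as
    // the callback keeps a literal "0", which a bare array_filter() would drop.
    $hits = array_values(array_filter($m, 'strlen'));
    // Full match plus two numbers means a range; otherwise it is a list.
    return count($hits) === 3 ? 'G1' : 'G2';
}

echo classify('text chapter 1 - 3 text', $rx), "\n";        // G1
echo classify('text chapters 23, 24, 25 text', $rx), "\n";  // G2
```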

Thank you! I will experiment with this one once I have an extensive list of titles. — Henrik Petterson, Jul 19 '18 at 18:06

Julio · Answer 3 · 2018-07-18T17:08:51.700

7

Try with this. Seems to work with given examples and some more:

<?php

$title[] = 'c005 - c009'; // c5-9
$title[] = 'c5.00 & c009'; // c5 & 9
$title[] = 'text c19 & c20 text'; //c19-20
$title[] = 'c19 & c20'; // c19-20
$title[] = 'text chapter 19 and chapter 25 text'; // c19 & 25
$title[] = 'text chapter 19 - chapter 23 and chapter 25 text'; // c19-23 & 25 (c19 for termless)
$title[] = 'text chapter 19 - chapter 23, chapter 25 text'; // c19-23 & 25 (c19 for termless)
$title[] = 'text chapter 23 text'; // c23
$title[] = 'text chapter 23, chapter 25-29 text'; // c23 & 25-29
$title[] = 'text chapters 23-26, 28, 29 + 30 + 32-39 text'; // c23-26 & 28-30 & 32-39
$title[] = 'text chapter 25.6 text'; // c25.6
$title[] = 'text chapters 23, 24, 25 text'; // c23-25
$title[] = 'text chapters 23+24+25 text'; // c23-25
$title[] = 'text chapter 23, 25 text'; // c23 & 25
$title[] = 'text chapter 23 & 24 & 25 text'; // c23-25
$title[] = 'text c25.5-30 text'; // c25.5-30
$title[] = 'text c99-c102 text'; // c99-102 (c99 for termless)
$title[] = 'text chapter 1 - 3 text'; // c1-3
$title[] = 'sometext 33 text chapter 1, 2 text 3'; // c1-2 or c33 if no terms
$title[] = 'text v2c5-10 text'; // c5-10 or c2 if no terms
$title[] = 'text cccc5-10 text'; // c5-10
$title[] = 'text chapters 23, 24, 25, 29, 31, 32 text'; // c23-25 & 29 & 31-32
$title[] = 'chapter 19 - chapter 23'; // c19-23 or c19 for termless
$title[] = 'chapter 12 part 2'; // c12

function get_chapter($text, $terms) {
  $rterms = sprintf('(?:%s)', implode('|', $terms));

  $and = '(?:  [,&+]|\band\b  )';
  $isrange = "(?:  \s*-\s*  $rterms?  \s*\d+  )";
  $isdotnum = '(?:\.\d+)';
  $the_regexp = "/(
    $rterms \s*  \d+  $isdotnum?  $isrange?   
    (  \s*  $and  \s*  $rterms?  \s*  \d+  $isrange?  )*
  )/mix";

  $result = array();
  $result['original'] = $text;
  if (preg_match($the_regexp, $text, $matches)) {
    $result['found_match'] = $tmp = $matches[1];
    $tmp = preg_replace("/$rterms\s*/i", '', $tmp);
    $tmp = preg_replace('/\s*-\s*/', '-', $tmp);
    $chapters = preg_split("/\s* $and \s*/ix", $tmp);
    $chapters = array_map(function($x) {
        // Drop leading zeros (but not zeros after a decimal point), then a trailing ".0", ".00", ...
        return preg_replace('/\.0+(?!\d)/', '',
               preg_replace('/(?<![\d.])0+(?=\d)/', '', $x
        ));
    }, $chapters);
    $chapters = merge_chapters($chapters);
    $result['converted'] = join_chapters($chapters);
  }
  else {
    $result['found_match'] = '';
    $result['converted'] = $text;
  }
  return $result;
}

function merge_chapters($chapters) {
  $i = 0;
  $begin = $end = -1;
  $rtchapters = array();
  foreach ($chapters as $chapter) {
    // Fetch next chapter
    $next = isset($chapters[$i+1]) ? $chapters[$i+1] : -1;
    // If not set, set begin chapter
    if ($begin == -1) {$begin = $chapter;}
    if (preg_match('/-/', $chapter)) {
      // It is a range, we reset begin/end and store the range
      $begin = $end = -1;
      array_push($rtchapters, $chapter);
    }
    else if ($chapter+1 == $next) {
      // next is current + 1, update end
      $end = $next;
    }
    else {
      // store result (if no end, then store current chapter, else store the range)
      array_push($rtchapters, sprintf('%s', $end == -1 ? $chapter : "$begin-$end"));
      $begin = $end = -1; // reset, since we stored results
    }
    $i++; // needed for $next
  }
  return $rtchapters;
}

function join_chapters($chapters) {
  return 'c' . implode(' & ', $chapters) . "\n";
}

print "\nTERMS LEGEND:\n";
print "Case 1. = ['chapters', 'chapter', 'ch', 'c']\n";
print "Case 2. = []\n\n\n\n";
foreach ($title as $t) {
  // If some patterns start by same letters, use longest first.
  print "Original: $t\n";
  print 'Case 1. = ';
  $result = get_chapter($t, ['chapters', 'chapter', 'ch', 'c']);
  print_r ($result);
  print 'Case 2. = ';
  $result = get_chapter($t, []);
  print_r ($result);
  print "--------------------------\n";
}

Output: See: https://ideone.com/Ebzr9R
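
As an alternative to cleaning numbers with regexes, a float round-trip normalizes a single chapter number. This is a sketch, not part of the code above: normalizeNumber is a hypothetical name, and it assumes each input is one plain decimal number (no ranges, no letters).

```php
<?php
// Hypothetical helper: normalise one chapter number by casting it to float
// and back, which drops leading zeros and a trailing ".0" / ".00".
function normalizeNumber(string $n): string
{
    return (string) (float) $n;
}

echo normalizeNumber('005'), "\n";   // 5
echo normalizeNumber('5.00'), "\n";  // 5
echo normalizeNumber('25.05'), "\n"; // 25.05
echo normalizeNumber('0'), "\n";     // 0
```

A range such as 005-009 would have to be split on the hyphen first and each side normalized separately.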

edited Jul 18 '18 at 17:08

answered Jul 16 '18 at 17:29

Julio

Thanks for posting an alternative. Can it work with `'chapter 23 and 33'; // c23 & 33` and `'chapters 23 and chapter 33'; // c23 & 33`? – Henrik Petterson Jul 16 '18 at 18:30
I will check through this properly tomorrow and get back to you. THANK YOU! – Henrik Petterson Jul 16 '18 at 19:45
@HenrikPetterson Just worth noting that I considered the `v2` of `v2c5-10` a mandatory pattern if there is no space before `c`. So it will match `v2c5-10` but not `ccc5-10`. If `v\d` is not mandatory and you can match anything, then the regexp will be a tad simpler. You can use this then on the script: https://regex101.com/r/0oQPlX/9 – Julio Jul 17 '18 at 07:21
Just trying this out @Julio - the following does not work so far: `c19 & c20`. Also, it is not always `v1c2`, it can be `xc2` or anything. – Henrik Petterson Jul 17 '18 at 11:44
@HenrikPetterson Both cases should work with the last version of the regexp. I have just updated my answer. – Julio Jul 17 '18 at 12:02
Stellar answer. You're offering an alternative approach to Wiktor's solution (who's basically a *God* in the regexiverse) - awesome! While I can totally read your code, it may be good to add comments to it so others can read it to. Also, if you look at my original (incomplete) code, I pass custom `$terms` to test the title. Would it be possible to adjust your code so we set the terms `$terms = ['chapter', 'ch', 'episode'...]`? And, would it be possible to pass an empty `$terms = [''];` and then we match (any numbers) using the current algorithm? Meaning, `hello 23 & 24` will be `c23-24`? – Henrik Petterson Jul 17 '18 at 12:40
Sure! I plan to comment the code once I match all requirements. So, if I only pass `$terms=['chapters']`, for example, does this mean that `chapters 19, 20, 21` should not be matched? (no 'chapters' before numbers) Also, if you pass nothing and just match numbers, then this mean you could match things like '2' at `v2 c19, c20` – Julio Jul 17 '18 at 15:05
Yes, that is correct, `2` will be matched in `v2 c19, c20`. Although `v2` will be stripped out of the string prior to running this. If we pass `$terms=['chapters']` on string `chapters 19, 20, 21 episode 3` then we should get `19-21`. Does this answer your questions? – Henrik Petterson Jul 17 '18 at 15:49
Yes, thank you. I have another one. In the case of `chapter 19 - chapter 23`, should it be matched with `$terms=['chapter']`? What about with no `$terms`? – Julio Jul 17 '18 at 16:06
Great question. In the case of `chapter 19 - chapter 23` with *no* terms, we should only get `c19`. The reason is because we could have string `chapter 12 part 2`. Hope that makes sense. – Henrik Petterson Jul 17 '18 at 16:39
Done! Now I use a different approach with the regex. Instead of trying to match a monster regex, I clean the text data a bit and then I use a simpler regex. It seems to be way easier to read and maintain. I also commented the code. – Julio Jul 17 '18 at 17:03
Excellent, but the "original match" is not displayed. To simplify what I mean, running it on string `hello chapters 12 & 21 world`, we end up with something like `array('original_string' => 'hello chapters 12 & 21 world', 'found_match' => 'chapters 12 & 21', 'converted' => 'c12 & 21')`... hopefully this makes sense! – Henrik Petterson Jul 17 '18 at 18:34
So I'll need to go back to my original code. I cannot return 'found_match' this way because I'm modifying the original text, before matching, in order to simplify the code. – Julio Jul 17 '18 at 18:54
@HenrikPetterson I updated my answer, now I return original_string, found_match and converted string. – Julio Jul 17 '18 at 22:56
@HenrikPetterson Did that last version work for you? If not, it would be nice for all of us if you could upload a file with more lines to test. – Julio Jul 18 '18 at 10:07
I am going to pull together an extended list to test shortly. Will report back; thanks Julio!! – Henrik Petterson Jul 18 '18 at 10:11
@HenrikPetterson I added a clean up for '005' and '5.00' like numbers – Julio Jul 18 '18 at 17:10
Thank you! I am putting together the test cases. It may take a few days to do all the tests on these answers. Will report back soon. – Henrik Petterson Jul 19 '18 at 08:40
Just checking in to see if you had any chance to look over this any further? Bounty ends in 7 hours :P – Julio Jul 25 '18 at 06:59

score 4 · Answer 4 · 2018-07-23T23:28:53.827

Use a general regex that captures the chapter info.

'~text\s+(?|chapters?\s+(\d+(?:\.\d+)?(?:\s*[-+,&]\s*\d+(?:\.\d+)?)*)|(?:v\d+)?((?:c\s*)?\d+(?:\.\d+)?(?:\s*[-]\s*(?:c\s*)?\d+(?:\.\d+)?)*)|(chapters?\s+\d+(?:\.\d+)?(?:\s*[-+,&]\s*chapter\s+\d+(?:\.\d+)?)*))\s+text~'

Then clean group 1: find '~[^-.\d+,&\r\n]+~' and replace with nothing ''.

Then clean the result: find '~[+&]~' and replace with a comma ','.
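
The two cleanup passes can be tried in isolation to see the intermediate values. This is a minimal sketch, assuming the sample strings are what group 1 would hold after the main match; `clean_chapter_capture` is a hypothetical helper name used only for illustration.

```php
<?php
// Hypothetical helper wrapping the two cleanup passes described above.
function clean_chapter_capture($captured)
{
    // Pass 1: drop everything except digits, dots, hyphens, "+", "," and "&".
    $cleaned = preg_replace('~[^-.\d+,&\r\n]+~', '', $captured);
    // Pass 2: treat "+" and "&" as list separators, i.e. turn them into commas.
    return preg_replace('~[+&]~', ',', $cleaned);
}

// Sample captures (assumed values of group 1 for some of the question's titles).
$samples = array('23 & 24 & 25', 'c99-c102', 'chapter 99 - chapter 102', '23+24+25');
foreach ($samples as $captured) {
    echo $captured, ' => ', clean_chapter_capture($captured), "\n";
}
```

After both passes only numbers, dots, hyphens and commas remain, which is the form the range-condensing step expects.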

Update
The PHP code below includes a function that consolidates sequences of individual
chapters into chapter ranges.

Main regex, readable version

 text
 \s+ 
 (?|
      chapters?
      \s+ 
      (                             # (1 start)
           \d+ 
           (?: \. \d+ )?
           (?:
                \s* [-+,&] \s* 
                \d+ 
                (?: \. \d+ )?
           )*
      )                             # (1 end)
   |  
      (?: v \d+ )?
      (                             # (1 start)
           (?: c \s* )?
           \d+ 
           (?: \. \d+ )?
           (?:
                \s* [-] \s* 
                (?: c \s* )?
                \d+ 
                (?: \. \d+ )?
           )*
      )                             # (1 end)
   |  
      (                             # (1 start)
           chapters?
           \s+ 
           \d+ 
           (?: \. \d+ )?
           (?:
                \s* [-+,&] \s* 
                chapter
                \s+ 
                \d+ 
                (?: \. \d+ )?
           )*
      )                             # (1 end)


 )
 \s+ 
 text

PHP code sample

http://sandbox.onlinephpfunctions.com/code/128cab887b2a586879e9735c56c35800b07adbb5

 $array = array(
 'text chapter 25.6 text',
 'text chapters 23, 24, 25 text',
 'text chapters 23+24+25 text',
 'text chapter 23, 25 text',
 'text chapter 23 & 24 & 25 text',
 'text c25.5-30 text',
 'text c99-c102 text',
 'text chapter 99 - chapter 102 text',
 'text chapter 1 - 3 text',
 '33 text chapter 1, 2 text 3',
 'text v2c5-10 text',
 'text chapters 23, 24, 25, 29, 31, 32 text');

 foreach( $array as $input ){
     if ( preg_match( '~text\s+(?|chapters?\s+(\d+(?:\.\d+)?(?:\s*[-+,&]\s*\d+(?:\.\d+)?)*)|(?:v\d+)?((?:c\s*)?\d+(?:\.\d+)?(?:\s*[-]\s*(?:c\s*)?\d+(?:\.\d+)?)*)|(chapters?\s+\d+(?:\.\d+)?(?:\s*[-+,&]\s*chapter\s+\d+(?:\.\d+)?)*))\s+text~',
                      $input, $groups ))
     {
         $chapters_verbose = $groups[1];
         $cleaned = preg_replace( '~[^-.\d+,&\r\n]+~', '',  $chapters_verbose );
         $cleaned = preg_replace( '~[+&]~',            ',', $cleaned   );

         $cleaned_and_condensed = CondnseChaptersToRanges( $cleaned );

         echo "\$title = '" . $input . "';  // c$cleaned_and_condensed\n";

     }        
 }

 function CondnseChaptersToRanges( $cleaned_chapters )
 {
         ///////////////////////////////////////
         // Combine chapter ranges.
         // Explode on commas.
         //
         $parts = explode( ',', $cleaned_chapters );
         $size = count( $parts );
         $chapter_condensed = '';

         for ( $i = 0; $i < $size; $i++ )
         {
             //echo "'$parts[$i]' ";
             if ( preg_match( '~^\d+$~', $parts[$i] ) )
             {
                 $first_num = (int) $parts[$i];
                 $last_num  = (int) $parts[$i];
                 $j = $i + 1;

                 while ( $j < $size && preg_match( '~^\d+$~', $parts[$j] ) && 
                         (int) $parts[$j] == ($last_num + 1) )
                 {
                     $last_num = (int) $parts[$j];
                     $i = $j;
                     ++$j ;
                 }
                 $chapter_condensed .= ",$first_num";
                 if ( $first_num != $last_num )
                     $chapter_condensed .= "-$last_num";
             }
             else
                 $chapter_condensed .= ",$parts[$i]";
         }
         $chapter_condensed = ltrim( $chapter_condensed, ',' );

         return $chapter_condensed;
 }

Output

 $title = 'text chapter 25.6 text';  // c25.6
 $title = 'text chapters 23, 24, 25 text';  // c23-25
 $title = 'text chapters 23+24+25 text';  // c23-25
 $title = 'text chapter 23, 25 text';  // c23,25
 $title = 'text chapter 23 & 24 & 25 text';  // c23-25
 $title = 'text c25.5-30 text';  // c25.5-30
 $title = 'text c99-c102 text';  // c99-102
 $title = 'text chapter 99 - chapter 102 text';  // c99-102
 $title = 'text chapter 1 - 3 text';  // c1-3
 $title = '33 text chapter 1, 2 text 3';  // c1-2
 $title = 'text v2c5-10 text';  // c5-10
 $title = 'text chapters 23, 24, 25, 29, 31, 32 text';  // c23-25,29,31-32
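
The output above separates entries with commas, while the question's examples use ` & ` (for instance `c23-25 & 29 & 31-32`). If that exact format matters, a small post-processing step should be enough. This is a minimal sketch, assuming the condensed string contains only numbers, dots, hyphens and commas; `format_chapters` is a hypothetical helper name, not part of the code above.

```php
<?php
// Hypothetical helper: turn "23-25,29,31-32" into "c23-25 & 29 & 31-32".
function format_chapters($condensed)
{
    // Split on commas and drop empty pieces left by stray separators.
    $parts = array_filter(explode(',', $condensed), 'strlen');
    return 'c' . implode(' & ', $parts);
}

echo format_chapters('23-25,29,31-32'), "\n";
echo format_chapters('23,25'), "\n";
```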

score 1 · Answer 5 · answered Jul 24 '18 at 19:57

I branched your example, adding in a bit to take e.g. "chapter" and match both "c" and "chapter", then pulled all the matching expressions from the strings, extracted the individual numbers out, flattened any ranges found, and returned a formatted string like you had in your comments for each one:

So here's the link: ideone

The function itself (modified yours a bit):

function get_chapter($text, $terms) {

    if (empty($text)) return;
    if (empty($terms) || !is_array($terms)) return;

    $values = false;

    $terms_quoted = array();
    //make e.g. "chapters" match either "c" OR "Chapters" 
    foreach ($terms as $term)
        //revert this to your previous one if you want the "terms" provided explicitly
        $terms_quoted[] = $term[0].'('.preg_quote(substr($term,1), '/').')?';

    $matcher = '/(('.implode('|', $terms_quoted).')\s*(\d+(?:\s*[&+,.-]*\s*?)*)+)+/i';

    //match the "chapter" expressions you provided
    if (preg_match($matcher, $text, $matches)) {
        if (!empty($matches[0])) {

            //extract the numbers, in order, paying attention to existing hyphen/range identifiers
            if (preg_match_all('/\d+(?:\.\d+)?|-+/', $matches[0], $numbers)) {
                $bot = NULL;
                $top = NULL;
                $nextIsTop = false;
                $results = array();
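                // Start a new range: both ends equal the current number.
                $setv = function (&$b, &$t, $v) {
                    $b = $v;
                    $t = $v;
                };
                // Close the current range (as "bot" or "bot-top"), store it in $r,
                // then start a new range at $n.
                $flatten = function (&$b, &$t, $n, &$r) {
                    $x = $b;
                    if ($b != $t) $x = $x . '-' . $t;
                    array_push($r, $x);
                    $b = $n;
                    $t = $n;
                    return $r;
                };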
                foreach ($numbers[0] as $num) {
                    if ($num == '-') $nextIsTop = true;
                    elseif ($nextIsTop) {
                        $top = $num;
                        $nextIsTop = false;
                    }
                    elseif (is_null($bot)) $setv($bot,$top,$num);
                    elseif ($num - $top > 1) $flatten($bot,$top,$num,$results);
                    else $top = $num;
                }
                return implode(' & ', $flatten ($bot,$top,$num,$results));
            }
        }
    }
}

And the calling block:

$text = array(
'9 text chapter 25.6 text', // c25.6
'text chapter 25.6 text', // c25.6
'text chapters 23, 24, 25 text', // c23-25
'chapters 23+24+25 text', // c23-25
'chapter 23, 25 text', // c23 & 25
'text chapter 23 & 24 & 25 text', // c23-25
'text c25.5-30 text', // c25.5-30
'text c99-c102 text', // c99-102
'text chapter 99 - chapter 102 text', // c99-102
'text chapter 1 - 3 text', // c1-3
'33 text chapter 1, 2 text 3', // c1-2
'text v2c5-10 text', // c5-10
'text chapters 23, 24, 25, 29, 31, 32 text', // c23-25 & 29 & 31-32
);
$terms = array('chapter', 'chapters');
foreach ($text as $snippet)
{
    $chapter = get_chapter($snippet, $terms);
    print("Chapter is: c".$chapter."\n");
}

Which results in the output:

Chapter is: c25.6
Chapter is: c25.6
Chapter is: c23-25
Chapter is: c23-25
Chapter is: c23 & 25
Chapter is: c23-25
Chapter is: c25.5-30
Chapter is: c99-102
Chapter is: c99-102
Chapter is: c1-3
Chapter is: c1-2
Chapter is: c5-10
Chapter is: c23-25 & 29 & 31-32
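
The term handling in `get_chapter()` above keeps the first letter of each term mandatory and makes the rest optional, which is why `chapter` also matches a bare `c`. A minimal sketch of just that step follows; `build_term_pattern` is a hypothetical name introduced for illustration, and the sketch assumes each term starts with a plain letter, since the first character is not passed through `preg_quote`.

```php
<?php
// Hypothetical helper mirroring the term expansion in get_chapter():
// first letter required, remainder of the term optional.
function build_term_pattern($term)
{
    return $term[0] . '(' . preg_quote(substr($term, 1), '/') . ')?';
}

$fragment = build_term_pattern('chapter');   // builds: c(hapter)?
$regex    = '/^' . $fragment . '$/i';

foreach (array('c', 'chapter', 'Chapter', 'chap') as $candidate) {
    printf("%-8s %s\n", $candidate, preg_match($regex, $candidate) ? 'matches' : 'no match');
}
```

One consequence of this design is that any `c` directly followed by a number counts as a chapter marker, so titles containing other `c`-prefixed codes may need the explicit-terms variant mentioned in the code comment.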
