I pull tons of posts from novel sites where they use this abbreviation for the volume and chapter: v5c91. So here, we have Volume 5 and Chapter 91.

Here are some examples of titles:

$string = 'hello v2c19 lorem';
$string = 'hello v2 c19 lorem';
$string = 'hello c19 lorem';
$string = 'v8 hello c19 lorem';
$string = 'hello lorem v01';

What regex can I use to pull the volume and chapter out of those examples, so that I end up with something like v8c19?

Henrik Petterson
  • What happens when only either of volume or chapter is provided? – blhsing Jul 16 '18 at 08:15
  • What if `v01` is not a volume, but some `version`? Regex won't tell one from the other. What is the rule here? If you plan to match `c` or `v` that are followed with 1+ digits as a whole word, it will be a very basic regex, but it might overfire in various situations. – Wiktor Stribiżew Jul 16 '18 at 08:17
  • Please fix the question: 1) add the language (or regex flavor) tag, 2) add the code you have so far, 3) if the code is too bad, add the actual pattern requirements. – Wiktor Stribiżew Jul 16 '18 at 08:24
  • @blhsing If either is provided, for example `hello v9 ipsum`, then we get `v9`. So just the volume. – Henrik Petterson Jul 16 '18 at 08:24
  • @WiktorStribiżew The type of posts I pull won't use any other variations of `v`, like `version`. So I understand this can be quite entry level stuff. – Henrik Petterson Jul 16 '18 at 08:26
  • If you just need the substitute, [see this demo at eval.in](https://eval.in/1040017) – bobble bubble Jul 20 '18 at 20:46

1 Answer

To avoid matching v{num} or c{num} in the middle of another word in a title, I think you want something like this:

(\bc\d+)|\bv\d+(c\d+) will capture chapters and (\bv\d+)|\bc\d+(v\d+) will capture volumes. In both cases the result ends up in whichever of the two capture groups matched.

EDIT: To capture partial chapters like c2.5, simply replace \d+ with a slightly modified regex that captures floating-point numbers: (?:[0-9]*[.])?[0-9]+

Each pattern looks for a word boundary followed by the letter (c or v) and then digits, OR, for a joined form like v1c3, it matches the other prefix (v1) at a word boundary and captures what follows it (c3).

Here are some examples:

const inputs = [
  'hello v2c19 lorem',
  'hello v2.5 c19 lorem',
  'hello c19 lorem',
  'v8 hello c19 lorem',
  'hello lorem c01',
  'novolume nav123',
  'hello noch123pter',
];

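// Return whichever capture group matched, or null when there is no match.
const find = (str, regex) => {
  let res = null;
  const match = regex.exec(str);
  if (match) {
    // Group 1: standalone form (e.g. "c19"); group 2: joined form (e.g. the "c19" in "v2c19").
    res = match[1] || match[2];
  }
  return res;
};
// Integer or decimal number, e.g. "19", "2.5" or ".5".
const FLOAT = `(?:[0-9]*[.])?[0-9]+`;
// Backslashes are doubled because the patterns are built from template strings.
const vRE = new RegExp(`(\\bv${FLOAT})|\\bc${FLOAT}(v${FLOAT})`);
const cRE = new RegExp(`(\\bc${FLOAT})|\\bv${FLOAT}(c${FLOAT})`);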
const output = inputs.map((title) => {
  const chapter = find(title, cRE);
  const volume = find(title, vRE);
  return {
    title,
    chapter,
    volume
  };
});

console.log(output);

It's possible to combine these into a single regex covering every combination (only chapter, only volume, volume then chapter with or without a space, and so on), but that gets confusing fast, and these two regexes are simple enough to do the job.
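If the goal is a single label such as v8c19, as asked in the question, one option is to run both regexes and concatenate whatever they find. The following is only a minimal sketch under that assumption; the names NUM, VOLUME_RE, CHAPTER_RE, firstGroup and toLabel are made up for illustration, and a missing part is treated as an empty string.

```javascript
// NUM is the same floating-point pattern described in the EDIT above.
const NUM = '(?:[0-9]*[.])?[0-9]+';
const VOLUME_RE = new RegExp(`(\\bv${NUM})|\\bc${NUM}(v${NUM})`);
const CHAPTER_RE = new RegExp(`(\\bc${NUM})|\\bv${NUM}(c${NUM})`);

// Return whichever capture group matched, or '' when there is no match.
const firstGroup = (str, regex) => {
  const match = regex.exec(str);
  return match ? (match[1] || match[2]) : '';
};

// Assumption: the label is always volume first, then chapter.
const toLabel = (title) =>
  firstGroup(title, VOLUME_RE) + firstGroup(title, CHAPTER_RE);

console.log(toLabel('v8 hello c19 lorem'));
console.log(toLabel('hello lorem v01'));
```

With this approach a title that has only a volume or only a chapter yields just that part, and a title with neither yields an empty string.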

AnilRedshift
  • Can I make it work with float, like `hello c9.5 ipsum`? And is there a way to combine them so it matches with `hello v2c95 ipsum`? – Henrik Petterson Jul 16 '18 at 08:30
  • I am not OP, but I would *love* to see a PHP version of your example code. +1 – Gary Woods Jul 16 '18 at 08:36
  • Sure, that's just a matter of using the pattern from https://stackoverflow.com/questions/12643009/regular-expression-for-floating-point-numbers instead of \d+. I'll update the answer – AnilRedshift Jul 16 '18 at 08:38
  • @GaryWoods I don't know PHP and it's too late at night for me to learn it. Hopefully someone else can help answer, although the core regex should be PCRE compatible. – AnilRedshift Jul 16 '18 at 09:03