How to extract the Bible book name, chapter, and verse numbers with a regular expression in PHP?

Question

I have seen this question: PHP preg_match bible scripture format

My problem is a little different, because I want to extract the elements, not just match them, and my input formats are more varied:

'John 14:16–17, 25–26'
'John 14:16–17'
'John 14:16'
'John 14 16'
'John 14:16'
'John14 : 16'
'John     14 16'
'John14:　　　16'
'John14:16—17'
'John14 16 17'
'John14 : 16 17'
'John14 : 16  —   17'
'John    14 16 17'
'约翰福音 14    16 17' -> here is an actual example of unicode text

The separators '-', ':', and ' ' may be either half-width or full-width characters, such as '－', '：', and '　'; both forms should work.

What I want is to extract the elements John (which should support Unicode), 14, 16, and 17 (if present).

I've tried:

$str = '10 : 12 — 15  % 52 .633 __+_+)_01(&( %&@#32$%!85#@60$'; 
preg_match_all('/[\d]+?/isU',$str, $t);

This does not work very well.

Then I tried:

preg_match_all("([\u4e00-\u9fa5]+)[^\d\n]*(\d+)[^\d\n]*(\d+)[^\d\n]*(\d*)", "John 14:16", $out);
var_dump($out);

This does not work either.

OK, I found a solution that works, but I'm not sure it is 100% correct:

preg_match_all('#([\x{4e00}-\x{9fa5}]+)[^\d\n]*(\d+)[^\d\n]*(\d+)[^\d\n]*(\d*)#u', $keyword, $match);

but I think John was a little bit before unicodes time, is it really fair to expect him to plan that far in advance? 7;^) — norlesh, Jan 02 '14 at 03:11
To paraphrase Stephen Hawking, ["if God transmitted information through data, He'd use Unicode"](http://www.hawking.org.uk/does-god-play-dice.html) — Jason Sperske, Jan 02 '14 at 03:15
@Jack I mean that 'John' may be written in Chinese; I wrote 'John' here only to make the example understandable. — Phoenix, Jan 02 '14 at 03:22
You can make this question better by giving actual examples of what you want to accept and not. You mentioned about "full-width" and "half-width" characters, I think you should provide examples (in Chinese is acceptable also). You also need to provide more clearly what exactly you want to extract. Should we extract just a sequence of numbers in "John 14 16 17"? Or should we separate which ones are the chapter number and the verse number? — justhalf, Jan 02 '14 at 06:40

revo · Accepted Answer · 2014-01-02T08:12:55.077

3

^(\p{L}+)?\s*(\d+)?[\p{Pd}\p{Zs}:]*(\d+)?[\p{Pd}\p{Zs}:]*(\d+)?

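You need \p{L} to match letters in any script, not just ASCII, so Unicode book names such as 约翰福音 are covered.

\p{Zs} matches any Unicode space separator (including the full-width space '　'), and \p{Pd} matches any kind of dash or hyphen (including '–', '—', and '－').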
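To make the two categories concrete, here is a small sketch that checks the separators from the question against \p{Pd} and \p{Zs}. The `$separators` list and its labels are only illustrative, and the script assumes the source file is saved as UTF-8 and that the `u` modifier is used.

```php
<?php
// Illustrative list of separators taken from the question (labels are made up here).
$separators = [
    'hyphen-minus'      => '-',
    'en dash'           => '–',
    'em dash'           => '—',
    'full-width hyphen' => '－',
    'space'             => ' ',
    'full-width space'  => '　',
    'colon'             => ':',
    'full-width colon'  => '：',
];

foreach ($separators as $name => $char) {
    // The u modifier makes PCRE read the subject as UTF-8 code points, not bytes.
    $isDash  = preg_match('/^\p{Pd}$/u', $char) === 1;
    $isSpace = preg_match('/^\p{Zs}$/u', $char) === 1;
    printf("%s: dash=%s space=%s\n", $name, $isDash ? 'yes' : 'no', $isSpace ? 'yes' : 'no');
}
```

Neither colon belongs to \p{Pd} or \p{Zs}, so a full-width colon '：' has to be listed explicitly in the character class if it should be accepted as a separator.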


// The u modifier makes PCRE treat pattern and subject as UTF-8, which non-ASCII input needs.
preg_match_all('/^(\p{L}+)?\s*(\d+)?[\p{Pd}\p{Zs}:]*(\d+)?[\p{Pd}\p{Zs}:]*(\d+)?/mu', "John 14:16", $out);
var_dump($out);
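If you want the parts back under descriptive keys, a variant with named groups might look like the sketch below. `parseReference` and the array keys are hypothetical names, not part of any library. It also makes a few assumptions beyond the pattern above: the book name and chapter are required, the full-width colon '：' is added to the separator class, `[0-9]` is used instead of `\d` so that only ASCII digits reach the `(int)` casts, and the source file is saved as UTF-8.

```php
<?php
/**
 * Hypothetical helper: split a reference such as "John 14:16–17" into parts.
 * Returns null when no book name followed by a chapter number is found.
 */
function parseReference(string $ref): ?array
{
    // Separators: any dash (\p{Pd}), any space separator (\p{Zs}),
    // plus the half-width and full-width colon, which are in neither category.
    $sep = '[\p{Pd}\p{Zs}:：]*';
    $pattern = '/^\s*(?<book>\p{L}+)' . $sep . '(?<chapter>[0-9]+)'
             . '(?:' . $sep . '(?<verse>[0-9]+))?'
             . '(?:' . $sep . '(?<end>[0-9]+))?/u';

    if (preg_match($pattern, $ref, $m) !== 1) {
        return null;
    }

    // Optional groups that did not take part in the match are absent or empty.
    $toInt = function (string $key) use ($m): ?int {
        return isset($m[$key]) && $m[$key] !== '' ? (int) $m[$key] : null;
    };

    return [
        'book'      => $m['book'],
        'chapter'   => (int) $m['chapter'],
        'verse'     => $toInt('verse'),
        'verse_end' => $toInt('end'),
    ];
}

var_dump(parseReference('John14 : 16  —   17'));
var_dump(parseReference('约翰福音 14　16－17'));
```

This sketch only reads the first verse or verse range, so the ', 25–26' part of 'John 14:16–17, 25–26' is ignored, and book names containing digits or spaces (for example '1 John') are not handled.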

edited Jan 02 '14 at 08:12

answered Jan 02 '14 at 07:54

revo
