
I've seen this question: PHP preg_match bible scripture format

But my problem is a little different because I want to extract the elements, not just match them, and my pattern is more complex:

'John 14:16–17, 25–26'
'John 14:16–17'
'John 14:16'
'John 14 16'
'John 14:16'
'John14 : 16'
'John     14 16'
'John14:   16'
'John14:16—17'
'John14 16 17'
'John14 : 16 17'
'John14 : 16  —   17'
'John    14 16 17'
'约翰福音 14    16 17' -> here is an actual example of unicode text

The separators '-', ':', and ' ' may appear in either full-width or half-width form; both should work.

What I want is to extract the individual elements: John (should support Unicode), 14, 16, and 17 (if present).

I've tried:

$str = '10 : 12 — 15  % 52 .633 __+_+)_01(&( %&@#32$%!85#@60$'; 
preg_match_all('/[\d]+?/isU',$str, $t);

This doesn't work very well.

Then I tried:

preg_match_all("([\u4e00-\u9fa5]+)[^\d\n]*(\d+)[^\d\n]*(\d+)[^\d\n]*(\d*)", "John 14:16", $out);
var_dump($out);

That doesn't work either.

OK, I found a solution that works, but I'm not sure it's 100% correct:

preg_match_all('#([\x{4e00}-\x{9fa5}]+)[^\d\n]*(\d+)[^\d\n]*(\d+)[^\d\n]*(\d*)#u', $keyword, $match);
Phoenix
  • Which bible book names are in Unicode? :) – Ja͢ck Jan 02 '14 at 03:11
  • but I think John was a little bit before unicodes time, is it really fair to expect him to plan that far in advance? 7;^) – norlesh Jan 02 '14 at 03:11
  • To paraphrase Stephen Hawking, ["if God transmitted information through data, He'd use Unicode"](http://www.hawking.org.uk/does-god-play-dice.html) – Jason Sperske Jan 02 '14 at 03:15
  • @Jack I mean, 'John' may be in Chinese; I wrote 'John' here just to make it understandable. – Phoenix Jan 02 '14 at 03:22
  • You can make this question better by giving actual examples of what you want to accept and what you don't. You mentioned "full-width" and "half-width" characters; I think you should provide examples (Chinese is acceptable too). You also need to state more clearly what exactly you want to extract. Should we extract just a sequence of numbers in "John 14 16 17"? Or should we separate the chapter number from the verse numbers? – justhalf Jan 02 '14 at 06:40

1 Answer

^(\p{L}+)?\s*(\d+)?[\p{Pd}\p{Zs}:]*(\d+)?[\p{Pd}\p{Zs}:]*(\d+)?

You need \p{L} to match letters in any script, not just ASCII.

\p{Zs} matches any kind of space separator, and \p{Pd} any kind of dash or hyphen.
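To see which of the separators from the question these two classes actually cover, a small check like the one below may help. It is only a sketch: `classify_separator` and the sample list are made up for illustration, the `"\u{...}"` escapes assume PHP 7 or later, and the `u` modifier is needed so that PCRE reads the input as UTF-8.

```php
<?php
// Hypothetical helper: report which Unicode class (if any) a single character belongs to.
function classify_separator(string $ch): string
{
    if (preg_match('/^\p{Pd}$/u', $ch)) {
        return 'Pd';        // dash punctuation
    }
    if (preg_match('/^\p{Zs}$/u', $ch)) {
        return 'Zs';        // space separator
    }
    return 'neither';
}

// Sample separators, half-width and full-width (illustrative list, not exhaustive).
$separators = [
    'hyphen-minus'      => '-',
    'en dash'           => "\u{2013}",
    'em dash'           => "\u{2014}",
    'fullwidth hyphen'  => "\u{FF0D}",
    'space'             => ' ',
    'ideographic space' => "\u{3000}",
    'fullwidth colon'   => "\u{FF1A}",
];

foreach ($separators as $label => $ch) {
    echo $label, ': ', classify_separator($ch), "\n";
}
```

The full-width colon falls into neither class, so a pattern that should accept it has to list it explicitly in the separator set.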


preg_match_all("/^(\p{L}+)?\s*(\d+)?[\p{Pd}\p{Zs}:]*(\d+)?[\p{Pd}\p{Zs}:]*(\d+)?/mu", "John 14:16", $out);
var_dump($out);
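If you want named parts rather than the raw match array, one possible way to wrap this is sketched below. The function name `parse_reference` and the array keys are made up for illustration. It also differs from the pattern above in three ways that are my own assumptions about what is wanted: the book name and chapter are required, the full-width colon (U+FF1A) is added to the separator set, and separators are allowed between the book name and the chapter.

```php
<?php
/**
 * Hypothetical helper: split a scripture reference into its parts.
 * Returns null when the string does not start with a book name and a chapter.
 */
function parse_reference(string $ref): ?array
{
    // Any dash, any space separator, ASCII colon, or full-width colon (U+FF1A).
    $sep = '[\p{Pd}\p{Zs}:\x{FF1A}]*';
    $pattern = '/^\s*(\p{L}+)' . $sep . '(\d+)' . $sep . '(\d+)?' . $sep . '(\d+)?/u';

    if (!preg_match($pattern, $ref, $m)) {
        return null;
    }

    // preg_match drops trailing groups that did not take part in the match,
    // so check for both "not set" and "empty".
    return [
        'book'       => $m[1],
        'chapter'    => $m[2],
        'verse_from' => (isset($m[3]) && $m[3] !== '') ? $m[3] : null,
        'verse_to'   => (isset($m[4]) && $m[4] !== '') ? $m[4] : null,
    ];
}

var_dump(parse_reference('John14 : 16  —   17'));
var_dump(parse_reference('约翰福音 14    16 17'));
```

Note that this only reads the first verse range: for 'John 14:16–17, 25–26' it stops after 17, and a book name containing digits or spaces (such as "1 John") would need a different first group.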
revo