Split string into array based on a unicode character range in PHP

Question

Sorry for the ambiguous subject. What I'm looking for is a way to split a string with Cyrillic characters, such as

«Добрый день!» - сказал он, потянувшись…

into an array that goes like

[0] => «
[1] => Добрый␠
[2] => день!»␠-␠
[3] => сказал␠
[4] => он,␠
[5] => потянувшись…

So essentially I'm looking for a break on the border between any character and a Cyrillic character ([а-я] range), but only when we transition from a non-Cyrillic character to a Cyrillic one, not vice versa. I've seen examples that solve a similar problem for punctuation characters and the Latin alphabet with

preg_split('/([^.:!?]+[.:!?]+)/', 'hello:there.everyone!so.how?are:you', NULL, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );

but my attempts to repurpose it into something different have so far failed:

preg_split ('/(?<=[^а-я])/ius', $text, NULL, PREG_SPLIT_NO_EMPTY);

almost works, but it also splits on regular characters such as spaces and punctuation marks, which is not what I want. Clearly there's something wrong with my regex. How should I modify it to get the result shown in the example above?

Why is the `«` character captured as a separate item, while its counterpart `»` is captured as part of the string `день!»..`? — RomanPerekhrest, Dec 09 '16 at 22:12
Yes, it's not really the best example, I'm willing to sacrifice the [0] there somehow. — Захар Joe, Dec 09 '16 at 22:46

score 2 · Answer 1 · answered Dec 09 '16 at 22:22

Use the following regex solution:

$s = "«Добрый день!» - сказал он, потянувшись…";
$res = preg_split('/\b(\p{Cyrillic}+\W*)/u', $s, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY); // -1 = no limit
print_r($res);
// Array(
//   [0] => «
//   [1] => Добрый 
//   [2] => день!» - 
//   [3] => сказал 
//   [4] => он, 
//   [5] => потянувшись…
//)

See the PHP demo

Details:

\b(\p{Cyrillic}+\W*) - matches and captures a whole Cyrillic word with 0+ non-word chars after it
The pattern is wrapped with capturing parentheses and PREG_SPLIT_DELIM_CAPTURE will push the captured values into the resulting array
PREG_SPLIT_NO_EMPTY will discard empty values in the array
The /u modifier treats the pattern and the subject as UTF-8 and makes \b (word boundary) and \W Unicode-aware.
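
To see what the two flags contribute, here is a minimal sketch that runs the same pattern with and without PREG_SPLIT_NO_EMPTY. The variable names are illustrative, and it assumes a PHP/PCRE build where /u really does make \b and \W Unicode-aware, which the comments in this thread show is not the case everywhere.

```php
<?php
// Illustrative sketch; variable names are not from the answer.
$s = "«Добрый день!» - сказал он, потянувшись…";
$pattern = '/\b(\p{Cyrillic}+\W*)/u';

// Captured delimiters are kept, but so are the empty strings
// that sit between two adjacent matches.
$withEmpty = preg_split($pattern, $s, -1, PREG_SPLIT_DELIM_CAPTURE);

// Adding PREG_SPLIT_NO_EMPTY drops those empty strings.
$noEmpty = preg_split($pattern, $s, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

// Nothing is consumed: joining the pieces gives back the input.
var_dump(implode('', $noEmpty) === $s);

print_r($withEmpty);
print_r($noEmpty);
```

If the second array comes back with a single element, the Unicode-aware \b is probably not in effect on that installation.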

I really like this elegant solution but when I try it in my own PHP, all I get is just a single line, no splits. It does work in your demo though. Why could that be? — Захар Joe, Dec 09 '16 at 22:40

score 2 · Answer 2 · bobble bubble · 2016-12-10T01:07:24.543

How about splitting at an initial \b word boundary, with the u modifier?

$res = preg_split('/\b(?=\w)(?!^)/u', $str);

The lookahead (?=\w) ensures that \b is followed by a word character. (?!^) prevents an empty match at the start of the string.

See this demo at eval.in

edited Dec 10 '16 at 01:07

answered Dec 09 '16 at 22:51

bobble bubble


It is a logical solution but unfortunately I need the breaks to occur only on cyrillic characters so that, for example, "слово word" doesn't get split into two. – Захар Joe Dec 09 '16 at 22:54
@ЗахарJoe In this case you could try [`$res = preg_split('/\b(?=[^\Wa-z])/iu', $str);`](https://eval.in/694227) – bobble bubble Dec 09 '16 at 22:56
I've just tried the two regexes you provided and unfortunately my version of PHP (5.5.38) for some reason returns just a single array element in both cases. – Захар Joe Dec 09 '16 at 23:04
@ЗахарJoe Probably the same issue with [`preg_split('/\b(?=\p{Cyrillic})/u', $str);`](https://eval.in/694230), which is similar to Wiktor's answer. – bobble bubble Dec 09 '16 at 23:07
It probably is, and it's kinda baffling. I did set mb_internal_encoding ( 'UTF-8' ); and I don't think it should need any other tricks. Wonder what's broken and where. – Захар Joe Dec 09 '16 at 23:08
Testing it in http://sandbox.onlinephpfunctions.com I see that this (and Wiktor's) regex is broken in PHP < 5.3.10. Go figure. Since my own version is newer it could be something in the default config. – Захар Joe Dec 09 '16 at 23:33
@ЗахарJoe Figured out that it's probably related to [**Bug #52971**](https://bugs.php.net/bug.php?id=52971), which was [fixed in PHP 5.3.4](http://php.net/ChangeLog-5.php#5.3.4). There are also a few related questions [like this one](http://stackoverflow.com/questions/4781898/regex-word-boundary-does-not-work-in-ut8-on-some-servers). You can try [`$res = preg_split('/(?<!\pL)(?=\p{Cyrillic})(?!^)/u', $str);`](https://eval.in/694318) or [`$res = preg_split('/(?<!\pL)(?=[^\PLa-z])(?!^)/iu', $str);`](https://eval.in/694324) – bobble bubble Dec 10 '16 at 01:46
Thank you bobble bubble for pointing that out. I suspected a bug but would've probably never found one possible solution mentioned there which is to recompile PCRE. By the way, those two regexes do indeed work. I'll probably run time on those to see which one is the most efficient. Thanks again!! – Захар Joe Dec 10 '16 at 10:26
@ЗахарJoe you're welcome, [put an answer referencing the bug here as well](http://stackoverflow.com/a/41074513/5527985). – bobble bubble Dec 10 '16 at 10:35
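
The lookaround variant from the comments above can be sketched as a small helper. The function name is hypothetical; the pattern is the first one from the comment. It relies only on \p{...} properties rather than on a Unicode-aware \b, which is presumably why it kept working on the affected installation. Type declarations are left out so the sketch also runs on PHP 5.x.

```php
<?php
// Hypothetical helper; only the pattern comes from the thread.
function splitBeforeCyrillicWords($text)
{
    // (?<!\pL)          the previous character is not a letter
    // (?=\p{Cyrillic})  the next character is Cyrillic
    // (?!^)             do not split at the very start of the string
    $parts = preg_split('/(?<!\pL)(?=\p{Cyrillic})(?!^)/u', $text);

    // preg_split returns false on failure, e.g. for invalid UTF-8 input.
    return $parts === false ? array($text) : $parts;
}

print_r(splitBeforeCyrillicWords("«Добрый день!» - сказал он, потянувшись…"));

// A Latin word does not start a new chunk; it stays attached to the previous one.
print_r(splitBeforeCyrillicWords("слово word тест"));
```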

score 1 · Accepted Answer · Martin Cup · 2016-12-09T22:40:49.130

You also have to check with a lookahead that the next character is a Cyrillic one. This code will do the job:

$t = preg_split('/(?<=[^а-я])(?=[а-я]+)/ius', $text, -1, PREG_SPLIT_NO_EMPTY); // -1 = no limit

It gives this output:

Array
(
    [0] => «
    [1] => Добрый 
    [2] => день!» - 
    [3] => сказал 
    [4] => он, 
    [5] => потянувшись…
)

Here you can try it.
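
One caveat about the [а-я] range, shown as a sketch rather than as part of the answer: the range (and, with /i, А-Я) does not contain ё (U+0451) or Ё (U+0401), so those letters are treated as non-Cyrillic. The sample text and variable names below are illustrative, and the \p{Cyrillic} pattern is a swapped-in alternative, not the answer's own.

```php
<?php
// Sketch only: compares the answer's [а-я] range with \p{Cyrillic}.
$text = "зелёная ёлка";

// ё is outside а-я, so it satisfies [^а-я]: the string is cut right after
// each ё, and no split happens in front of the word that starts with ё.
$byRange = preg_split('/(?<=[^а-я])(?=[а-я])/iu', $text, -1, PREG_SPLIT_NO_EMPTY);

// \p{Cyrillic} covers the whole script, including ё.
$byScript = preg_split('/(?<=\P{Cyrillic})(?=\p{Cyrillic})/u', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($byRange);
print_r($byScript);
```

Extending the class to [а-яё] would be another way to handle this particular letter.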

edited Dec 09 '16 at 22:40

answered Dec 09 '16 at 22:21

Martin Cup


Thank you, but I think you should also check out bobble bubble's answer, which seems to be a little more elegant. – Martin Cup Dec 09 '16 at 22:54

Have already voted for this. Another variant: [`$res = preg_split('/\b(?=[а-я])/iu', $str);`](https://eval.in/694236) – bobble bubble Dec 09 '16 at 23:09
Same story. My PHP disrespects something (although I don't see why it would do that) and only the lookahead variant works. – Захар Joe Dec 09 '16 at 23:16

score 0 · Answer 4 · answered Dec 09 '16 at 22:21

Try this regex: [\x{0400}-\x{04FF}]*[^\x{0400}-\x{04FF}]* (it needs the u modifier, since the code points are above 0xFF). The Unicode code points U+0400 to U+04FF make up the Cyrillic block. It should match exactly what you want. You can also replace \x{0400}-\x{04FF} with \p{Cyrillic} as suggested in another answer.

This is all the characters in that range:
ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҅҆҇҈҉ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿ

This regex loses every other word when I try it, only odd words are going into the array, even words are lost. — Захар Joe, Dec 09 '16 at 22:43
Don't use it with split, use it with match. This matches a string not a position to split. — Nicolas, Dec 09 '16 at 23:52
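
A minimal sketch of the "match, don't split" suggestion, assuming preg_match_all is what is meant by "match". Variable names are illustrative. Because both halves of the pattern are optional, it can also match the empty string, so empty matches are filtered out.

```php
<?php
// Sketch of using the pattern with preg_match_all instead of preg_split.
$s = "«Добрый день!» - сказал он, потянувшись…";

// A run of Cyrillic characters followed by a run of non-Cyrillic characters.
// The /u modifier is required for \x{...} code points above 0xFF.
$pattern = '/[\x{0400}-\x{04FF}]*[^\x{0400}-\x{04FF}]*/u';

preg_match_all($pattern, $s, $matches);

// Drop zero-length matches (the pattern can match nothing,
// for example at the end of the input) and reindex.
$chunks = array_values(array_filter($matches[0], 'strlen'));

print_r($chunks);
```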
