Split string with preg_split on English (and non-English) letters

Question

I want to separate my sentence(s) into two parts, because they are made up of English letters and non-English letters. I have a regex that I am using with `preg_split` to match normal (printable ASCII) letters and characters. However, it works the opposite way: I am left with only the Japanese text and not the English.

String I work with:

すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.

My attempt:

    $parts = preg_split("/[ -~]+$/", $cleanline); // $cleanline is the string above
    print_r($parts);

My result:

    Array
    (
        [0] => すぐに諦めて昼寝をするかも知れない。
        [1] =>
    )

As you can see, the second value is empty. How can I get the English and the non-English text into two separate strings? Why is the English text not returned even though I am using what I believe is the correct regex (from what I've been testing)?

This `[ -~]` is a range between space and tilde, is that what you are expecting? I think you may want `[- ~]+` that will give every english word as its own string, and the non english as one index (or multiples if there were a space). The `-` is a range unless it is escaped or the first/last character of the character class. — chris85, Nov 14 '16 at 02:55
Try `/(.+)([ -~])+$/` I suspect you need to put the text you want to capture separately into capture groups. — Red Mercury, Nov 14 '16 at 02:58
Possible duplicate of [Why is this regex allowing a caret?](http://stackoverflow.com/questions/29771901/why-is-this-regex-allowing-a-caret) — chris85, Nov 14 '16 at 03:07

score 2 · Answer 1 · answered Nov 14 '16 at 03:03

Try `mb_split` instead of the `preg_split` function:

    mb_regex_encoding('UTF-8');
    mb_internal_encoding("UTF-8");
    $parts = mb_split("/[ -~]+$/", $cleanline);

Arif Acar


`mb_split`, unlike other PHP regex functions, doesn't use delimiters, but that isn't the issue. The regex is the issue. – chris85 Nov 14 '16 at 03:04
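
To make the comment above concrete, here is a minimal sketch of why the regex is the issue. `preg_split` throws away whatever the pattern matches, and the pattern in the question matches the trailing English text itself, so the English becomes the discarded delimiter. Wrapping the pattern in a capture group and passing `PREG_SPLIT_DELIM_CAPTURE` is one possible way to keep it. The variable names `$lost` and `$kept` are made up for illustration.

```php
<?php
$cleanline = "すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.";

// The pattern matches the trailing run of printable ASCII,
// and preg_split removes that match from the result.
$lost = preg_split('/[ -~]+$/', $cleanline);

// With a capture group and PREG_SPLIT_DELIM_CAPTURE the matched text is
// returned as well; PREG_SPLIT_NO_EMPTY drops the empty trailing piece.
$kept = preg_split(
    '/([ -~]+)$/',
    $cleanline,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

// The captured English part still starts with the two separator spaces.
$kept = array_map('trim', $kept);

print_r($lost);
print_r($kept);
```

Here `$lost` reproduces the result from the question, while `$kept` holds the Japanese sentence and the English sentence as two separate strings.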

Ibrahim · Answer 2 · 2016-11-14T03:42:07.567

If you have two spaces between the two strings, as shown in your example, you can split them easily with a simple `\s{2}`:

    <?php
    $s = "すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.";
    $s = preg_split("/\s{2}/", $s);
    print_r($s);
    ?>

Output:

Array
(
    [0] => すぐに諦めて昼寝をするかも知れない。
    [1] => I may give up soon and just nap instead.
)

Demo: http://ideone.com/uD2W1Q
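
If the number of separator spaces varies, or the English part may itself contain a double space, a slightly more tolerant variation of the same idea could look like the sketch below. It assumes the first run of two or more whitespace characters is always the separator; the sample string is invented for illustration.

```php
<?php
// Hypothetical input: three spaces as separator, plus a double space
// inside the English part.
$s = "すぐに諦めて昼寝をするかも知れない。   I may give up soon.  Then I will nap.";

// \s{2,} accepts two or more whitespace characters, and the limit of 2
// stops after the first split so the English part stays in one piece.
$parts = preg_split('/\s{2,}/', $s, 2);

print_r($parts);
```

Without the limit, the double space after "soon." would produce a third element.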

score 2 · Accepted Answer · answered Nov 14 '16 at 09:28

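You could use lookarounds to split on the boundary between non-alphabetic characters and alphabetic characters or horizontal whitespace. The limit of 2 makes the split stop at the first such boundary: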

    $str = 'すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.';
    $parts = preg_split("/(?<=[^a-z])(?=[a-z\h])|(?<=[a-z\h])(?=[^a-z])/i", $str, 2);
    print_r($parts);

Output:

Array
(
    [0] => すぐに諦めて昼寝をするかも知れない。
    [1] =>   I may give up soon and just nap instead.
)
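
Note that the second element in the output above keeps the leading spaces, because the split point is a zero-width position in front of them. A small sketch of how that could be tidied up follows; the helper name `splitJapaneseEnglish` is hypothetical and not part of the answer.

```php
<?php
/**
 * Hypothetical helper: split at the first boundary found by the lookaround
 * pattern from the answer, then trim ASCII whitespace from both pieces.
 */
function splitJapaneseEnglish(string $line): array
{
    $pattern = '/(?<=[^a-z])(?=[a-z\h])|(?<=[a-z\h])(?=[^a-z])/i';
    $parts = preg_split($pattern, $line, 2);

    // trim() only strips ASCII whitespace bytes, so the UTF-8 text is untouched.
    return array_map('trim', $parts);
}

$str = 'すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.';
$parts = splitJapaneseEnglish($str);
print_r($parts);
```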

Toto


Thanks! I am currently using `$parts = preg_split("/[A-Za-z0-9_~\-!@#\$%\^&\*]|(?<=[a-z\h])(?=[^a-z])/i", $cleanline, 2);` It sometimes cuts off the first capital letter of the English text for some reason, though. I'll be looking into that. – 今際のアリス Nov 19 '16 at 23:46
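
A likely explanation for the missing capital letter, sketched under the assumption that it happens when only a single space separates the two sentences: the first alternative in that pattern is a character class, so it matches (and `preg_split` consumes) a real character, whereas the lookaround alternatives match only a zero-width position. The short sample string is invented for illustration.

```php
<?php
// Hypothetical input with a single space between the two parts.
$line = "昼寝。 I may nap.";

// Pattern from the comment above, as PCRE receives it: the character class
// matches the "I" itself, so preg_split removes it as the delimiter.
$consuming = '/[A-Za-z0-9_~\-!@#$%\^&\*]|(?<=[a-z\h])(?=[^a-z])/i';
$cut = preg_split($consuming, $line, 2);

// Lookaround-only pattern from the accepted answer: every alternative is
// zero-width, so no character is lost.
$zeroWidth = '/(?<=[^a-z])(?=[a-z\h])|(?<=[a-z\h])(?=[^a-z])/i';
$whole = preg_split($zeroWidth, $line, 2);

print_r($cut);
print_r($whole);
```

In this sketch `$cut[1]` starts with " may", while `$whole[1]` still contains the leading "I".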
