I want to split my sentence(s) into two parts, because they consist of English text and non-English (Japanese) text. I am using a regex with `preg_split` that I expected to pick out the ordinary letters and characters. It does the opposite, though: I am left with only the Japanese and not the English.

The string I am working with:

すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.

My attempt:

    $parts = preg_split("/[ -~]+$/", $cleanline); // $cleanline is the string above
    print_r($parts);

My result:

Array ( [0] => すぐに諦めて昼寝をするかも知れない。   [1] => ) 

As you can see, I get an empty second value. How can I get the English and the non-English text into two separate strings? Why is the English text not returned even though the regex looks correct (as far as I have tested)?

  • This `[ -~]` is a range between space and tilde; is that what you are expecting? I think you may want `[- ~]+`, which will give every English word as its own string and the non-English text as one index (or several, if it contained a space). The `-` is a range unless it is escaped or is the first/last character of the character class. – chris85 Nov 14 '16 at 02:55
  • Try `/(.+)([ -~])+$/`. I suspect you need to put the parts you want to capture separately into capture groups. – Red Mercury Nov 14 '16 at 02:58
  • @RedMercury `preg_split` doesn't capture. – chris85 Nov 14 '16 at 03:03
  • Possible duplicate of [Why is this regex allowing a caret?](http://stackoverflow.com/questions/29771901/why-is-this-regex-allowing-a-caret) – chris85 Nov 14 '16 at 03:07
  • Do you have two spaces between the two strings? – Ibrahim Nov 14 '16 at 03:32
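To illustrate the point raised in the comments: by default `preg_split` throws away whatever the pattern matches, which is why the English half vanishes here. The following is a minimal sketch, not a definitive fix. It assumes the input always ends with a run of printable ASCII, and it uses the `PREG_SPLIT_DELIM_CAPTURE` and `PREG_SPLIT_NO_EMPTY` flags so that the matched text is returned instead of discarded. The variable names `$withoutCapture` and `$withCapture` are only for illustration.

```php
<?php
// The input from the question: Japanese text, two spaces, then English text.
$cleanline = "すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.";

// The pattern matches the trailing ASCII run, so preg_split treats that run
// as the delimiter and discards it, leaving an empty second element.
$withoutCapture = preg_split('/[ -~]+$/', $cleanline);

// Wrapping the pattern in a capture group and passing PREG_SPLIT_DELIM_CAPTURE
// keeps the delimiter text; PREG_SPLIT_NO_EMPTY drops the empty trailing piece.
$withCapture = preg_split(
    '/([ -~]+)$/',
    $cleanline,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

// The captured English part still starts with the two separator spaces.
$parts = array_map('trim', $withCapture);
print_r($parts);
```

With this input, `$parts[0]` holds the Japanese sentence and `$parts[1]` the English one.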

3 Answers

Try `mb_split` instead of the `preg_split` function.

    mb_regex_encoding('UTF-8');
    mb_internal_encoding("UTF-8");
    $parts = mb_split("/[ -~]+$/", $cleanline);
Arif Acar
  • `mb_split`, unlike other PHP regex functions, doesn't use delimiters, but that isn't the issue. The regex is the issue. – chris85 Nov 14 '16 at 03:04

If you have two spaces between the two strings, as shown in your example, you can split them easily with a simple `\s{2}`:

    <?php
    $s = "すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.";
    $s = preg_split("/\s{2}/", $s);
    print_r($s);
    ?>

Output:

Array
(
    [0] => すぐに諦めて昼寝をするかも知れない。
    [1] => I may give up soon and just nap instead.
)

Demo: http://ideone.com/uD2W1Q
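If the number of spaces between the two halves is not always exactly two, a slightly more tolerant variant might help. This is only a sketch under that assumption: the three-space input is hypothetical, and the limit argument of `2` is added so that any later double spaces inside the English text do not produce extra pieces.

```php
<?php
// Hypothetical input: three spaces between the two halves instead of two.
$s = "すぐに諦めて昼寝をするかも知れない。   I may give up soon and just nap instead.";

// \s{2,} treats any run of two or more whitespace characters as one delimiter.
// The limit of 2 stops splitting after the first delimiter is found.
$parts = preg_split('/\s{2,}/', $s, 2);
print_r($parts);
```

This still relies on the English text being separated from the Japanese text by at least two whitespace characters.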

Ibrahim

You could use lookarounds to split on the boundary between non-alphabetic characters and alphabetic characters or horizontal whitespace:

    $str = 'すぐに諦めて昼寝をするかも知れない。  I may give up soon and just nap instead.';
    $parts = preg_split("/(?<=[^a-z])(?=[a-z\h])|(?<=[a-z\h])(?=[^a-z])/i", $str, 2);
    print_r($parts);

Output:

Array
(
    [0] => すぐに諦めて昼寝をするかも知れない。
    [1] =>   I may give up soon and just nap instead.
)
Toto
  • Thanks! I'm currently using `$parts = preg_split("/[A-Za-z0-9_~\-!@#\$%\^&\*\(\)]|(?<=[a-z\h])(?=[^a-z])/i", $cleanline, 2);`. It sometimes cuts off the first capital letter of the English text for some reason, though; I'll be looking into that. – 今際のアリス Nov 19 '16 at 23:46
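A likely explanation for the missing first letter is that the first alternative in that pattern is a plain character class, which consumes the character it matches, and `preg_split` discards whatever the delimiter consumed. The sketch below shows the difference using a simplified class and a hypothetical input with a single space before the English part. `$consuming` and `$lookahead` are illustrative names only.

```php
<?php
// Hypothetical input with only one space before the English part.
$line = "すぐに諦めて昼寝をするかも知れない。 I may give up soon.";

// A consuming character class: the matched "I" becomes the delimiter
// and is dropped from the result.
$consuming = preg_split('/[A-Za-z0-9]/', $line, 2);

// A zero-width lookahead matches at the same position but consumes nothing,
// so the "I" stays in the second piece.
$lookahead = preg_split('/(?=[A-Za-z0-9])/', $line, 2);

print_r($consuming);
print_r($lookahead);
```

In the first result the English piece starts with " may", while in the second it starts with "I may".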