Split MB string based on length

Question

I have a string that contains multibyte characters, in a language I cannot read:

先秦兩漢先秦兩漢先秦兩漢漢先秦兩漢漢先秦兩漢( 243071)

My requirement is to split it into an array, using PHP, whenever it exceeds a character limit. For example, when it is longer than 15 characters.

For that, I have tried

if(mb_strlen($string) > 15){

    $seed = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
}

But it is breaking. It does not break in every case, only for a string that has 35 characters.

Another approach I have tried is this function:

function word_chunk($str, $len = 76, $end = "||") {
    $pattern = '~.{1,' . $len . '}~u'; // like "~.{1,76}~u"
    $str = preg_replace($pattern, '$0' . $end, $str);
    return rtrim($str, $end);
}

Please help, and note that I need this for multibyte (MB) characters only.

Show us a sample input string (or three) and expected output. — mickmackusa, Dec 20 '17 at 09:57
Actually, I cannot, because it is in another language and I do not know how confidential the content is. I am not able to read that language; I am just a coder, you know :) — Gagan, Dec 20 '17 at 10:01
Are you looking for a multi-byte [`wordwrap`](http://php.net/manual/en/function.wordwrap.php) equivalent? Something like https://stackoverflow.com/a/4988494/487813 ? — apokryfos, Dec 20 '17 at 10:01
How can we reproduce your issue and understand your expected result then? This is a criteria for on topic questions. — mickmackusa, Dec 20 '17 at 10:03
Not wordwrap, but a str_split equivalent, like str_split($str, 3); the MB functions do not give me a length parameter. — Gagan, Dec 20 '17 at 10:03
I have updated my question with text, I have no clue what that means — Gagan, Dec 20 '17 at 10:05
@Gagandeep I think you should keep the `if(mb_strlen($string) > 15)` condition, and if it is true, use `preg_match_all('~\X~u', $str, $arr); return $arr[0];`. See [**this PHP demo**](https://3v4l.org/F3Fq9). — Wiktor Stribiżew, Dec 20 '17 at 10:39
If you need an output like in mickmackusa's answer, see [**this PHP demo**](https://3v4l.org/mt6Vm) — Wiktor Stribiżew, Dec 20 '17 at 10:53
Actually, when I checked the code in the snippet checker it worked fine, but when I process my CSV, which has many rows, it breaks and shows "This page isn’t working". — Gagan, Dec 20 '17 at 12:01

mickmackusa · Accepted Answer · 2017-12-20T11:52:21.520


This will split your string after every 10th "extended grapheme cluster" (suggested by Wiktor up in the comments).

var_export(preg_split('~\X{10}\K~u', $string));

preg_split('~.{10}\K~u', $string) will work on your sample string, but for cases beyond yours, \X is more robust when dealing with Unicode, because it treats a base character together with its combining marks as a single unit.

From https://www.regular-expressions.info/unicode.html:

You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.
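To make the difference concrete, here is a small sketch (my own illustration, not taken from the linked page) that counts matches for the dot and for \X on an "e" followed by a combining acute accent. It assumes PHP 7.0 or later for the `\u{...}` string escape.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT:
// two code points, but one visible character.
$text = "e\u{0301}";

// With the u modifier, the dot matches one code point at a time.
$dotCount = preg_match_all('~.~u', $text);

// \X matches a whole extended grapheme cluster
// (the base character plus its combining marks).
$clusterCount = preg_match_all('~\X~u', $text);

var_dump($dotCount);     // int(2)
var_dump($clusterCount); // int(1)
```

Splitting on the dot could therefore separate an accent from its base letter at a chunk boundary, while \X keeps them together.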

Here is a related SO page.

The \K resets the start of the reported match, so the ten clusters matched before it are not consumed as the delimiter and no characters are lost in the split.
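As a rough illustration of what \K changes, the sketch below splits a short sample string (chosen here only for the demonstration) with and without it, using a chunk size of 3 to keep the output small.

```php
<?php
$text = '先秦兩漢先秦兩'; // 7 characters, illustrative sample only

// Without \K, the three matched characters ARE the delimiter,
// so preg_split() throws them away.
$withoutK = preg_split('~\X{3}~u', $text);

// With \K, the reported match is empty and sits just after the three
// characters, so the split happens there and nothing is discarded.
$withK = preg_split('~\X{3}\K~u', $text);

var_export($withoutK); // array('', '', '兩')
echo "\n";
var_export($withK);    // array('先秦兩', '漢先秦', '兩')
```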

Here is a demo where $len=10 https://regex101.com/r/uO6ur9/2

Code: (Demo)

$string='先秦兩漢先秦兩漢先秦兩漢漢先秦兩漢漢先秦兩漢( 243071)';
var_export(preg_split('~\X{10}\K~u', $string));

Output:

array (
  0 => '先秦兩漢先秦兩漢先秦',
  1 => '兩漢漢先秦兩漢漢先秦',
  2 => '兩漢( 243071',
  3 => ')',
)

Implementation:

function word_chunk($str,$len){
    return preg_split('~\X{'.$len.'}\K~u',$str);
}
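One edge case worth knowing about: when the string length is an exact multiple of the chunk size, a plain preg_split('~\X{N}\K~u', $str) call returns an extra empty string at the end, because the last split point is at the very end of the string. A possible hardened variant is sketched below; the name chunk_clusters and the exception handling are my own assumptions, not part of the original answer.

```php
<?php
/**
 * Hypothetical helper: split $str into pieces of at most $len
 * extended grapheme clusters.
 */
function chunk_clusters(string $str, int $len): array
{
    if ($len < 1) {
        throw new InvalidArgumentException('$len must be at least 1');
    }
    // PREG_SPLIT_NO_EMPTY drops the empty trailing element that appears
    // when the string length is an exact multiple of $len.
    $parts = preg_split('~\X{' . $len . '}\K~u', $str, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    if ($parts === false) {
        throw new RuntimeException('preg_split() failed: ' . preg_last_error());
    }
    return $parts;
}

var_export(preg_split('~\X{2}\K~u', '先秦兩漢')); // array('先秦', '兩漢', '')
echo "\n";
var_export(chunk_clusters('先秦兩漢', 2));        // array('先秦', '兩漢')
```

On PHP 7.4 or later, the built-in mb_str_split($str, $len) is another option, but it counts code points rather than grapheme clusters.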

While preg_split() might be slightly slower than preg_match_all(), one advantage is that preg_split() returns the desired 1-dimensional array directly. preg_match_all() fills a multi-dimensional array, of which you would only need the [0] subarray (the full-string matches).
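For comparison, a preg_match_all() version might look like the sketch below. The quantifier {1,10} is my choice for this illustration; it lets the final, shorter remainder match as well.

```php
<?php
$string = '先秦兩漢先秦兩漢先秦兩漢漢先秦兩漢漢先秦兩漢( 243071)';
$len = 10;

// \X{1,10} greedily takes up to 10 clusters per match, so the last
// chunk may be shorter than the others.
preg_match_all('~\X{1,' . $len . '}~u', $string, $matches);

// The full-string matches live in the [0] subarray.
$chunks = $matches[0];

var_export($chunks);
```

For the sample string this yields the same four chunks as the preg_split() call, and it never produces an empty element.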

edited Dec 20 '17 at 11:52

answered Dec 20 '17 at 10:06


$seed = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY); -- what is the difference, please explain? – Gagan Dec 20 '17 at 10:07
That will split on every mb character: http://sandbox.onlinephpfunctions.com/code/4c15f2f40df93535e24f7179cd551ef9d8681228 – mickmackusa Dec 20 '17 at 10:10
I see the tester website for regex, can I test my actual text there, and how do I know if my text is working fine? Please help – Gagan Dec 20 '17 at 10:12
just a sec you are fast, let me test – Gagan Dec 20 '17 at 10:13
I am quite confident it is working, let me test finally on my dev instance – Gagan Dec 20 '17 at 10:15
Can you please make it like a function like i wrote in my question? – Gagan Dec 20 '17 at 10:17
function word_chunk($str) { $array = preg_split('~.{1,20}\K~u', $string, -1, PREG_SPLIT_NO_EMPTY); return $array; } – Gagan Dec 20 '17 at 10:18
+1 for the help so far – Gagan Dec 20 '17 at 12:02
As I said the code is working when I run it individually, but my CSV is breaking, I think my magento code is breaking somewhere – Gagan Dec 20 '17 at 12:03
It worked. In the end, the issue was my laziness: I used Ctrl+H to replace every ** with ||, and it turned out that the comment above the function was changed as well, so my PHP file was not executing properly. – Gagan Dec 20 '17 at 12:42
