I need to split a string into an array of letters. The problem is that my language (Croatian) also has double-character letters (e.g. lj, nj, dž).

So a string such as ljubičicajecvijet should be split into an array that looks like this:

Array
(
    [0] => lj
    [1] => u
    [2] => b
    [3] => i
    [4] => č
    [5] => i
    [6] => c
    [7] => a
    [8] => j
    [9] => e
    [10] => c
    [11] => v
    [12] => i
    [13] => j
    [14] => e
    [15] => t
)

Here is the list of Croatian letters in an array (I included the English letters as well).

$alphabet= array(
            'a', 'b', 'c',
            'č', 'ć', 'd',
            'dž', 'đ', 'e',
            'f', 'g', 'h',
            'i', 'j', 'k',
            'l', 'lj', 'm',
            'n', 'nj', 'o',
            'p', 'q', 'r',
            's', 'š', 't',
            'u', 'v', 'w',
            'x', 'y', 'z', 'ž'
          );
dodo254
  • So how would you know whether the string contains `l` and `j` as separate letters rather than the single letter `lj`? – Mihai Matei Oct 29 '16 at 13:18
  • Well, I was thinking about categorising letters by their number of characters. The word would first be split by the letters with more characters, and then by the single-character letters. Unfortunately, that also brings problems. – dodo254 Oct 29 '16 at 13:28

2 Answers

You can use this kind of solution:

Data:

$text = 'ljubičicajecviježdžt';

$alphabet = [
            'a', 'b', 'c',
            'č', 'ć', 'd',
            'dž', 'đ', 'e',
            'f', 'g', 'h',
            'i', 'j', 'k',
            'l', 'lj', 'm',
            'n', 'nj', 'o',
            'p', 'q', 'r',
            's', 'š', 't',
            'u', 'v', 'w',
            'x', 'y', 'z', 'ž'
];

1. Sort the alphabet by length so that the double letters come first. A regex alternation tries its alternatives from left to right, so lj has to appear before l.

// 2 letters first
usort($alphabet, function ($a, $b) {
    // usort expects an integer (<0, 0, >0), not a boolean.
    // Longer letters first; equal lengths fall back to plain string order.
    return mb_strlen($b) <=> mb_strlen($a) ?: $a <=> $b;
});

var_dump($alphabet);
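
To see why the order matters, here is a small sketch. The sample word and both patterns are made up for illustration and are not part of the code above. PCRE tries alternatives from left to right, so when `l` is listed before `lj`, the single letter wins and the double letter is never matched.

```php
<?php
// Illustrative only: compare alternation order on a sample word.
$word = 'ljeto';

// Single letters listed first: 'l' matches before 'lj' gets a chance.
preg_match_all('/l|j|lj|e|t|o/u', $word, $wrong);

// Double letter listed first: 'lj' is kept together.
preg_match_all('/lj|l|j|e|t|o/u', $word, $right);

print_r($wrong[0]); // l, j, e, t, o
print_r($right[0]); // lj, e, t, o
```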

2. Finally, split. I used preg_split, with preg_quote to escape any regex metacharacters in the letters.

// split
$alphabet = array_map('preg_quote', $alphabet); // protect preg_split
$pattern = implode('|', $alphabet); // 'dž|lj|nj|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z|ć|č|đ|š|ž'

var_dump($pattern);

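// 'u' makes PCRE treat pattern and subject as UTF-8 (needed for case-insensitive č, š, ž ...);
// -1 means "no limit" (passing null is deprecated in newer PHP versions).
var_dump( preg_split('`(' . $pattern . ')`ui', $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY) );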

And the result :)

array (size=18)
  0 => string 'lj' (length=2)
  1 => string 'u' (length=1)
  2 => string 'b' (length=1)
  3 => string 'i' (length=1)
  4 => string 'č' (length=2)
  5 => string 'i' (length=1)
  6 => string 'c' (length=1)
  7 => string 'a' (length=1)
  8 => string 'j' (length=1)
  9 => string 'e' (length=1)
  10 => string 'c' (length=1)
  11 => string 'v' (length=1)
  12 => string 'i' (length=1)
  13 => string 'j' (length=1)
  14 => string 'e' (length=1)
  15 => string 'ž' (length=2)
  16 => string 'dž' (length=3)
  17 => string 't' (length=1)
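
For reuse, the two steps can be wrapped in one function. This is only a sketch: the name `split_letters` is made up here, and it uses preg_match_all instead of preg_split, which means characters that are not in the alphabet (spaces, digits) are dropped rather than returned.

```php
<?php
/**
 * Hypothetical helper (name not from the code above): split $text into
 * letters of $alphabet, preferring longer letters such as 'lj' over 'l' + 'j'.
 */
function split_letters(string $text, array $alphabet): array
{
    // Longest letters first so the regex tries 'lj' before 'l'.
    usort($alphabet, function ($a, $b) {
        return mb_strlen($b) <=> mb_strlen($a);
    });

    // Escape each letter for use inside the backtick-delimited pattern.
    $quoted = array_map(function ($letter) {
        return preg_quote($letter, '`');
    }, $alphabet);

    // 'u' treats pattern and subject as UTF-8.
    preg_match_all('`' . implode('|', $quoted) . '`u', $text, $matches);

    return $matches[0];
}

$letters = split_letters('njuškadžep', ['a', 'd', 'dž', 'e', 'j', 'k', 'n', 'nj', 'p', 'u', 'š', 'ž']);
print_r($letters); // nj, u, š, k, a, dž, e, p
```

Like the code above, this always prefers the double letter, so it cannot tell a real nj from an n followed by a j, which is the case raised in the comments on the question.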
Georges O.
  • Great solution, thank you very much for your answer :D – dodo254 Oct 30 '16 at 10:56
  • Just wanted to ask you something. While playing with your code I tried changing your usort to: `usort($alphabet, function($a, $b) { return mb_strlen($a) < mb_strlen($b); });` It seems to work fine like that as well. What do you think about that? – dodo254 Oct 30 '16 at 12:27
  • Yes, of course :) It works because the behavior is the same. The second check was there to also order letters of the same length alphabetically: `ddd > aa > ab > zz > a > b > c`. It's not needed here; it's just a nicety :p – Georges O. Oct 30 '16 at 14:52
  • It's great indeed :D By the way, since you already solved this problem, I was wondering if you could try to solve another one. It involves, or may involve, the piece of code you already gave here, and it is a bit more complicated (at least for me). I wanted to use this code to sort an array of words. I thought at first that this piece would be enough, but I ran into more problems. You don't have to solve it if you don't want to... but I dare you. :P :D [http://stackoverflow.com/questions/40330383/sort-array-of-words-non-english-letters-double-character-letters-php] – dodo254 Oct 30 '16 at 15:23

Alternatively, you can check each pair of characters against the alphabet and keep the pair together when it matches. For this approach you could reduce the $alphabet array to just the double-character letters:

<?php

ini_set('display_errors',1); // this should be commented out in production environments
error_reporting(E_ALL); // this should be commented out in production environments


$string = 'ljubičicajecvijet';

$alphabet= [
            'a', 'b', 'c',
            'č', 'ć', 'd',
            'dž', 'đ', 'e',
            'f', 'g', 'h',
            'i', 'j', 'k',
            'l', 'lj', 'm',
            'n', 'nj', 'o',
            'p', 'q', 'r',
            's', 'š', 't',
            'u', 'v', 'w',
            'x', 'y', 'z', 'ž'
          ];

function str_split_unicode($str, $length = 1) {
    $tmp = preg_split('~~u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($length > 1) {
        $chunks = array_chunk($tmp, $length);
        foreach ($chunks as $i => $chunk) {
            $chunks[$i] = join('', (array) $chunk);
        }
        $tmp = $chunks;
    }
    return $tmp;
}

$chars = str_split_unicode($string); // single Unicode characters
$new_array = [];

for ($i = 0, $n = count($chars); $i < $n; $i++) {
    // Check every adjacent pair, not only the pairs that start at even offsets.
    $pair = $chars[$i] . ($chars[$i + 1] ?? '');
    // mb_strlen counts characters; strlen would count bytes ('dž' is 3 bytes).
    if (mb_strlen($pair) == 2 && in_array($pair, $alphabet, true)) {
        $new_array[] = $pair;
        $i++; // skip the second half of the double letter
    } else {
        $new_array[] = $chars[$i];
    }
}

print_r($new_array);

?>
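
Reducing the alphabet to its double-character letters, as suggested above, could look like the sketch below. The shortened sample alphabet and the variable name `$doubles` are assumptions made for illustration.

```php
<?php
// Shortened sample alphabet, for illustration only.
$alphabet = ['a', 'b', 'dž', 'l', 'lj', 'n', 'nj', 'ž'];

// Keep only the letters that are two characters long.
// mb_strlen counts characters, so 'dž' counts as 2 even though it is 3 bytes.
$doubles = array_values(array_filter($alphabet, function ($letter) {
    return mb_strlen($letter) === 2;
}));

print_r($doubles); // dž, lj, nj
```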
junkfoodjunkie