Multibyte-safe way to find unique characters in a string

Question

I have a problem that I thought would be simple but it's turning out to be quite complex.

I have a long UTF-8 string that is a mix of Roman, Western-European, Japanese, and Korean characters and punctuation. Many are multibyte chars, but some (I think) are not.

I need to do 2 things:

1. Make sure there are no duplicate chars (and output that new string, stripped of dupes).
2. Randomly shuffle that new string.

(Sorry, I can't seem to get the code quoting to format right...)

function uniquechars($string) {
    $l = mb_strlen($string);
    $unique = array();
    for($i = 0; $i < $l; $i++) {
        $char = mb_substr($string, $i, 1);
        if(!array_key_exists($char, $unique))
            $unique[$char] = 0;
        $unique[$char]++;
    }
    $uniquekeys = join('', array_keys($unique));
    return $uniquekeys;
}

and:

function unicode_shuffle($string)
{
    $len = mb_strlen($string);
    $sploded = array(); 
    while($len-- > 0) { 
        $sploded[] = mb_substr($string, $len, 1);
    }
    shuffle($sploded);
    $shuffled = join('', $sploded);
    return $shuffled;
}

Using those two functions, which someone very helpfully provided, I THOUGHT I was all set...except that curiously, it seems like the Unique string (no duplicates) and the Shuffled string do not contain the same number of characters. (I am highlighting these chars from my browser and then cutting-and-pasting into another application...one string is always a different length than the one above, but often it varies...it's not even the same number of chars getting truncated each time!).

I'm sorry I don't know enough about PHP nor about coding to sleuth this myself but what on earth is going wrong here? It seems like it should be easy to just shuffle a big long string, but apparently it's much harder than I thought. Is there maybe another, easier way to do this? Should I convert the string first into respective hex numbers and shuffle those, then convert back to UTF-8? Should I output to a file rather than the screen?

Anyone out there have suggestions? I'm sorry, I'm very new to this, so possibly I'm just doing something really dumb.

Formatting code is easy: use 4 spaces in front of every line and it gets recognized as code. Please reformat your code. — Nick Weaver, Mar 25 '11 at 22:00
@apesa: thank you! I somehow thought I had to put 4 spaces only in the first line. — Dave, Mar 25 '11 at 23:16
Related (half of the task): [PHP: Split multibyte string (word) into separate characters](https://stackoverflow.com/q/2556289/2943403) — mickmackusa, May 28 '23 at 02:15

Craig Sefton · Answer 1 · 2011-03-26T23:54:43.630


You can probably do this a lot more simply.

Here's a function to get only the unique characters in a string:

// returns an array of unique characters from a given string
function getUnique( $string ) {

    $chars = preg_split( '//', $string, -1, PREG_SPLIT_NO_EMPTY );
    $unique = array_unique( $chars );

    return $unique;

}

Then, if you want to reshuffle the order, just pass the array of unique chars to shuffle:

shuffle( $unique ); // shuffles $unique in place; the return value is just a bool, not the array

Edit: For multi-byte characters, this function should do the trick (thanks to http://php.net/manual/en/function.mb-split.php for helping with the regex):

function getUnique( $string ) {

    $chars = preg_split( '/(?<!^)(?!$)/u', $string ); 
    $unique = array_unique( $chars );

    return $unique;

}
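To illustrate why the `u` modifier matters here, the following is a minimal sketch (the sample string and variable names are my own, not from the answer). It assumes the input is valid UTF-8 and uses the simpler `//u` pattern with `PREG_SPLIT_NO_EMPTY` for both calls so that the only difference is the modifier.

```php
<?php
// "a", "é" and "日" take 1, 2 and 3 bytes in UTF-8 respectively.
$string = "a\u{E9}\u{65E5}";

// Without /u the subject is treated as a byte sequence,
// so multibyte characters are torn into separate bytes.
$bytes = preg_split('//', $string, -1, PREG_SPLIT_NO_EMPTY);

// With /u the subject is treated as UTF-8,
// so every piece is one whole character.
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

echo count($bytes), "\n"; // 6
echo count($chars), "\n"; // 3
```

Running array_unique() on the byte-wise pieces would deduplicate individual bytes, which can leave broken characters once the pieces are joined back together.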

edited Mar 26 '11 at 23:54

answered Mar 26 '11 at 22:50


Craig, thanks very much for providing this...I've decided to try doing it a little different way: with Unicode. Can this be edited to strip out unique patterns that are in the uABCD; format? Either way thank you for the suggestion!! I will try this too and keep the fingers crossed! – Dave Mar 27 '11 at 02:43
@Dave - not a problem, hope it helps. I did test it with a string that contained Chinese characters, and it seemed to work perfectly. (Just remember to make sure you've got a UTF-8 header set for the output if you're viewing it in a browser otherwise things will look incorrect). I wouldn't recommend trying to work with `uABCD;` formatted chars since you'd just make it more complicated for yourself, but I'm sure you'd be able to get a regular expression working for that, too. Let me know if things work out. – Craig Sefton Mar 27 '11 at 12:28

mickmackusa · Answer 2 · 2023-05-28T02:13:24.990

If you didn't need to shuffle the characters, you could remove all duplicated characters in a single pass using a slightly more laborious pattern with a lookahead for a duplicate.

To shuffle the characters, split the string into an array of single characters, call array_unique() on that array, then shuffle() the result. The shuffling part may not be useful to other developers, but note that shuffle() works on the array in place and returns a boolean (not the shuffled payload), so don't bother assigning its return value to a variable.

Removing dupe chars from a string: (Demo)

$str = 'ăāæåß§śšşçæåß§ś';

var_export(
    preg_replace(
        '/(.)(?=.*\1)/u', // drop a character when the same character occurs again later
        '',
        $str
    )
);
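One thing worth knowing about the lookahead pattern is which occurrence it keeps. The sketch below is my own illustration, with a made-up sample string. It compares the pattern with the split-and-array_unique() approach. The `s` modifier is an addition to the pattern above, so that `.` also matches newlines.

```php
<?php
$str = '日本語日本';

// Lookahead version: a character is removed whenever the same character
// appears again later, so the LAST occurrence of each character survives.
$keepLast = preg_replace('/(.)(?=.*\1)/us', '', $str);

// Split + array_unique(): the FIRST occurrence of each character survives.
$keepFirst = implode(
    '',
    array_unique(preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY))
);

var_export($keepLast);  // '語日本'
echo "\n";
var_export($keepFirst); // '日本語'
```

If the result gets shuffled anyway, the difference does not matter; it only shows when the original order has to be preserved.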

Split, remove dupes, shuffle: (Demo)

$str = 'ăāæåß§śšşçæåß§ś';

$unique = array_unique(
    preg_split(
        '//u',
        $str,
        0,
        PREG_SPLIT_NO_EMPTY
    )
);

shuffle($unique); 

var_export($unique);
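If a string is needed at the end rather than an array, the steps above can be wrapped up roughly as follows. This is only a sketch: `uniqueShuffled()` is a hypothetical helper name, the sample string is made up, and the input is assumed to be UTF-8. It also shows why lengths should be compared with mb_strlen() rather than strlen().

```php
<?php
// Hypothetical helper, not part of the snippet above: returns the unique
// characters of a UTF-8 string in random order, joined back into a string.
function uniqueShuffled(string $str): string
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // preg_split() returns false on failure, e.g. malformed UTF-8 with /u.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $unique = array_unique($chars);
    shuffle($unique); // in place; the bool return value is ignored

    return implode('', $unique);
}

$str      = '日本語日本';
$shuffled = uniqueShuffled($str);

// Count characters, not bytes: each of these characters is 3 bytes in UTF-8.
echo mb_strlen($shuffled, 'UTF-8'), "\n"; // 3
echo strlen($shuffled), "\n";             // 9
```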

I assume that mb_str_split() would also be safe to split whole characters, but I don't know if there are any fringe concerns with encodings.
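For comparison, here is a sketch of the mb_str_split() variant. It assumes PHP 7.4 or later with the mbstring extension loaded, and that the input is meant to be UTF-8. The encoding is passed explicitly so the result does not depend on mb_internal_encoding(), and mb_check_encoding() is used as a guard against malformed input.

```php
<?php
$str = 'ăāæåß§śšşçæåß§ś';

// Reject malformed input up front instead of splitting it.
if (!mb_check_encoding($str, 'UTF-8')) {
    throw new InvalidArgumentException('Expected valid UTF-8');
}

// Split into one-character pieces, then drop the repeats.
$unique = array_unique(mb_str_split($str, 1, 'UTF-8'));

shuffle($unique); // in place; also reindexes the array from 0

var_export($unique);
```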
