PHP method for stripping duplicate chars from a multibyte string?

Question

Arrrgh. Does anyone know how to create a function that's the multibyte character equivalent of the PHP count_chars($string, 3) command?

Such that it will return a list of ONLY ONE INSTANCE of each unique character. If that was English and we had

"aaabggxxyxzxxgggghq xcccxxxzxxyx"

It would return "abcgh qxyz" (Note the space IS counted).

(The order isn't important in this case, can be anything).

If Japanese kanji (not sure browsers will all support this):

漢漢漢字漢字私私字私字漢字私漢字漢字私

And it will return just the 3 kanji used:

漢字私

It needs to work on any UTF-8 encoded string.
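To make the problem concrete, here is a small sketch of what `count_chars($string, 3)` does with the two sample strings above. It assumes the mbstring extension is available for the final check; the variable names are only illustrative.

```php
<?php
// Sample strings taken from the question.
$ascii = 'aaabggxxyxzxxgggghq xcccxxxzxxyx';
$kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';

// Mode 3 returns every byte value that occurs, once each, sorted by byte value.
$asciiUnique = count_chars($ascii, 3);
var_dump($asciiUnique); // string(10) " abcghqxyz"

// On UTF-8 input the same call returns loose bytes, not characters,
// so the result is no longer a valid UTF-8 string.
$kanjiBytes = count_chars($kanji, 3);
var_dump(strlen($kanjiBytes));
var_dump(mb_check_encoding($kanjiBytes, 'UTF-8')); // bool(false)
```

The function works per byte, which is fine for single-byte text but splits each kanji into its three UTF-8 bytes.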

In the accepted answer below, the first script is designed to remove CONSECUTIVE duplicates only. The other scripts in the answer are overly verbose. See the advice on the dupe target. Also note in another answer that `str_split()` is not multibyte/unicode safe. – mickmackusa, May 28 '23 at 02:09

Charles · Accepted Answer · 2011-03-25T19:51:49.600

4

Hey Dave, you're never going to see this one coming.

php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
php > $not_kanji = 'aaabcccbbc';
php > $pattern = '/(.)\1+/u';
php > echo preg_replace($pattern, '$1', $kanji);
漢字漢字私字私字漢字私漢字漢字私
php > echo preg_replace($pattern, '$1', $not_kanji);
abcbc

What, you thought I was going to use mb_substr again?

In regex-speak, it's looking for any one character, then one or more instances of that same character. The matched region is then replaced with the one character that matched.

The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
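As a rough sketch of the "string must already be UTF-8" caveat: with the u modifier, `preg_replace` returns `null` when the subject is not valid UTF-8, so it is worth checking for that. The function name here is hypothetical, not part of PHP.

```php
<?php
// Hypothetical helper: collapses runs of the same character into one.
// Returns null when the input is not valid UTF-8.
function collapse_consecutive_duplicates(string $input): ?string
{
    // /u makes PCRE read the subject as UTF-8 code points instead of bytes.
    // Note that "." does not match a newline unless the s modifier is added.
    $result = preg_replace('/(.)\1+/u', '$1', $input);

    // On malformed UTF-8, preg_replace() gives back null rather than a string.
    if ($result === null && preg_last_error() === PREG_BAD_UTF8_ERROR) {
        return null;
    }

    return $result;
}

var_dump(collapse_consecutive_duplicates('aaabcccbbc')); // string(5) "abcbc"
var_dump(collapse_consecutive_duplicates("\xFF\xFF"));   // NULL
```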

Hey, guess what!

$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
$l = mb_strlen($not_kanji);
$unique = array();
for($i = 0; $i < $l; $i++) {
    $char = mb_substr($not_kanji, $i, 1);
    if(!array_key_exists($char, $unique))
        $unique[$char] = 0;
    $unique[$char]++;
}
echo join('', array_keys($unique));

This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract it one character at a time. We then use that character as a key in an array. We're taking advantage of the fact that PHP arrays are ordered: keys stay in the order in which they were first inserted. Once we've gone through the string and identified all of the characters, we grab the keys and join 'em back together in the order that they first appeared in the string. You also get a per-character count from this technique.

This would have been much easier if there was such a thing as mb_str_split to go along with str_split.

(No Kanji example here, I'm experiencing a copy/paste bug.)
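For readers on PHP 7.4 or later, where `mb_str_split` does exist, the same idea can be sketched more compactly, this time with the kanji string from the question. The helper name is hypothetical, and the sketch assumes UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper; assumes PHP 7.4+ (mb_str_split) and UTF-8 input.
function mb_unique_chars(string $input, string $encoding = 'UTF-8'): string
{
    // Split into characters rather than bytes.
    $chars = mb_str_split($input, 1, $encoding);

    // array_unique() keeps the first occurrence of each value, in original order.
    return implode('', array_unique($chars));
}

echo mb_unique_chars('漢漢漢字漢字私私字私字漢字私漢字漢字私'), "\n"; // 漢字私
echo mb_unique_chars('aaabggxxyxzxxgggghq xcccxxxzxxyx'), "\n";     // abgxyzhq c
```

Unlike the loop above, this version does not give you the per-character counts; it only returns the unique characters.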

Here, try this on for size:

function mb_count_chars_kinda($input) {
    $l = mb_strlen($input);
    $unique = array();
    for($i = 0; $i < $l; $i++) {
        $char = mb_substr($input, $i, 1);
        if(!array_key_exists($char, $unique))
            $unique[$char] = 0;
        $unique[$char]++;
    }
    return $unique;
}

function mb_string_chars_diff($one, $two) {
    $left = array_keys(mb_count_chars_kinda($one));
    $right = array_keys(mb_count_chars_kinda($two));
    return array_diff($left, $right);
}

print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
/* => 
Array
(
    [5] => f
    [6] => g
)
*/

You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
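If calling it twice by hand gets tedious, both directions can be folded into one helper. This is only a sketch: the function names are made up for illustration, and it assumes UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper: unique characters of a string, in order of first appearance.
function unique_chars_of(string $input): array
{
    $unique = [];
    $length = mb_strlen($input, 'UTF-8');
    for ($i = 0; $i < $length; $i++) {
        $unique[mb_substr($input, $i, 1, 'UTF-8')] = true;
    }
    // PHP turns keys such as "1" into integers, so cast them back to strings.
    return array_map('strval', array_keys($unique));
}

// Hypothetical helper: characters that appear in exactly one of the two strings.
function mb_string_chars_symmetric_diff(string $one, string $two): array
{
    $left  = unique_chars_of($one);
    $right = unique_chars_of($two);

    // array_diff() is one-directional, so run it both ways and merge the results.
    return array_merge(array_diff($left, $right), array_diff($right, $left));
}

print_r(mb_string_chars_symmetric_diff('aabbccddeeffgg', 'abcdexyz'));
// f and g are missing from the second string; x, y and z from the first.
```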

edited Mar 25 '11 at 19:51

answered Mar 24 '11 at 04:24

Charles


Charles, thank you for the code! It's almost what I need, but the returned values still have duplicate characters. "abcbc" has 2 "b"s and 2 "c"s. I just need it to return "abc" (one instance ONLY of each unique character.) It seems like somehow it needs to check the current character against a preexisting list, add the character if it's missing, and NOT add it if it is already there. I know mentally how to do it, just that I'm not a programmer and don't know how to get it to do that in code. >_< – Dave Mar 24 '11 at 12:16
Ah, I misread your question -- I thought you were looking for duplicate sequential character removal, when you just wanted all of the unique characters in the string. I'll update my answer in a bit. – Charles Mar 24 '11 at 15:38
I can't thank you enough, Charles. This is exactly what I needed. And it works just fine with kanji too! – Dave Mar 25 '11 at 01:56
Okay, now I'm noticing something very strange. This is with a HUGE string that includes (I hope!) every Japanese, Korean, and Roman-Latin-European character and punctuation. The string (according to mb_strlen) has 9080 chars to start. 30 are not unique. So I end up with a 9050 length string. I use the output of the first function above to get just an unshuffled list, length 9050. Then I shuffle that output, and get 9050 again. HOWEVER, when I then copy those two (unique, unshuffled) and (unique, shuffled) strings into my program they're consistently a few characters off! Any idea why? – Dave Mar 25 '11 at 15:07
Are those characters *always* in UTF-8, and are you including the encoding in the various calls to the `mb_` functions? I excluded them here because I wasn't completely sure what charset you were using. – Charles Mar 25 '11 at 15:16
Doing more research, I noticed that while the original string contained an "h" (regular roman, normal "h") the new processed string doesn't. Somehow, this character (and others) are being lost. I'm stumped, but my one thought is maybe this won't work on a string that's mixed with regular alphanumerics as well as things like kanji or hangul? The string is so long it's hard to pick through and see what characters are missing. Can I possibly ask for a 3rd function: compare 2 strings and show only the characters that are NOT in both strings? Either way, thanks so much for your help!! – Dave Mar 25 '11 at 15:32
As it happens, you can take the results of the `array_keys` calls here for the two strings, then use [`array_diff`](http://us2.php.net/manual/en/function.array-diff.php) to get the difference. – Charles Mar 25 '11 at 15:59
Trying to put this together on my own: array_keys($not_kanji) and array_keys($unique)? then something like: $diff = array_diff(array_keys($not_kanji), array_keys($unique)); – Dave Mar 25 '11 at 18:15
1: Take one string and run it through the process. Stick `array_keys($unique)` in a variable. 2: Take the other string and run it through the process. Stick `array_keys($unique)` in another variable. 3: `$diff = array_diff($var1, $var2);`, tada! – Charles Mar 25 '11 at 18:20
Okay, almost getting it. I've put your code above into a function and am currently returning the last "join" as the output. If I assign array_keys($unique) to a variable will that be available outside the function? – Dave Mar 25 '11 at 18:56
I tried doing this (tried posting my code but it's too long to fit) and am getting a return value of "Array" -- am betting I'm not doing it right. Super sorry to be consuming so much of your time. Thank you for the help! – Dave Mar 25 '11 at 19:15
The string "Array" means you managed to turn an array into a string using a string operator. PHP is nice enough not to complain about this. I'll c&p together something for you in a few moments. :) – Charles Mar 25 '11 at 19:37
@Dave, I've added two functions and an example to my answer. Hopefully that'll be the last thing you need, this comment area is getting a bit long! – Charles Mar 25 '11 at 19:52
Works, Charles!! I was doing "echo" but clearly "print_r" is vital to the equation! Thanks so much. I WON'T pester any more!! You've been a huge help! – Dave Mar 25 '11 at 21:00
@Dave, glad to hear you got it working. Yeah, `echo` just handles strings, if you need to inspect an array, `print_r`, `var_dump` and `var_export` are all good tools. – Charles Mar 25 '11 at 21:10
Okay, I'm going to post a new question, because sure enough, these two arrays are _different_ and they should be the same. I am running the first function (string A) to remove duplicate chars, which goes from 9098 chars to 9050 (string B). Then I shuffle the 9050 and get a third string (string C). However (if this third function is outputting correctly, and I think so because I can SEE that one string is longer) for some reason the two resulting strings (B and C) DO have different characters! Why would that be? (You've done enough, so I'm posting a new question... thanks so much for the help! :-) – Dave Mar 25 '11 at 21:29
@Charles, will these functions work on strings that are in numeric HTML entity format ("&#NNNN; &#NNNN; etc.")? I'm realizing it would (maybe?) be better to do these functions in Unicode UTF-8. – Dave Mar 27 '11 at 23:02
@Dave, no, it'll process those literally. You might want to run the strings through [`html_entity_decode`](http://us2.php.net/manual/en/function.html-entity-decode.php) first, to convert those entities into real characters. Take note of the last argument, which is the character set that will be used during the decode. – Charles Mar 27 '11 at 23:10
I'm thinking I might do it (assuming this is doable!) by reading each value up through the semi-colon into an array, then shuffling the array elements or comparing the array elements to see if they're identical. If one entity and another identical one exist, it shouldn't be too hard to tell those are the same. Thanks (again) for the help!! – Dave Mar 28 '11 at 14:34
@Dave, while comparing the two *encoded* strings against each other for similarity will work, when you compare it to the actual represented character, it will fail. – Charles Mar 28 '11 at 16:23
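The entity-decoding step suggested in the comments above could look roughly like this. It is a sketch only: the function name is hypothetical, the input is assumed to use numeric HTML entities, and the result is produced as UTF-8.

```php
<?php
// Hypothetical helper: decode HTML entities first, then collect unique characters.
function unique_chars_from_entities(string $encoded): string
{
    // The last argument is the character set the entities are decoded into.
    $decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Split on code points; preg_split() returns false if the text is not valid UTF-8.
    $chars = preg_split('//u', $decoded, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return '';
    }

    return implode('', array_unique($chars));
}

// &#x6F22; is 漢 and &#x5B57; is 字.
echo unique_chars_from_entities('&#x6F22;&#x5B57;&#x6F22;a&#x5B57;a'), "\n"; // 漢字a
```

Decoding first means an entity and the literal character it stands for are treated as the same character.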

HoldOffHunger · Answer 2 · 2018-09-25T16:46:52.353

0

$name = "My string";
$name_array = str_split($name);
$name_array_uniqued = array_unique($name_array);
print_r($name_array_uniqued);

Much easier. Use str_split to turn the phrase into an array with each character as an element. Then use array_unique to remove duplicates. Pretty simple. Nothing complicated. I like it that way.
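One caveat, as the comment under the question points out: `str_split` cuts on bytes, so this only works for single-byte text. The sketch below contrasts it with a code-point split using `preg_split` and the u modifier; the function name is hypothetical and UTF-8 input is assumed.

```php
<?php
// Hypothetical helper: same split-then-unique idea, but splitting on code points.
function unique_chars_preg(string $input): string
{
    // An empty pattern with /u splits between UTF-8 characters, not bytes.
    $chars = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // input was not valid UTF-8
    }
    return implode('', array_unique($chars));
}

$word = '漢字';
echo count(str_split($word)), "\n";                                 // 6 (bytes)
echo count(preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY)), "\n"; // 2 (characters)
echo unique_chars_preg('漢漢字漢'), "\n";                            // 漢字
```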

edited Sep 25 '18 at 16:46

answered Aug 11 '13 at 00:33

HoldOffHunger


score 0 · Answer 3 · answered Mar 24 '11 at 01:29

0

Have a look at the iconv_strlen function from PHP's standard library. I can't speak for East Asian encodings, but it works fine for European and Eastern European languages. In any case, it gives you some flexibility!

answered Mar 24 '11 at 01:29

Igor


This doesn't seem to have the option to return the unique chars though, just the _number_. I need a way to take a string and reduce it to only its unique chars. Or am I missing something? – Dave Mar 24 '11 at 01:53
No, you're not missing anything. I mentioned iconv because it's the only PHP library I know of that can deal with multibyte encodings. Another option would be to use a database backend and do something like a DISTINCT selection, counting, etc. – Igor Mar 24 '11 at 02:37
Okay, no, what I need is the _3_ (above count_chars($string, 3)) which shows the characters, not a number. But thanks anyway. Still hoping to figure this out somehow... – Dave Mar 24 '11 at 03:14
