
I have two similar-looking strings:

$no_space    = ',بشکه,';

$with_space  = ',بشکه‌,';

I was running some database queries when I noticed that these similar-looking words return different results, so I wrote this short script to compare them character by character:

$no_space    = ',بشکه,';
$with_space = ',بشکه‌,';


echo '<br /> ------------------------------------------ <br />';

// Print the length and every character of the first string.
$string = $no_space;
echo ' total length : ' . mb_strlen($string, "UTF-8") . '<br />';
for ($i = 0; $i < mb_strlen($string, "UTF-8"); $i++) {
    $char_b = mb_substr($string, $i, 1, "UTF-8");
    echo $i . ' -> ' . $char_b . '<br />';
}

echo '<br /> ------------------------------------------ <br />';

// Same dump for the second string, which contains the extra character.
$string = $with_space;
echo ' total length : ' . mb_strlen($string, "UTF-8") . '<br />';
for ($i = 0; $i < mb_strlen($string, "UTF-8"); $i++) {
    $char_b = mb_substr($string, $i, 1, "UTF-8");
    echo $i . ' -> ' . $char_b . '<br />';
}

Here is the result when I run the code with the browser's text encoding set to Unicode:

 total length : 6
0 -> ,
1 -> ب
2 -> ش
3 -> ک
4 -> ه
5 -> ,

------------------------------------------
total length : 7
0 -> ,
1 -> ب
2 -> ش
3 -> ک
4 -> ه
5 -> ‌
6 -> ,

As you can see, there appears to be a space at index 5 of the second string. But when I run the code with the browser set to a Latin encoding, I get this result:

 total length : 6
0 -> ,
1 -> ب
2 -> Ø´
3 -> Ú©
4 -> Ù‡
5 -> ,

------------------------------------------
total length : 7
0 -> ,
1 -> ب
2 -> Ø´
3 -> Ú©
4 -> Ù‡
5 -> ‌
6 -> ,

The character at index 5 gives me ‌, which is clearly not a space. I know that because if I add a space to the string, I get a space in the output.
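One way to find out what an invisible character really is would be to print each character's Unicode code point instead of the character itself. The following is only a minimal sketch: `dump_code_points` is a hypothetical helper name, the sample string uses an explicit `\u{200C}` escape as a stand-in for the unknown character, and it assumes PHP 7.2 or later (for `mb_ord` and the `\u{...}` escape).

```php
<?php
// Hypothetical helper (not part of the original script): returns one line per
// character, showing its index and its Unicode code point in U+XXXX form.
function dump_code_points(string $string): array
{
    $lines = [];
    // The "u" modifier makes PCRE treat the subject as UTF-8, so the empty
    // pattern splits between characters rather than between bytes.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $i => $char) {
        // mb_ord() requires PHP 7.2 or later.
        $lines[] = sprintf('%d -> U+%04X', $i, mb_ord($char, 'UTF-8'));
    }
    return $lines;
}

// Sample input: comma, "a", an explicit zero width non-joiner, comma.
echo implode("\n", dump_code_points(",a\u{200C},")), "\n";
// 0 -> U+002C
// 1 -> U+0061
// 2 -> U+200C
// 3 -> U+002C
```

Because the output is plain ASCII, it looks the same regardless of the encoding the browser uses to display the page.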

So what is this ‌, and how can I remove it from my string? I've already tried:

$with_space= html_entity_decode($with_space, ENT_QUOTES, "UTF-8");

and even something silly like this:

$with_space= str_replace('‌', '', $with_space);

Please note that I want to remove these characters from any string, not only strings that come from the database.

progwin
  • They are multibyte characters. Latin is a single byte font set. See http://kunststube.net/encoding/ and/or http://stackoverflow.com/questions/279170/utf-8-all-the-way-through. – chris85 Apr 22 '16 at 14:25
  • It looks like you've got the unicode 200c "zero width non-joiner". – borrible Apr 22 '16 at 15:53
  • `‌` is `U+200C` encoded as UTF-8 but decoded as `Windows-1252`. – Mark Tolonen Apr 23 '16 at 02:58
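If the character is indeed U+200C, as the comments above suggest, one possible way to strip it is to match the code point explicitly rather than pasting the invisible character into the source file. This is a sketch under that assumption, not a confirmed fix: `strip_zwnj` is a hypothetical helper name, and the `\u{200C}` escape in the sample requires PHP 7.0 or later.

```php
<?php
// Hypothetical helper: removes every U+200C (zero width non-joiner) from a
// UTF-8 string.
function strip_zwnj(string $string): string
{
    // \x{200C} together with the "u" modifier matches the code point itself.
    // preg_replace() returns null if the input is not valid UTF-8; in that
    // case the input is returned unchanged.
    return preg_replace('/\x{200C}/u', '', $string) ?? $string;
}

// Sample input with an explicit zero width non-joiner before the last comma.
$with_zwnj = ",abc\u{200C},";
$clean     = strip_zwnj($with_zwnj);

echo mb_strlen($with_zwnj, 'UTF-8'), ' -> ', mb_strlen($clean, 'UTF-8'), "\n"; // 6 -> 5

// U+200C is the bytes E2 80 8C in UTF-8. Reinterpreting those bytes as
// Windows-1252 shows the three-character sequence a Latin-encoded page displays.
echo mb_convert_encoding("\u{200C}", 'UTF-8', 'Windows-1252'), "\n";
```

Note that U+200C is a legitimate character in Persian orthography, so removing it from every string may change how some words are joined when rendered.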
