3

I'm getting data from a MySQL database (a varchar(255) utf8_general_ci field) and trying to write the text to a PDF with PHP. I need to determine the string length in the PDF to limit the text output in a table, but I noticed that the output of mb_substr/substr is really strange.

For example:

mb_internal_encoding("UTF-8");

$_tmpStr = $vfrow['title'];
$_tmpStrLen = mb_strlen($vfrow['title']);
for($i=$_tmpStrLen; $i >= 0; $i--){
     file_put_contents('cutoffattributes.txt',$vfrow['field']." ".$_tmpStr."\n",FILE_APPEND);
     file_put_contents('cutoffattributes.txt',$vfrow['field']." ".mb_substr($_tmpStr, 0, $i)."\n",FILE_APPEND);
}

outputs this:

screen shot from npp

npp file link

Database:

[two screenshots, images not available]

My question is where does the extra character come from?

aLx13
  • You're not providing an encoding to mb_substr; are you sure it's getting the right encoding? See [this answer](http://stackoverflow.com/questions/13953248/php-mb-substr-not-working-correctly), as well. – xathien Apr 22 '15 at 16:41
  • You are right to use mb_strlen()/mb_substr() instead of strlen()/substr(), because the latter could slice a multibyte character in the middle. What even mb_strlen()/mb_substr() can still do is slice a composite sequence in the middle, such as an "n" followed by its combining accent. You might get away with transcoding the content to a non-composite (precomposed) form, which exists for some accented letters. – Ulrich Eckhardt Apr 23 '15 at 07:15
  • Can you show us the output of `bin2hex($_tmpStr)` after the variable is set? – Michas Apr 23 '15 at 21:42
  • @Michas bin2hex: 526f7a6d696172206369c499636961206b617761c5826b69207069657277737a792073746f706965c584 – aLx13 Apr 24 '15 at 07:44

3 Answers

1

The extra character is the first byte of a two-byte UTF-8 sequence. You may have a problem with the internal encoding of the Multibyte String functions: your code treats the text as a fixed-width, single-byte encoding. The ń in UTF-8, hex C5 84, is treated as Ĺ„ in CP-1250 and Ĺ[IND] in ISO-8859-2, that is, as two characters.
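As a rough illustration of that point, the bytes posted in the bin2hex comment can be cut once by byte and once by character. This is only a sketch: it assumes the mbstring extension is loaded and PHP 5.4+ for hex2bin(), and the variable names are made up for the example.

```php
<?php
// Bytes copied from the bin2hex() output posted in the comments.
$title = hex2bin('526f7a6d696172206369c499636961206b617761c5826b69207069657277737a792073746f706965c584');

// Byte-oriented functions count each two-byte character twice.
$byteLen = strlen($title);              // length in bytes
$charLen = mb_strlen($title, 'UTF-8');  // length in characters

// Dropping the last byte splits the trailing "ń" (C5 84) and leaves a lone C5.
$byteCut = substr($title, 0, $byteLen - 1);
// Dropping the last character removes the whole "ń".
$charCut = mb_substr($title, 0, $charLen - 1, 'UTF-8');

echo $byteLen, ' bytes, ', $charLen, " characters\n"; // 42 bytes, 39 characters
echo bin2hex(substr($byteCut, -2)), "\n";             // 65c5 ("e" plus a dangling lead byte)
echo bin2hex(substr($charCut, -2)), "\n";             // 6965 ("ie")
```

The dangling C5 byte is what a viewer set to a single-byte code page would display as the stray "Ĺ".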

Try executing this at the top of the script:

mb_internal_encoding("UTF-8");

http://php.net/manual/en/function.mb-internal-encoding.php

Michas
1
  1. You need to ensure you're actually getting the data from the database in UTF-8 encoding by setting your connection encoding appropriately. This depends on your database adapter, see UTF-8 all the way through for details.
  2. You need to tell your mb_ functions that the data is in UTF-8 so they can treat it correctly. Either set this globally for all functions using mb_internal_encoding, or pass the $encoding parameter to your function when you call it:

    mb_substr($_tmpStr, 0, $i, 'UTF-8')
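The second point could be wrapped in a small helper so the encoding is never left implicit. The sketch below is an assumption, not part of the original answer: the function name truncateForCell is hypothetical, it requires the mbstring extension, and the type declarations need PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original answer): cut a string to at most
// $maxChars characters, always naming the encoding explicitly.
function truncateForCell(string $text, int $maxChars, string $encoding = 'UTF-8'): string
{
    // Reject input that is not valid in the stated encoding instead of
    // silently producing broken output.
    if (!mb_check_encoding($text, $encoding)) {
        throw new InvalidArgumentException("Input is not valid $encoding");
    }
    // Short strings are returned unchanged.
    if (mb_strlen($text, $encoding) <= $maxChars) {
        return $text;
    }
    return mb_substr($text, 0, $maxChars, $encoding);
}

// Byte escapes keep the example independent of the source file's encoding.
echo truncateForCell("stopie\xC5\x84", 6), "\n";  // stopie
echo truncateForCell("kawa\xC5\x82ki", 5), "\n";  // kawał
```

Checking the input first makes a wrong connection charset show up as an exception rather than as stray characters in the PDF.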
    
deceze
  • I did use mb_internal_encoding but setting the encoding parameter of mb_substr to UTF-8 did work! – aLx13 Apr 23 '15 at 07:01
0

Aside from the table and field being set to UTF-8, you also need to set the connection charset to UTF-8 with mysqli_set_charset($link, 'utf8'), where $link is your connection (if you are using mysqli).

Also, did you try this?

$_tmpStr = utf8_encode( $vfrow['title'] ); 
Izzy
  • I already did this, that's why I don't understand this behavior... SET NAMES utf8 & SET CHARACTER SET 'utf8' – aLx13 Apr 23 '15 at 06:54
  • Would you improve your question with the actual table structure and some sample data from it? – Izzy Apr 23 '15 at 08:02