I know this looks like an encoding problem, but I don't think it is. I have a site where people use CKEditor to post long texts (stories). When a user saves their work, the HTML goes to a database in which all tables are set up for utf8 encoding.

For every post I generate a "text thumbnail", which is a fragment of the full text. The full text displays correctly, and all pages use UTF-8.

The code I use to get my "text thumbnail":

     <?php
     $str = trim(strip_tags(nl2br($historia['texto']))); //get only text
     echo substr($str, 0, 99) . (strlen($str) > 100 ? '...' : ''); //get part of string, if original string was longer than 100 characters add 3 dots at the end
     ?>

The site has been running for more than a month without trouble. The problem appeared with the following specific string:

<p>Foto artística<br>Mi esposo invito uno de sus viejos amigos a casa, un
   hombre muy impresionante, llegó en un auto de lujo, vistiendo finas ropas, 
   reloj de plata, cadenas de oro y cosas impresionantes, el nos platico de 
   muchas de las cosas a las que se dedico desde que perdió la comunicación 
   con mi esposo, desde ayudante de cocina hasta productor de películas 
   independientes que había logrado vender por sumas importantes de dinero,
   el motivo de su visita era porque necesitaba a alguien como mi esposo 
   para salir en una de sus filmaciones, a cambio recibiría una buena 
   cantidad de dinero, clases de actuación y otros beneficios, claro que 
   aceptamos sin pensarlo.</p>

When I process it with the php code above I get the following result:

Foto artísticaMi esposo invito uno de sus viejos amigos a casa, un hombre muy impresionante, lleg�...

The word that breaks is the accented llegó. Other accented words in the same "text thumbnail", such as artística, do not have the problem, so it seems that an accented letter at the cut-off point is what causes it. I have tried several PHP functions to encode/decode the string before calling substr, without any result. Any guidance toward a solution would be appreciated.

Here is the php code behaving the same way in an online editor https://ideone.com/m6OjUN

Martin

2 Answers

substr operates on bytes, not characters, so feeding it a multibyte string is risky. In UTF-8 the character ó takes two bytes, and your cut falls exactly between them, which breaks the character. Try mb_substr instead:

https://3v4l.org/jkAnv

<?php
$input = '<p>Foto artística<br>Mi esposo invito uno de sus viejos amigos a casa, un hombre muy impresionante, llegó en un auto de lujo, vistiendo finas ropas, reloj de plata, cadenas de oro y cosas impresionantes, el nos platico de muchas de las cosas a las que se dedico desde que perdió la comunicación con mi esposo, desde ayudante de cocina hasta productor de películas independientes que había logrado vender por sumas importantes de dinero, el motivo de su visita era porque necesitaba a alguien como mi esposo para salir en una de sus filmaciones, a cambio recibiría una buena cantidad de dinero, clases de actuación y otros beneficios, claro que aceptamos sin pensarlo.</p>';
$str = trim(strip_tags(nl2br($input))); // get only the text

// Take the first 99 characters; append three dots if the original was longer than 100 characters
echo mb_substr($str, 0, 99) . (mb_strlen($str) > 100 ? '...' : '');
?>

If you want to find out how many bytes a character or string takes, use strlen:

https://3v4l.org/AKHid

<?php
var_dump(strlen('ó'));
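
To make the byte/character difference concrete, here is a small sketch that cuts a single word both ways. It assumes the source file is saved as UTF-8 and the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
$word = 'llegó';

var_dump(strlen($word));             // int(6): bytes, because ó takes two
var_dump(mb_strlen($word, 'UTF-8')); // int(5): characters

// Cutting after 5 bytes keeps only the first byte of ó (0xC3).
$broken = substr($word, 0, 5);
var_dump(bin2hex($broken));                    // string(10) "6c6c6567c3"
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)

// Cutting after 5 characters keeps ó whole.
$intact = mb_substr($word, 0, 5, 'UTF-8');
var_dump($intact === $word);                   // bool(true)
```

The dangling 0xC3 byte is what the browser renders as the replacement character in the question's output.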

References:

http://php.net/manual/en/function.substr.php

http://php.net/manual/en/function.mb-substr.php

Xatenev
    +1, but it's worth noting that the `mb_*` functions only help if the correct character encoding is set at the page-load level, such as in the `php.ini` file or at the top of the page. – Martin Feb 03 '19 at 20:20

Xatenev's answer is correct. However, I wanted to show how to solve the issue more fully.

:: Do this first

  • Install the PHP Multibyte "mbstring" module.

You now have three choices:

i) Set the correct encoding in the whole of PHP

  • Set the PHP internal encoding in the php.ini settings file. (You can also set the HTML and regex encodings as appropriate, using similar functions.)
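
A minimal sketch of what that php.ini entry might look like, assuming PHP 5.6 or later. On those versions `default_charset` is used as the default for mbstring's internal encoding when the older, deprecated `mbstring.internal_encoding` directive is left empty, so one line is usually enough:

```ini
; php.ini (assumption: PHP 5.6 or later)
default_charset = "UTF-8"
```

On PHP 5.6+ this is already the default value, so it is worth checking the current setting before changing anything.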

ii) Or set the correct encoding on this whole page
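
For the per-page option, one possible sketch is a single call to `mb_internal_encoding()` at the top of the script, before any `mb_*` function runs. It assumes the mbstring extension is loaded and that the data really is UTF-8; the sample string is only illustrative.

```php
<?php
// Assumption: mbstring is loaded and the data on this page is UTF-8.
mb_internal_encoding('UTF-8');

// Every later mb_* call without an explicit encoding now uses UTF-8.
$str = 'perdió la comunicación';
echo mb_substr($str, 0, 6), "\n"; // perdió
```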

iii) Or set the correct encoding on the specific functions only:
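
For the per-function option, both `mb_strlen()` and `mb_substr()` accept the encoding as their last argument. The sketch below wraps the thumbnail logic in a helper; `text_thumbnail` and its `$limit` parameter are hypothetical names, not part of the original code, and `nl2br` is left out because `strip_tags` would remove the tags it adds anyway.

```php
<?php
// Hypothetical helper: returns at most $limit characters of plain text,
// followed by three dots when the text had to be shortened.
function text_thumbnail(string $html, int $limit = 100): string
{
    $text = trim(strip_tags($html));

    // The encoding is passed explicitly, so no global setting is needed.
    if (mb_strlen($text, 'UTF-8') <= $limit) {
        return $text;
    }

    return mb_substr($text, 0, $limit, 'UTF-8') . '...';
}

echo text_thumbnail('<p>Foto artística<br>Mi esposo invito uno de sus viejos amigos</p>', 20), "\n";
```

Using the same `$limit` for both the length check and the cut avoids silently dropping a character when the text is exactly at the limit.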

Bonus points:

These do not directly apply to this question but could be related and are worth reiterating.

  • Please note that this answer on the "UTF-8 all the way through" question clearly shows that your MySQL tables, if you use MySQL, need to be _utf8mb4 rather than _utf8, because 4-byte characters will not be saved correctly by MySQL's _utf8.

    Your character ó is 2 bytes.

  • Please also note that this answer also shows that you need to encode HTML output correctly in order to display complex (i.e. 2+ byte) UTF-8 characters.
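
As a rough illustration of the output side, the sketch below declares the response charset and escapes the text with an explicit encoding. The sample string is made up for the example, and the `headers_sent()` guard is only there so the snippet also runs cleanly from the command line.

```php
<?php
// Tell the browser the response is UTF-8 (only possible before any output).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Sample text, not taken from the original post.
$thumbnail = 'llegó <en un auto>';

// Pass the encoding explicitly so multibyte characters are left intact
// while HTML special characters are escaped.
echo htmlspecialchars($thumbnail, ENT_QUOTES, 'UTF-8'), "\n";
```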

Martin