I have two columns, tresc and tresc_pelna.

They have the same type and the same content.

The same content. 876 characters in total.

Both are fetched from the database by the same query: ...AS data_dodania, p.data_modyfikacji, p.tresc, p.tresc_pelna, p.url, count(k.id)...

Echoed on the page with <?= strlen($post['tresc_pelna']).'----'.strlen($post['tresc']) ?>

And guess what?

This is the output

876----3248

What the...?

I have absolutely no idea what is happening here xD.

Please help guys :D

Both fields use the utf8_polish_ci collation and hold exactly the same content.

I also tried mb_strlen: <?= mb_strlen($post['tresc_pelna'], 'utf-8').'----'.mb_strlen($post['tresc'], 'utf-8') ?>

Still the wrong result.

tresc still comes out at over three thousand. How? Why?

O. Jones
Krystian Polska

2 Answers

MySQL has two built-in functions for determining the length of variable-length items. One, which counts characters (Unicode code points), is called CHAR_LENGTH(). The other counts octets (bytes), and is called LENGTH().

In PHP, strlen() counts octets, like MySQL's LENGTH(). Many Unicode strings, especially those encoded in utf8, have a variable number of octets per character. To count characters instead of octets you can use grapheme_strlen() (from the intl extension), which counts grapheme clusters, or mb_strlen(), which counts code points.
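A minimal sketch of the difference, assuming the mbstring extension is loaded. The sample string is made up for illustration and is written with \u{...} escapes (PHP 7+) so the result does not depend on the encoding of the source file:

```php
<?php
// Hypothetical sample: the four Polish letters "żółć".
// Each of them takes 2 bytes in UTF-8.
$text = "\u{17C}\u{F3}\u{142}\u{107}";

$bytes = strlen($text);              // octets, like MySQL LENGTH()
$chars = mb_strlen($text, 'UTF-8');  // code points, like MySQL CHAR_LENGTH()

echo $bytes . '----' . $chars . "\n"; // prints 8----4
```

Note that this effect alone can at most double the count for Polish text, since each accented Polish letter takes 2 bytes in UTF-8.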

I've found it's sometimes helpful to do SELECT HEX(unicode_column) to figure out what's stashed in MySQL. Just fetching the column data puts you at the mercy of the character rendering of the MySQL client you use, and can be very confusing.

It's also possible your database columns have entitized data in them (for example the string &eacute; rather than the Unicode character é). If that entity text gets sent to a web browser, it renders as the letter, so the two columns can look identical on the page while holding different bytes.
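The same kind of inspection can be done on the PHP side with bin2hex(). This is only a sketch: the two variables below are hypothetical stand-ins for $post['tresc_pelna'] and $post['tresc'], holding the same letter once as a raw character and once as an entity.

```php
<?php
// Hypothetical stand-ins for the two column values.
$plain     = "\u{F3}";     // the letter "ó" as a raw UTF-8 character
$entitized = "&oacute;";   // the same letter stored as an HTML entity

// Both render as "ó" in a browser, but the stored bytes differ.
echo bin2hex($plain) . ' (' . strlen($plain) . " bytes)\n";         // c3b3 (2 bytes)
echo bin2hex($entitized) . ' (' . strlen($entitized) . " bytes)\n"; // 266f61637574653b (8 bytes)
```

If the hex dump of the longer column starts showing runs like 26...3b (an ampersand up to a semicolon), that points to entities.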

O. Jones

The difference between LENGTH and CHAR_LENGTH could explain a ratio of under 1.2x for most European text. It won't explain 3248:876, which is roughly 3.7x.
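To make the arithmetic concrete, here is a rough sketch that compares the byte-to-character ratio of an ordinary Polish sentence with the ratio reported in the question. The sentence is an invented sample, not the asker's data, and mbstring is assumed to be available:

```php
<?php
// Hypothetical sample: "To jest przykładowa treść wpisu na blogu."
$sample = "To jest przyk\u{142}adowa tre\u{15B}\u{107} wpisu na blogu.";

$bytes = strlen($sample);              // 44 octets
$chars = mb_strlen($sample, 'UTF-8');  // 41 code points

$typicalRatio  = $bytes / $chars;      // bytes per character in the sample
$observedRatio = 3248 / 876;           // numbers from the question

printf("typical: %.2f, observed: %.2f\n", $typicalRatio, $observedRatio);
// prints: typical: 1.07, observed: 3.71
```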

Perhaps these are part of the answer:

  • HTML entities, such as &oacute;, which takes 8 bytes to represent a 2-byte utf8 character. We can't see whether one of the columns has < and the other has &lt;.
  • Formatting tags, such as <p>. Again, these may be stored as &lt;p&gt;.
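One way to test both guesses is to decode entities and strip tags before measuring. The sketch below uses invented values in place of the real columns and assumes mbstring is loaded; html_entity_decode() and strip_tags() are standard PHP functions.

```php
<?php
// Hypothetical values: the same markup stored raw and stored entity-encoded.
$raw       = "<p>wt\u{F3}rek</p>";                  // <p>wtórek</p>
$entitized = "&lt;p&gt;wt&oacute;rek&lt;/p&gt;";

echo strlen($raw) . '----' . strlen($entitized) . "\n"; // prints 14----32

// Decode entities back to characters, then compare again.
$decoded = html_entity_decode($entitized, ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump($decoded === $raw);                         // bool(true)

// Length of the visible text only, with tags removed.
echo mb_strlen(strip_tags($decoded), 'UTF-8') . "\n"; // prints 6
```

If the two real columns compare equal after this kind of normalization, the extra length comes from entities and escaped markup rather than from the character set.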

Still, that is not enough to explain nearly 4x. For example, a simple letter, such as a, will be one byte, regardless of how it is encoded. Please provide the HEX for a small sample.

Rick James