You've tagged 'character-encoding', so at least you know where the problem starts. You've got some ... probably ... multi-byte characters in there. You are counting your 'lines' by exploding on PHP_EOL, which is 0x0A on Unix-like systems (0x0D 0x0A on Windows). explode acts on bytes and not on multi-byte characters, so any 0x0A byte that forms part of a multi-byte character is treated as the end of a 'line'. That cannot happen in valid UTF-8, where every byte of a multi-byte sequence is 0x80 or higher, so if this is what you are seeing, the text is more likely UTF-16 or UTF-32, where a character such as U+010A is stored with a 0x0A byte. var_dump your exploded array and you'll see the issue easily enough.
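To make the byte-versus-character point concrete, here is a minimal sketch. The sample string "aĊb" is my own choice, not taken from your data; it is written as raw bytes so the example needs no extensions. The string contains no line break at all, yet the UTF-16BE form still splits, because Ċ (U+010A) is stored as the bytes 01 0A.

```php
<?php
// "aĊb" in two encodings, written out as raw bytes (hypothetical sample data).
$utf8  = "a\xC4\x8Ab";         // Ċ (U+010A) is C4 8A in UTF-8: no 0x0A byte
$utf16 = "\x00a\x01\x0A\x00b"; // Ċ is 01 0A in UTF-16BE: contains a 0x0A byte

foreach (['UTF-8' => $utf8, 'UTF-16BE' => $utf16] as $name => $bytes) {
    // explode() compares bytes, so it cannot tell a real LF from half a character.
    $pieces = explode("\n", $bytes);
    printf("%-8s bytes=%s pieces=%d\n", $name, bin2hex($bytes), count($pieces));
}
// UTF-8    bytes=61c48a62 pieces=1
// UTF-16BE bytes=0061010a0062 pieces=2
```

If your own var_dump shows pieces that start or end in the middle of a character like this, that points at the same cause.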
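Try count(mb_split('(\r?\n)', $text)) and see what you get. Note that mb_split interprets the pattern and the string in the encoding set by mb_regex_encoding(), so set that to match your text first. My regex is poor, though, and that might not work. For more help on the regex you need to split on a new line, see this question: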
Match linebreaks - \n or \r\n?
Remember that your line ending might possibly be \u0085 (NEL, 'next line'), but I doubt it, since splitting on PHP_EOL is being too aggressive (finding too many line breaks) rather than missing them.
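As a rough sketch of a character-aware line count, the snippet below converts the text to UTF-8 first and then splits on the common line endings plus \u0085. To be clear about what is assumed: countLines is a hypothetical helper name, the source encoding (UTF-16BE in the example) is a guess you would need to replace with your real one, and I have swapped in preg_split with the /u modifier instead of mb_split because I am more confident of its behaviour. It needs the mbstring extension for mb_convert_encoding.

```php
<?php
// Hypothetical helper: count lines in text whose encoding is known.
function countLines(string $text, string $sourceEncoding = 'UTF-8'): int
{
    // Normalise to UTF-8, where a 0x0A byte is always a real line feed.
    $utf8 = mb_convert_encoding($text, 'UTF-8', $sourceEncoding);

    // Split on CRLF, CR, LF or NEL (U+0085); /u makes PCRE treat the subject as UTF-8.
    $lines = preg_split('/\r\n|\r|\n|\x{85}/u', $utf8);

    // preg_split returns false on error (for example, invalid UTF-8).
    return $lines === false ? 0 : count($lines);
}

// "aĊ", a real LF, then "b", encoded as UTF-16BE (sample data, not yours).
$utf16 = "\x00a\x01\x0A\x00\x0A\x00b";

echo count(explode("\n", $utf16)), "\n";   // 3: the byte-wise split also cuts Ċ in half
echo countLines($utf16, 'UTF-16BE'), "\n"; // 2: only the real line break counts
```

Converting once up front keeps the rest of the code simple, at the cost of needing to know (or detect) the source encoding.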
If mb_split works, remember that you'll need to use PHP's mb_ functions for all of your string manipulations. PHP's standard string functions work on bytes, which in effect assumes single-byte characters; the separate mb_ functions are provided to handle multi-byte characters.
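A quick illustration of that difference, using a sample word of my own ("héllo") and assuming UTF-8 text with the mbstring extension loaded:

```php
<?php
$word = "h\u{00E9}llo"; // "héllo"; é (U+00E9) takes two bytes in UTF-8

echo strlen($word), "\n";                   // 6: strlen counts bytes
echo mb_strlen($word, 'UTF-8'), "\n";       // 5: mb_strlen counts characters
echo bin2hex(substr($word, 0, 2)), "\n";    // 68c3: substr cuts é in half
echo mb_substr($word, 0, 2, 'UTF-8'), "\n"; // hé: mb_substr keeps it whole
```

Passing the encoding explicitly avoids depending on whatever mb_internal_encoding() happens to be set to.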