Problem when reading file with non-English characters in PHP

Question

I'm having trouble reading a file that contains non-English characters. I need to read the file line by line, using the following code:

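$handle = fopen($filePath, 'rb');
// fgets() returns false at end of file, so test its result instead of feof()
while (($line = fgets($handle)) !== false) {
    // process $line here
}
fclose($handle);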

The file has 1711 lines, but strangely PHP reports 1766 lines when I traverse it:

$text = file_get_contents($filePath);
$numOfLines = count(explode(PHP_EOL, $text));

I would really appreciate it if anyone could help me with this issue.

1) Try reducing that file to a smaller one. 2) If you still can't figure out, please share that smaller version in your question. — Jeto, Apr 19 '21 at 18:58

score 1 · Answer 1 · answered Apr 21 '21 at 23:33

You've tagged 'character-encoding', so at least you know where the problem starts. You've probably got some multi-byte characters in there. You are counting your 'lines' by exploding on PHP_EOL, which is "\n" (0x0A) on Unix-like systems and "\r\n" on Windows. explode acts on bytes, not characters, so any 0x0A byte is treated as the end of a 'line', even when it is only one byte of a wider character. That cannot happen in UTF-8, where every byte of a multi-byte character is 0x80 or above, but it can in encodings such as UTF-16, so check which encoding the file is really in. var_dump your exploded array and you'll see the issue easily enough.
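To make the byte-versus-character point concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and uses a made-up three-character string, not your file. The character U+010A (Ċ) is stored as C4 8A in UTF-8 but as 01 0A in UTF-16BE, so explode only finds a "newline" in the UTF-16 copy:

```php
<?php
// Assumption: the mbstring extension is available for mb_convert_encoding().
// The sample string contains no line break at all.
$utf8  = "a\u{010A}b";
$utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');

// explode() compares raw bytes, so it splits on the 0x0A inside the UTF-16 code unit.
$utf8Pieces  = count(explode("\n", $utf8));
$utf16Pieces = count(explode("\n", $utf16));

echo bin2hex($utf8), ' -> ', $utf8Pieces, " piece(s)\n";   // 61c48a62 -> 1 piece(s)
echo bin2hex($utf16), ' -> ', $utf16Pieces, " piece(s)\n"; // 0061010a0062 -> 2 piece(s)
```

If your file turns out to be UTF-16, converting it to UTF-8 first (for example with mb_convert_encoding) before splitting should avoid the spurious breaks.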

Try count(mb_split('(\r?\n)', $text)) and see what you get. My regex is poor though, so that might not work. See this question for more help on the regex you need to split on a newline:

Match linebreaks - \n or \r\n?
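As a rough sketch of a line count that does not depend on PHP_EOL, something like the following might work. Note that it uses preg_split rather than mb_split, that countLines is a hypothetical helper name, and that it assumes the text is UTF-8 (or another ASCII-compatible encoding). It treats \r\n, \r and \n as one break each and ignores the empty piece left by a trailing newline:

```php
<?php
// Hypothetical helper, assuming $text is UTF-8 or another ASCII-compatible encoding.
function countLines(string $text): int
{
    if ($text === '') {
        return 0;
    }
    // Try \r\n first so a Windows line ending counts as a single break.
    $parts = preg_split('/\r\n|\r|\n/', $text);
    // A file ending in a newline yields one trailing empty piece; drop it.
    if (end($parts) === '') {
        array_pop($parts);
    }
    return count($parts);
}

echo countLines("first\r\nsecond\nthird\n"), "\n"; // 3
```

Usage would be countLines(file_get_contents($filePath)). The pattern spells out the three sequences instead of using \R, because \R without the u modifier also matches a lone 0x85 byte, which can appear inside UTF-8 multi-byte characters.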

Remember that your line ending might be \u0085 (NEL, "next line"), but I doubt it: if that were the case you would be getting too few lines, whereas splitting on PHP_EOL is giving you too many.
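If you want to check which line terminators the file actually contains, a small diagnostic along these lines could help. It is only a sketch: countTerminators is a hypothetical helper, and it assumes the text is valid UTF-8.

```php
<?php
// Hypothetical helper: counts each kind of line terminator, assuming valid UTF-8 input.
function countTerminators(string $text): array
{
    return [
        'crlf' => substr_count($text, "\r\n"),
        'lf'   => preg_match_all('/(?<!\r)\n/', $text), // \n not preceded by \r
        'cr'   => preg_match_all('/\r(?!\n)/', $text),  // \r not followed by \n
        'nel'  => substr_count($text, "\u{0085}"),      // U+0085, bytes C2 85
        'ls'   => substr_count($text, "\u{2028}"),      // U+2028 line separator
        'ps'   => substr_count($text, "\u{2029}"),      // U+2029 paragraph separator
    ];
}

print_r(countTerminators("a\r\nb\nc\u{0085}d\re"));
```

A file with mixed endings (say, mostly \r\n plus some lone \n or \r) is one plausible way for two tools to disagree about the line count.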

If mb_split works, remember to use PHP's mb_ functions for all of your string manipulation on this text. PHP's standard string functions operate on bytes and so effectively assume single-byte characters; the separate mb_ functions handle multi-byte characters.
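A quick illustration of the difference, assuming the mbstring extension and a UTF-8 string (the sample word is made up):

```php
<?php
// "héllo": the é (U+00E9) takes two bytes in UTF-8.
$word = "h\u{00E9}llo";

$bytes = strlen($word);             // 6, counts bytes
$chars = mb_strlen($word, 'UTF-8'); // 5, counts characters

// substr() cuts in the middle of the é and leaves invalid UTF-8 behind.
$byteCut = substr($word, 0, 2);
// mb_substr() cuts on character boundaries and returns "hé".
$charCut = mb_substr($word, 0, 2, 'UTF-8');

var_dump($bytes, $chars, mb_check_encoding($byteCut, 'UTF-8'), $charCut);
```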
