fgets a UTF-8 txt file returns rubbish letters and true when file is blank

Question

I assume that this is due to the UTF-8 txt file format. The txt file is totally empty and when I tried fgets($file_handle), I get these rubbish letters:

How do I fix this? I want to check if the file is empty by using:

if ( !$file_data = fgets($file_handle) )
    // This code runs if file is empty

EDIT

This is a new file using encoding UTF-8:

Check out this answer for better ways to check if a file is empty: https://stackoverflow.com/questions/4857182/best-way-to-determine-if-a-file-is-empty-php — Bananaapple, Feb 26 '19 at 09:07
@Bananaapple `file_get_contents()` gives me the same rubbish letters. `file_size` returns `2` when it's totally blank. Both are not really viable, are they? How do I know that the size is gonna be `2` when it's clearly empty? What if it's like `4` or `16`? — Richard, Feb 26 '19 at 09:11
All this suggests that while the file appears empty it is in fact not. Try creating an empty file and testing against that - `touch somefilename` on linux based systems. I would expect that to correctly show as empty so really what you want to look at is not why the empty check fails but why your file has data in it. — Bananaapple, Feb 26 '19 at 09:25
@Bananaapple I've just created a new file again with encoding UTF-8. Look at the picture I've inserted in my OP. It's 1KB in size **even when it's a new file**. I've also mentioned that I suspect this is because of the encoding UTF-8 (now I'm pretty sure it is). — Richard, Feb 26 '19 at 09:27
How exactly are you creating your files? If it's through PHP it would probably help to add the relevant code to your question :-) — Bananaapple, Feb 26 '19 at 09:32
@Bananaapple It was created using Notepad. I saved the file normally and changed the encoding to UTF-8. — Richard, Feb 26 '19 at 09:35

score 2 · Accepted Answer · answered Feb 26 '19 at 09:59

This is caused by the BOM (Byte Order Mark) that Notepad writes at the start of the file so that the encoding can be detected:

Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.

From this article you can also see that:

The UTF-8 representation of the BOM is the (hexadecimal) byte sequence 0xEF,0xBB,0xBF

We should therefore be able to write a PHP function to account for this:

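function is_utf8_file_empty($filename)
{
    $file = fopen($filename, "rb");
    if ($file === false) {
        throw new RuntimeException("Cannot open $filename");
    }

    // Four bytes are enough to tell "BOM only" from "BOM plus content",
    // and reading a fixed length avoids fread() being asked for 0 bytes.
    $head = fread($file, 4);
    fclose($file);

    // Treat the file as empty if it has no bytes or only the UTF-8 BOM.
    return $head === "" || $head === "\xEF\xBB\xBF";
}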

Do be aware that this is specific to files created in the manner you described, and that it is just example code: you should definitely test it against large files, completely empty files, etc. and modify it where needed.
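If the goal is to keep the `fgets()` check from the question, one option is to strip the BOM from the first line before testing it. The following is only a sketch under that assumption; the helper name `read_first_line_without_bom` is made up for illustration and is not part of PHP.

```php
<?php
// Hypothetical helper: return the first line of a file with a leading
// UTF-8 BOM removed. Returns "" when there is nothing to read.
function read_first_line_without_bom(string $filename): string
{
    $handle = fopen($filename, "rb");
    if ($handle === false) {
        throw new RuntimeException("Cannot open $filename");
    }

    $line = fgets($handle);
    fclose($handle);

    // fgets() returns false at end of file, i.e. for a zero-byte file.
    if ($line === false) {
        return "";
    }

    // Drop the three BOM bytes 0xEF 0xBB 0xBF if the line starts with them.
    if (strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
        $line = (string) substr($line, 3);
    }

    return $line;
}
```

With this, `if (read_first_line_without_bom($path) === "")` should behave like the original check for both a zero-byte file and a file that contains only the BOM.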

Very interesting. From the link you provided above, I presume that if I were to deal with UTF-16, I'd have to check for these: `0xFE 0xFF`? — Richard, Feb 27 '19 at 02:04
There seem to be different BOMs for UTF-16 depending on the endianness so probably best to trial and error it against a UTF-16 file you intend to use or alternatively code it to be able to cope with all variations. — Bananaapple, Feb 27 '19 at 08:17
I see. That's kind of a hassle. Nevertheless, thank you for the answer! Have a good day. — Richard, Feb 27 '19 at 08:35
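For the "cope with all variations" route mentioned in the comments, a rough sketch could compare the start of the file against the common BOM signatures (UTF-8, UTF-16 and UTF-32 in both byte orders). The function names `bom_length` and `is_file_empty_ignoring_bom` are invented for this example, and the code has not been tested against files from every editor.

```php
<?php
// Hypothetical helper: length in bytes of a known BOM at the start of
// $bytes, or 0 if none is recognised.
function bom_length(string $bytes): int
{
    // Longer signatures come first because the UTF-32 LE BOM
    // begins with the same two bytes as the UTF-16 LE BOM.
    $signatures = [
        "\x00\x00\xFE\xFF", // UTF-32, big endian
        "\xFF\xFE\x00\x00", // UTF-32, little endian
        "\xEF\xBB\xBF",     // UTF-8
        "\xFE\xFF",         // UTF-16, big endian
        "\xFF\xFE",         // UTF-16, little endian
    ];

    foreach ($signatures as $bom) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return strlen($bom);
        }
    }

    return 0;
}

// Hypothetical helper: true if the file has no bytes besides an optional BOM.
function is_file_empty_ignoring_bom(string $filename): bool
{
    // Five bytes cover the longest BOM (4 bytes) plus one byte of content.
    $head = file_get_contents($filename, false, null, 0, 5);
    if ($head === false) {
        throw new RuntimeException("Cannot read $filename");
    }

    return strlen($head) === bom_length($head);
}
```

One caveat: the bytes `FF FE 00 00` are ambiguous, since they are also what a UTF-16 little-endian file starting with a NUL character looks like, so this sketch would report such a four-byte file as empty.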
