Can a UTF-8 file contain some characters which are not UTF-8?

Question

I am trying to import a file into some software, but it complains the file is not saved as UTF-8. I've checked my editor, gedit, and it claims it is being saved as such. I also tried saving as a Windows file, instead of Linux, but this did not help. So I cut the file into parts and found that 99% of the file is fine, but something within about 3 lines of text upsets the software. The file has many different languages in it, so lots of unusual symbols. Is it possible for some symbols in a document to not be from UTF-8?

possible duplicate of [UTF-8 validation](http://stackoverflow.com/questions/115210/utf-8-validation) — unwind, Jan 10 '12 at 12:31

score 2 · Answer 1 · edited May 23 '17 at 12:12

> Can a UTF-8 file contain some characters which are not UTF-8?

No, because then it would not be a UTF-8 file.

> I also tried saving as a Windows file, instead of Linux, but this did not help.

Line endings, whether Windows or Unix, are irrelevant to UTF-8.

> The file has many different languages in it, so lots of unusual symbols. Is it possible for some symbols in a document to not be from UTF-8?

No. All symbols (Unicode code points) can be represented in UTF-8. However, it is possible that some bytes in the file are not valid UTF-8.
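One way to test for invalid bytes is to decode the file strictly and see where it fails. This is a minimal Python sketch; the helper name `find_invalid_utf8` is made up for illustration, and it only reports the first offending sequence.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for the first invalid sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on the first bad sequence
    except UnicodeDecodeError as err:
        return err.start, data[err.start:err.end]
    return None


if __name__ == "__main__":
    good = "naïve Ａ".encode("utf-8")
    bad = b"abc\xe9def"  # 0xE9 is Latin-1 'é', not a complete UTF-8 sequence
    print(find_invalid_utf8(good))
    print(find_invalid_utf8(bad))
```

For a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes in. A result of `None` means the file is valid UTF-8 and the problem lies with the importing software.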

It's unlikely that gedit would output invalid UTF-8 while claiming to save as UTF-8, so there are a few possibilities:

1. A Unicode byte order mark (BOM) is being used that the importing software can't read.
2. A byte order mark is not used, and the importing software expects one.
3. The importing software isn't parsing UTF-8 correctly.
4. The importing software doesn't recognise all code points. See rodrigo's answer for more on this.
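The two BOM-related possibilities can be tried out by checking for a BOM and then importing a copy of the file with it added or removed. This is a rough Python sketch; the helper names are hypothetical, and `codecs.BOM_UTF8` is the standard library's constant for the bytes 0xEF 0xBB 0xBF.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading BOM (unchanged if there is none)."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


def add_utf8_bom(data: bytes) -> bytes:
    """Return the data with a leading BOM (unchanged if one is already there)."""
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data


if __name__ == "__main__":
    sample = "Ａbc".encode("utf-8")
    print(has_utf8_bom(sample))
    print(has_utf8_bom(add_utf8_bom(sample)))
```

If the import succeeds with one variant and fails with the other, that tells you which of the first two possibilities applies.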

I have narrowed the problem down to a single letter character "Ａ" at the very beginning of the file (the first character in the first line). It only causes problems if placed in the first line of the file. If placed anywhere else, there is no problem and the file imports successfully. — Village, Jan 10 '12 at 12:58
@Village: It's possible this is part of some byte order marker (possibility 1), but you'll need to provide more information. — Matt Joiner, Jan 10 '12 at 15:57

score 2 · Accepted Answer · answered Jan 10 '12 at 14:09

The character "Ａ" that you mention in a comment is:

U+FF21 FULLWIDTH LATIN CAPITAL LETTER A

And in UTF-8 is encoded as:

0xEF 0xBC 0xA1

You can check whether these are the bytes at the start of your file (most likely they are).
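Here is a small Python sketch of that check. The function name `starts_with_fullwidth_a` is just for illustration; with a real file you would pass in the result of `open(path, "rb").read(3)`.

```python
import unicodedata

FULLWIDTH_A = "\uFF21"
ENCODED = FULLWIDTH_A.encode("utf-8")  # the three bytes listed above


def as_hex(data: bytes) -> str:
    """Format bytes as '0xEF 0xBC 0xA1' style text."""
    return " ".join(f"0x{b:02X}" for b in data)


def starts_with_fullwidth_a(data: bytes) -> bool:
    """True if the data begins with the UTF-8 encoding of U+FF21."""
    return data.startswith(ENCODED)


if __name__ == "__main__":
    print(unicodedata.name(FULLWIDTH_A))
    print(as_hex(ENCODED))
    print(starts_with_fullwidth_a("Ａ first line".encode("utf-8")))
```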

If so, then it is a bug in your software. Maybe it tries to autodiscover the encoding or type of the file by looking at the first bytes of the file, and it gets confused somehow.

Maybe it sees the first byte (0xEF) and naively expects a BOM (Byte Order Mark), which in UTF-8 is 0xEF 0xBB 0xBF. The rest of the BOM is not there, so it throws an error.
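To make that guess concrete, here is a purely hypothetical Python sketch of what such a buggy check might look like. Nothing is known about the real software's code; `naive_sniff` is invented to reproduce the symptom, and `careful_sniff` shows the check done on all three bytes.

```python
import codecs


def naive_sniff(data: bytes) -> bytes:
    """Hypothetical buggy check: any leading 0xEF must be the start of a BOM."""
    if data[:1] == b"\xef":
        if data[:3] != codecs.BOM_UTF8:
            raise ValueError("file is not saved as UTF-8")
        return data[3:]
    return data


def careful_sniff(data: bytes) -> bytes:
    """Strip a BOM only when all three BOM bytes are present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    first = "Ａbc".encode("utf-8")   # starts with 0xEF 0xBC 0xA1
    later = "xＡbc".encode("utf-8")  # same character, not at the start
    print(careful_sniff(first) == first)
    print(naive_sniff(later) == later)
    try:
        naive_sniff(first)
    except ValueError as err:
        print("rejected:", err)
```

A check like the naive one would reject the file only when "Ａ" is the very first character, which matches what you describe in your comment.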

score 1 · Answer 3 · answered Jan 10 '12 at 12:33

Some programs do not handle some of the peculiarities of UTF-8 properly.

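For example, some programs fail to read/write a character outside the BMP as a single UTF-8 sequence of four bytes. Instead they write/expect the two halves of its UTF-16 surrogate pair as two separate three-byte sequences (a scheme sometimes called CESU-8), which is not valid UTF-8.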

Some programs cannot handle code points outside the BMP (the Basic Multilingual Plane, that is, the first 64K code points, U+0000 to U+FFFF) at all.

You should check if your file has any of these.
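A rough way to check for both cases in Python is sketched below. The function names are made up for illustration. The second helper relies on the `surrogatepass` error handler, which lets Python decode surrogate halves that strict UTF-8 decoding would reject.

```python
def non_bmp_chars(text: str):
    """List (index, 'U+XXXX') for every code point above U+FFFF."""
    return [(i, f"U+{ord(c):04X}") for i, c in enumerate(text) if ord(c) > 0xFFFF]


def has_encoded_surrogates(data: bytes) -> bool:
    """True if the bytes contain surrogate halves encoded one by one."""
    # "surrogatepass" tolerates encoded surrogates; other invalid bytes still raise
    text = data.decode("utf-8", errors="surrogatepass")
    return any(0xD800 <= ord(c) <= 0xDFFF for c in text)


if __name__ == "__main__":
    text = "aＡ\U0001F600"  # U+1F600 lies outside the BMP
    print(non_bmp_chars(text))

    proper = "\U0001F600".encode("utf-8")  # one four-byte sequence
    # The same character written as two separately encoded surrogate halves
    split = "\ud83d\ude00".encode("utf-8", errors="surrogatepass")
    print(has_encoded_surrogates(proper), has_encoded_surrogates(split))
```

Note that "Ａ" (U+FF21) is inside the BMP, so neither case applies to that character on its own.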
