
As far as I understand, NTFS supports Unicode filenames (UTF-16, as Microsoft claims?).

But the official MSDN documentation is very vague about which code page(s) are used to store filenames (file paths) on FAT-32.

Here it says that the OEM code page (CP437, I assume) is used to store filenames: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317748.aspx

But here it turns out that there can be different OEM codepages with CP437 being one of them: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317752.aspx

And we all know that utilities like mount support many more code pages for FAT than just the set of OEM code pages.

So what is the actual code page for FAT-32 filenames? Does it depend on the system code page at the time the FAT volume was created? Can FAT support true Double Byte Character Set code pages like UTF-16? Or are Multi Byte Character Set code pages like UTF-8 the limit?

And a more specific question: what happens when I use the CreateFileW function (which, as MSDN states, uses UTF-16 as the filename code page) to create a file on a FAT-32 volume?

jake.libber

2 Answers


You might have to experiment here. This is a great question, and I'm not 100% confident, but:

So what is the actual codepage for FAT-32 filenames? It depends on the system codepage at the time when FAT volume was created?

The "OEM codepage", whatever that is for the system.

Can FAT support true Double Byte Character Set codepages like UTF-16? Or Multi Byte Character Set codepages like UTF-8 is the limit?

No, I don't believe FAT is directly capable of either UTF-16 or UTF-8. That said, Microsoft stores the Unicode filename out of band, so a file has two filenames: a short 8.3 name and a long name. (This is also how you can have filenames longer than 8.3 characters.)

And more specific question: What happens when I use CreateFileW function (which, as MSDN states, use UTF-16 as filename codepage) to create a file on FAT-32 volume?

The Unicode filename, as passed to CreateFileW, is stored directly in the out-of-band (long) filename. It is also re-encoded into the OEM code page (whatever that happens to be on the system) and stored as the short name. If it cannot be converted into the OEM code page, or does not fit the 8.3 format, Windows will give the file a short name like FILENA~1.TXT.

Some citations for these answers:

First, this page tells us that the OEM code page != the Windows code page:

Non-Unicode applications that create FAT files sometimes have to use the standard C runtime library conversion functions to translate between the Windows code page character set and the OEM code page character set. With Unicode implementations of the file system functions, it is not necessary to perform such translations.

On a typical American system, the OEM code page is CP437, but the Windows code page is Windows-1252. (The FooA calls, I believe, use the Windows code page, which depends on the locale.)
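To make the OEM vs. Windows code page difference concrete, here is a small Python sketch. It only uses Python's built-in `cp437` and `cp1252` codecs as stand-ins for the two code pages on a US system; it does not touch a FAT volume or any Windows API.

```python
# The same single byte maps to different characters in the two code pages.
raw = b"\xe4"

as_oem = raw.decode("cp437")    # OEM code page on a typical US system
as_ansi = raw.decode("cp1252")  # Windows code page on the same system

print(repr(as_oem), repr(as_ansi))
```

This is why a short name written through one code page can look wrong when it is read back through the other.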

If you have a FAT volume available, you can see this in action. The character "Σ" (U+03A3) is not present in Windows-1252, but it is present in CP437. You can see both the short and long filenames with dir /X. With a file named asdfΣ.txt, you'll see:

ASDFΣ.TXT    asdfΣ.txt

However, with a file named "asdfΛ.txt" (Λ is not present in either CP437 or Windows-1252), you'll see:

ASDF~1.TXT   asdf?.txt

(You'll likely see ?, because cmd.exe's font cannot display a Λ.)
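If you don't have a FAT volume handy, you can at least check which of these names fit into which code page. The following is a rough Python sketch; `fits_code_page` is a hypothetical helper name, and Python's strict encoding is only an approximation of what Windows does (in particular, it performs no "best fit" substitution).

```python
def fits_code_page(name: str, code_page: str) -> bool:
    """Return True if every character of name exists in code_page."""
    try:
        name.encode(code_page)  # strict mode: raises on unmappable characters
        return True
    except UnicodeEncodeError:
        return False


# Sigma, Lambda and the copyright sign, as in the examples in this answer.
names = ("asdf\u03a3.txt", "asdf\u039b.txt", "asdf\u00a9.txt")

for name in names:
    print(
        name,
        "CP437:", fits_code_page(name, "cp437"),
        "Windows-1252:", fits_code_page(name, "cp1252"),
    )
```

A name that does not fit the OEM code page is the case where you would expect a generated short name such as ASDF~1.TXT.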

For information about long filenames, see this Wikipedia article.

Also, interestingly, if you name a file "asdf©.txt", you might get:

ASDFC.TXT    asdfc.txt

… I'm not 100% sure here, but I think Windows cleverly decided to substitute "c" for ©, and did likewise for displaying it. If you change the font to something not raster based, like Consolas, you'll see:

ASDFC.TXT    asdf©.txt

And this is why you should use the FooW functions.

Thanatos
  • Thanks a lot for your answer. The main problem is solved I think: Unicode filenames are also stored on FAT. But just out of curiosity I'd like to clarify several more points: – jake.libber Oct 22 '13 at 20:54
  • Is the actual OEM codepage stored somewhere on the FAT volume (not the page itself but its id, of course)? Or is the actual OEM codepage unimportant (because they are all SBCSs) and only needed to display filenames correctly in the console? – jake.libber Oct 22 '13 at 21:08
  • What happens with Unicode surrogate pairs in LFN filenames? Are they truncated to single 2-byte words (because the actual LFN codepage is UCS-2 and not UTF-16) or stored as-is? – jake.libber Oct 22 '13 at 21:08
  • I'm guessing that "which OEM codepage" is not stored in FAT, but rather, that it's just assumed. I would look at the FAT specs themselves. If it is stored, that's what would tell you. I took a quick peek at FAT's layout, and couldn't find anything appropriate. Also, the Linux FAT drivers have an option to specify the code page, but assume CP437 otherwise. (See `man mount`.) – Thanatos Oct 22 '13 at 22:57
  • LFN filenames are stored as UTF-16, meaning that characters outside the BMP should be stored as surrogate pairs. The Wikipedia article on LFN says UTF-16, so let's hope it is accurate. Windows does allow me to use non-BMP characters in filenames on FAT, and these would have to be encoded using surrogate pairs. – Thanatos Oct 22 '13 at 23:05
  • These links might help add some clarity: http://blogs.msdn.com/b/oldnewthing/archive/2011/08/26/10200583.aspx and http://home.teleport.com/~brainy/lfn.htm – rkagerer Aug 01 '14 at 07:47

The basic FAT or FAT32 directory entries support only short names (the old DOS 8.3 format) in the current OEM codepage. However, VFAT (FAT with long filename support) which is used while under Windows, can store an additional, so-called long filename for each file, in UTF-16.
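As a rough illustration of what "long filename in UTF-16" means in practice, here is a Python sketch that counts the 16-bit code units of a name and estimates how many long-name directory entries it would need. It assumes the usual VFAT layout of 13 code units per long-name entry; `utf16_units` and `lfn_entries` are hypothetical helper names, and nothing here reads or writes a real volume.

```python
import math

# Assumption: each VFAT long-name directory entry holds 13 16-bit code units.
LFN_UNITS_PER_ENTRY = 13


def utf16_units(name: str) -> int:
    """Number of 16-bit code units needed to store name as UTF-16."""
    return len(name.encode("utf-16-le")) // 2


def lfn_entries(name: str) -> int:
    """Estimated number of long-name directory entries for name."""
    return math.ceil(utf16_units(name) / LFN_UNITS_PER_ENTRY)


bmp_name = "asdf\u03a3.txt"   # every character is inside the BMP
astral_char = "\U0001F600"    # outside the BMP, needs a surrogate pair

print(utf16_units(bmp_name), lfn_entries(bmp_name))
print(utf16_units(astral_char), astral_char.encode("utf-16-le").hex())
```

A character outside the BMP takes two code units, so it counts twice against the space available in the long-name entries.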

Igor Skochinsky
  • Thanks for pointing out the VFAT LFN article! It clarifies some details about the actual codepage (so it's UCS-2). But @Thanatos provided a more extensive explanation of the topic. – jake.libber Oct 22 '13 at 20:52
  • Although the documentation says that FAT32 is UCS-2, it is likely that Explorer actually treats them as UTF-16. (I don't believe the console window supports the full Unicode set because it is based on a 16-bit-per-character array; it presumably supports only UCS-2.) – Harry Johnston Oct 23 '13 at 01:29