24

What does it mean when I save a text file as "Unicode" in Notepad? Is it UTF-8, UTF-16, or UTF-32? Thanks in advance.

FSm
  • Probably UTF-8, as that is the most common. – Linuxios Dec 15 '12 at 18:24
  • 1
    @Linuxios, but there is another option named UTF-8 in Notepad's Save dialog. How could it be listed twice? – FSm Dec 15 '12 at 18:26
  • possible duplicate of [What is Java's equivalent of Windows Notepad "Unicode Encoding"?](http://stackoverflow.com/questions/13602440/what-is-javas-equivalent-of-windows-notepad-unicode-encoding) – Esailija Dec 15 '12 at 19:28

2 Answers

34

In Notepad, as in Windows software in general, “Unicode” as an encoding name means UTF-16 Little Endian (UTF-16LE). (I first thought it was not real UTF-16, because Notepad++ recognizes it as UCS-2 and shows the content as garbage, but after re-checking with BabelPad, I concluded that Notepad can encode even non-BMP characters correctly.)

Similarly, “Unicode big endian” means UTF-16 Big Endian. And “ANSI” means the system’s native legacy encoding, e.g. the 8-bit windows-1252 encoding in Western versions of Windows.
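The label-to-encoding mapping described above can be illustrated with a minimal Python sketch. The codec names (`utf-16-le`, `utf-16-be`, `utf-8`, `cp1252`) are Python's, not Notepad's, and `encode_like_notepad` is a hypothetical helper. The sketch leaves out any byte order mark that Notepad may write at the start of the file, and it assumes a Western system for “ANSI”.

```python
# Notepad "Save as" label -> Python codec name.
# The mapping follows the answer above; "ANSI" assumes a Western
# (windows-1252) system. No byte order mark is added here.
NOTEPAD_ENCODINGS = {
    "Unicode": "utf-16-le",
    "Unicode big endian": "utf-16-be",
    "UTF-8": "utf-8",
    "ANSI": "cp1252",
}


def encode_like_notepad(text, label):
    """Hypothetical helper: encode text with the codec behind a Notepad label."""
    return text.encode(NOTEPAD_ENCODINGS[label])


sample = "é€"  # U+00E9 and U+20AC, both representable in windows-1252

for label in NOTEPAD_ENCODINGS:
    # Show the raw bytes as hex so the byte order difference is visible.
    print(f"{label:20} {encode_like_notepad(sample, label).hex()}")
```

The two UTF-16 variants contain the same byte pairs in swapped order, while UTF-8 and windows-1252 produce different byte sequences for the same two characters.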

Jukka K. Korpela
  • @Jukka K. Korpela, which one covers more characters: “Unicode” or UTF-8? – FSm Dec 15 '12 at 18:49
  • 3
    UTF-16 and UTF-8 cover exactly the same characters; they are just two transfer encodings for Unicode. Windows uses the name “Unicode” for UTF-16 just because it internally uses UTF-16 for Unicode. – Jukka K. Korpela Dec 15 '12 at 18:51
  • 3
    @Qaesar Every UTF can encode all of Unicode. – melpomene Dec 15 '12 at 18:51
  • @Jukka K. Korpela, I'm working with Kurdish, an Indo-European language. For good text processing, which encoding should I save my file in: “Unicode” or UTF-8? Are they the same? – FSm Dec 15 '12 at 18:59
  • 2
    @Qaesar, any reasonable text processing software can read both UTF-16 (Windows “Unicode”) and UTF-8 and will convert to its internal representation if needed. If you write your own program code, you just need to select suitable input reading routines from a library. So it does not really matter much. Windows software internally uses UTF-16. But for web pages, UTF-8 should be used (UTF-16 is poorly supported by browsers and search engines). – Jukka K. Korpela Dec 15 '12 at 19:17
  • utf8everywhere.org summarizes it all. Unfortunately, Notepad does the wrong thing by default. Still, you can save standard UTF-8 files by selecting the right option when saving. – Pavel Radzivilovsky Dec 16 '12 at 08:25
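The point made in the comments, that every UTF can encode all of Unicode and only the byte layout differs, can be checked with a small Python sketch. The sample characters and the helper names `round_trips` and `encoded_lengths` are illustrative assumptions, not anything from the thread.

```python
UTFS = ["utf-8", "utf-16-le", "utf-32-le"]


def round_trips(text):
    """Return, per encoding, whether encode followed by decode restores text."""
    return {enc: text.encode(enc).decode(enc) == text for enc in UTFS}


def encoded_lengths(text):
    """Return the number of bytes each encoding needs for text."""
    return {enc: len(text.encode(enc)) for enc in UTFS}


# An ASCII letter, two Latin letters used in Kurdish, and a non-BMP character.
for ch in ["A", "ê", "ş", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", round_trips(ch), encoded_lengths(ch))
```

All three encodings round-trip every sample. Only the sizes differ, so the choice between them is about compatibility and file size rather than character coverage.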
-4

All of these formats are "Unicode". But usually editors on Mac and Windows mean UTF-8 by that, because it is ASCII-compatible for codes below 128, IIRC. UTF-8 can represent far more than the 256 values that fit in a single 8-bit byte by using multi-byte sequences: the high bits of the first byte signal how many following bytes belong to the same character.

If you look at the output in a terminal, say with vi, and you see a gap (typically shown as a space or ^@) between every two characters, then you are probably looking at UTF-16: each character takes at least two bytes there, and for ASCII characters one of them is a zero byte. If the characters have no gaps between them, that's an indication of UTF-8.
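Rather than judging by eye, the gaps can be made visible as zero bytes, and a byte order mark at the start of the file, when present, can be checked directly. The following Python sketch assumes the file begins with a BOM. `sniff_bom` is a hypothetical helper, and it returns `None` when no BOM is found instead of guessing.

```python
import codecs

# UTF-32 is checked first because its little-endian BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Hypothetical helper: name the encoding implied by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# ASCII text in UTF-16LE has a zero byte after every character,
# which is what shows up as gaps in a terminal.
print("Hi".encode("utf-16-le"))  # b'H\x00i\x00'
print("Hi".encode("utf-8"))      # b'Hi'
print(sniff_bom(codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")))
```

A BOM check of this kind is only a heuristic: files without a BOM need other evidence, and a UTF-16LE file whose first character is U+0000 would be misread as UTF-32LE.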

Cocoanetics
  • 1
    @Cocoanetics, if editors meant UTF-8, why is there another option named UTF-8 in Notepad's Save dialog? – FSm Dec 15 '12 at 18:34
  • ... because Windows is weird. When Windows started to dip its toes into Unicode, it first embraced UTF-16 (see the other answer) and called this "Unicode". Over the long run, though, UTF-8 started to be used everywhere and has become the de facto standard. – Cocoanetics Dec 15 '12 at 22:32