
The basic question is: how does Notepad (or another basic text editor) store data? I ran into this because I was comparing the file sizes produced by different compression techniques and realized something wasn't quite right.

To elaborate:

If I save a text file with the following contents:

a

The file is 1 byte. That byte happens to be 97, or 0x61.

Next, I create a text file with the following contents:

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

That is all the characters from 0 to 255, or 0x00 to 0xFF. This file is 256 bytes: 1 byte for each character. This makes sense to me.

Then I append the following character to the end of the above string:

†

A character not contained in the above string: all 256 single-byte values were already used. This character is 8224, or 0x2020. A 2-byte character.

And yet, the file size has only changed from 256 to 257 bytes. In fact, the above character saved by itself shows as only 1 byte.

What am I missing?

Edit: Please note that in the second text block, many of the characters do not render here.
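For anyone who wants to reproduce this, here is a minimal Python sketch of the same experiment. It assumes the files are saved the way Notepad's "ANSI" option saves them, i.e. with codepage 1252 on a Western-locale Windows machine (the answers below explain why that matters):

```python
import os

# Minimal reproduction sketch. Assumes Notepad's "ANSI" save option,
# which corresponds to codepage 1252 on Western-locale Windows machines.
with open("one.txt", "wb") as f:
    f.write("a".encode("cp1252"))        # the single byte 0x61
print(os.path.getsize("one.txt"))        # -> 1

with open("all.txt", "wb") as f:
    f.write(bytes(range(256)))           # every byte value 0x00..0xFF
    f.write("\u2020".encode("cp1252"))   # DAGGER encodes as the single byte 0x86
print(os.path.getsize("all.txt"))        # -> 257, not 258
```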

asked by qoou; edited by Thomas Dickey
    What are you using to check the size of the file? – Chris Martin May 17 '16 at 05:05
  • @ChrisMartin Windows explorer / file properties – qoou May 17 '16 at 05:11
  • Whatever tool you're using to report the size is, at least apparently, reporting the size in characters. – David Schwartz May 17 '16 at 05:17
  • @DavidSchwartz What tool should I be using to view file size? Again I'm just using a basic OS properties viewer. – qoou May 17 '16 at 05:31
  • It depends on the charset used to store the file. One important thing is that code points less than 32 are not displayable because they are control codes, so your text above isn't 256 characters. Moreover, Windows programs may stop processing text files when they see 0x1A – phuclv May 17 '16 at 05:34
  • [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html) – Remy Lebeau May 18 '16 at 19:55

2 Answers


In ANSI encoding (the 8-bit Microsoft-specific encoding), each character is stored in one byte (8 bits).

ANSI is also called Windows-1252, or Windows Latin-1.

You should have a look at the ANSI table in the ANSI Character Codes Chart or Windows-1252.

So for the † character, its code is 134, byte 0x86.
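A quick way to verify that mapping (a small Python sketch, using Python's `cp1252` codec as a stand-in for the Windows-1252 table):

```python
print("\u2020".encode("cp1252"))   # b'\x86' -> one byte, decimal 134
print(ord("\u2020"))               # 8224    -> the Unicode code point, U+2020
```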

answered by Siyavash Hamdi; edited by Remy Lebeau
  • Not all Ansi encodings are limited to 1 byte per character. *Certain* encodings are, like `Windows-1252` and `Latin-1`, but there are also multi-byte Ansi encodings. – Remy Lebeau May 18 '16 at 19:52
  • @RemyLebeau, would you please give a reference? – Siyavash Hamdi May 18 '16 at 19:59
  • Any Ansi encoding that uses lead bytes and trail bytes for multi-byte sequences, such as the DBCS codepages used for Chinese and Japanese; see the sketch below. – Remy Lebeau May 18 '16 at 20:16
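To illustrate Remy's point, here is a small sketch using Python's `cp932` codec (Shift-JIS, one of the Japanese DBCS codepages); it is just one example of a multi-byte Ansi encoding:

```python
# ASCII characters stay 1 byte; kana and kanji take a lead byte plus a trail byte.
print(len("a".encode("cp932")))    # 1
print("\u3042".encode("cp932"))    # b'\x82\xa0' -> HIRAGANA LETTER A, 2 bytes
```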

Using one byte to encode a character only makes sense on the surface. It works okay if you speak English, but it is a fair disaster if you speak Chinese or Japanese. Unicode today has definitions for 110,187 typographic symbols, with room to grow up to 1.1 million. A byte is not a good way to store a Unicode symbol, since it can encode only 256 distinct values.

Accordingly, text editors must always encode text when they store it in a file. An encoding is required to map 110,187 distinct values onto a byte-oriented storage medium. Inevitably that takes more than 1 byte per character if you speak Chinese.

There have been lots and lots of encoding schemes in common use. Popular in the previous century were code pages: a scheme that uses a language-specific character set, a mapping that tries as hard as it can to need only 1 byte of storage per character by picking the 256 characters most likely to be needed in the language. Japanese, Korean and Chinese used multi-byte mappings because they had to; other languages used 1 byte per character.

Code pages have been an enormous disaster: a program cannot properly read a text file that was encoded in another language's code page. It worked while text files stayed close to the machine that created them; the Internet in particular broke that usage. Japanese was particularly prone to this disaster, since it had more than one code page in common use. The result is called mojibake: the user looks at gibberish in the text editor. Unicode came around in 1992 to try to solve this disaster. One new standard to replace all the other ones, which tends to invoke another kind of disaster.
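The mojibake mechanism is easy to demonstrate (a Python sketch; cp1251 is the Cyrillic code page, picked here just as an example of a wrong guess):

```python
raw = "r\u00e9sum\u00e9".encode("cp1252")   # b'r\xe9sum\xe9' ("résumé")
print(raw.decode("cp1252"))                 # résumé (the intended text)
print(raw.decode("cp1251"))                 # rйsumй (read as Cyrillic: gibberish)
```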

You are subjected to that kind of disaster, particularly if you use Notepad, a program that tries to be compatible with text files that were created in the past 30 years. Google "bush hid the facts" for a hilarious story about that. Note the dialog you get when you use File > Save As: it has an extra combobox titled "Encoding". The default is ANSI, a broken name from the previous century that means "code page". As you found out, that character indeed needs only 1 byte in your machine's default code page. Which code page that is depends on where you live; it is 1252 in Western Europe and the Americas. You'd see 0x86 if you looked at the file with a hex viewer.
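You can check both claims from a script (a Python sketch; `locale.getpreferredencoding()` reports the machine's default, which may differ from cp1252 outside Western locales):

```python
import locale

print(locale.getpreferredencoding())    # e.g. 'cp1252' on Western-locale Windows
print("\u2020".encode("cp1252").hex())  # '86' -> what a hex viewer would show
```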

Given that the dialog gives you a choice, you should not favor ANSI's mojibake anymore; always favor UTF-8 instead. Maybe they'll update Notepad some day so it uses a better default, but that is very hard to do.
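For comparison, a sketch of what the question's characters look like in each encoding (sizes only; Notepad may additionally prepend a byte-order mark when saving UTF-8):

```python
text = "a\u2020"                       # 'a' plus the dagger
print(len(text.encode("cp1252")))      # 2 -> one byte each, but codepage-specific
print(len(text.encode("utf-8")))       # 4 -> 1 + 3 bytes, readable on any machine
```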

answered by Hans Passant