
Notepad has an option to save as ANSI, but it does not seem to work, at least not in the versions I have tried (see below).

When I choose this option Unicode code points are still rendered, not ANSI. The option seems pretty intuitive. Am I misunderstanding how this is supposed to work? Do I need to do something else first?

For example, if I paste the following text into Notepad, with the save as ANSI option selected in Notepad, Unicode Code Points like curvy quotes are rendered anyway.

1.  This is a – long dash
2.  “Smart Quotes”
3.  ‘Smart Quotes’

•   Copyright symbol © 
•   Fraction ¾

The functionality I am looking for does exist in other text editors, eg, Notepad++. I would like for the text to appear like this:

1.  This is a – long dash
2.  “Smart Quotesâ€
3.  ‘Smart Quotes’

• Copyright symbol © 
• Fraction ¾

The above was achieved by switching encoding in Notepad++.

Note: I only show Notepad++ as an example of how I think this Notepad should (used to?) work. Unfortunately I am stuck with Notepad.

Edit: I would also be ok with question mark replacements, something like:

1.  This is a ?? long dash
2.  ??Smart Quotes??
3.  ??˜Smart Quotes??

??   Copyright symbol ??
??   Fraction ??

I believe the above is how Notepad used to work.

sse
  • ANSI is not the same thing as ASCII. ANSI can still render certain non-ASCII Unicode characters, depending on the particular ANSI codepage that is being used. Your Notepad++ example is displaying UTF-8 encoded text as if it were ANSI instead of UTF-8. – Remy Lebeau May 08 '19 at 20:23
  • ANSI doesn't suck enough to make you happy. All of the characters you tried to make it trip up do in fact have a valid character code. https://en.wikipedia.org/wiki/Windows-1252#Character_set Since it is the default code page on your machine, they also show up correctly. You'll have to save as UTF8 and write a program to mangle it, any C or C++ program usually qualifies without any help. – Hans Passant May 08 '19 at 22:04
  • @HansPassant Thank you for pointing this out. There are indeed characters in my example that are not in the 1252 character set, e.g., long dash, curvy double quotes, curvy single quotes, right? – sse May 09 '19 at 19:07
  • @RemyLebeau I removed the reference to ASCII, I meant to write ANSI. – sse May 09 '19 at 19:10
  • "ANSI" is a misnomer. It's a character set (or collection of character sets) similar to Latin-1. Microsoft submitted it to ANSI for standardization, but it never became an ANSI standard. It's an 8-bit extended ASCII that includes characters like opening and closing double quotes and em-dash. – Keith Thompson May 09 '19 at 19:36

1 Answer


The short answer is it does work, and you are indeed saving the file as ANSI. Now for the long answer.


When I choose this option Unicode code points are still rendered, not ANSI.

First, to be precise, ANSI is not a single fixed encoding, but given the specifics in this question, it's consistent with ANSI = Windows-1252, which I'll assume for the rest of this answer.

Second, character sets are not mutually exclusive. In this case, all the characters you have demonstrated (en-dash, various smart quotes, bullet point etc.) exist in both Unicode and Windows-1252. So it's fully expected that these characters are handled correctly when you save the file as ANSI, just as they would be in any Unicode encoding.
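
As a rough check of that claim, here is a minimal Python sketch. It assumes Python's "cp1252" codec is a fair stand-in for what Notepad calls ANSI on this machine; the variable names are only illustrative.

```python
# Characters from the question: en dash, curly quotes, bullet,
# copyright sign, and the three-quarters fraction.
sample = "–“”‘’•©¾"

# "cp1252" is Python's name for the Windows-1252 code page.
encoded = sample.encode("cp1252")

# Every character maps to exactly one byte, so nothing is lost or replaced.
print(encoded.hex(" "))  # 96 93 94 91 92 95 a9 be

# Decoding the bytes with the same code page gives the original text back.
round_trip = encoded.decode("cp1252")
print(round_trip == sample)  # True
```

Because the encode step succeeds without an error handler, none of these characters needed a fallback such as a question mark.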

The functionality I am looking for does exist in other text editors, eg, Notepad++. I would like for the text to appear like this:

1.    This is a – long dash
2.    “Smart Quotesâ€
3.    ‘Smart Quotes’

•   Copyright symbol © 
•   Fraction ¾

Why do you want this? It's mojibake, which is usually something people are seeking to fix, not reproduce. I don't ask this to be difficult, but knowing why you want this reproduced might lead to a different way of accomplishing the same goal.

The above was achieved by switching encoding in Notepad++.

Yes, you've switched the encoding from UTF-8 to ANSI. Text files don't themselves have an inherent encoding; rather, an encoding is used while reading and writing text files. Notepad++ defaults to UTF-8, so as you are initially typing, that's the character encoding being used to write the text. Then when you switch to ANSI, you are reading the underlying data under the new encoding, which is not the one you wrote it in.

To take just the bullet point character as a concrete example, in UTF-8, a bullet point character is represented by the three bytes E2 80 A2. But in Windows-1252, E2 means "â", 80 means "€" and A2 means "¢", which is why you are seeing those exact characters in place of the bullet point character when interpreting the text as ANSI.
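
The same effect can be reproduced outside any editor. The following is a small Python sketch, assuming "cp1252" stands in for ANSI; the helper name misread_utf8_as_cp1252 is made up for this illustration.

```python
def misread_utf8_as_cp1252(text: str) -> str:
    """Write text as UTF-8, then read the same bytes back as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


# The bullet point U+2022 is three bytes in UTF-8.
bullet_bytes = "•".encode("utf-8")
print(bullet_bytes.hex(" "))  # e2 80 a2

# Read as Windows-1252, each of those bytes becomes its own character.
print(misread_utf8_as_cp1252("•") == "â€¢")  # True

# The copyright sign is two bytes in UTF-8 (c2 a9), hence the stray "Â".
print(misread_utf8_as_cp1252("©") == "Â©")  # True
```

Nothing about the bytes changes between the two steps; only the encoding used to interpret them does.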

Note: I only show Notepad++ as an example of how I think this Notepad should (used to?) work.

It's possible Notepad used to work like that in previous versions, though I would have to guess it would have needed to be a really old version before Unicode support. Note that Notepad basically guesses at the intended encoding of the file to decide what to show you and that guessing algorithm has been updated over the years. See for example the infamous "Bush hid the facts" bug.

I would also be ok with question mark replacements

The question mark is a common replacement character when the underlying data is not compatible with the encoding being used for reading, at least in non-Unicode contexts. If you can get Notepad to interpret the text as Windows-1252 and you throw in an undefined byte (in Windows-1252, only bytes 81, 8D, 8F, 90, and 9D are undefined), you might be able to get question marks there.

DPenner1