
I am trying to convert just one file from UTF-8 to ASCII. I found the following script online, and it creates the Out-File but it does not change the encoding to ASCII. Why is this not working?

Get-Content -Path "File/Path/to/file.txt" | Out-File -FilePath "File/Path/to/processed.txt" -Encoding ASCII
Crimp
  • Because at the bottom of the text file it says UTF-8. I was under the impression that at the bottom of a txt file it would say ASCII (ANSI) if it was ASCII – Crimp Oct 25 '22 at 20:43
    As an aside: ANSI is a different (group of) encoding(s), each of them also a _superset_ of ASCII, as UTF-8 is. – mklement0 Oct 25 '22 at 21:01

1 Answer


tl;dr

-Encoding ASCII does work, though your editor's GUI may still report the resulting file as UTF-8-encoded, for the reasons explained below.


First, a general caveat:

  • If your input file also contains non-ASCII-range characters, they will be transliterated to verbatim ?, i.e. you'll potentially lose information.
  • Conversely, if your input files are UTF-8-encoded but do not contain non-ASCII characters, they in effect already are ASCII-encoded files; see below.

ASCII encoding is a subset of UTF-8 encoding (except that ASCII encoding never involves a BOM).

  • Therefore, any (BOM-less) file composed exclusively of bytes representing ASCII characters is by definition also a valid UTF-8 file.

Modern editors default to BOM-less UTF-8; that is, if a file doesn't start with a BOM, they assume that it is UTF-8-encoded, and that's what their GUIs reflect - even if a given file happens to be composed of ASCII characters only.
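The byte-level reason can be sketched outside PowerShell; a minimal Python illustration (for demonstration only, not part of the question's PowerShell workflow) showing that ASCII-range characters encode to identical bytes under both encodings:

```python
# ASCII-range characters encode to the same single bytes in ASCII and UTF-8,
# so a BOM-less file containing only such characters is valid under both.
text = "cafe"  # ASCII-only content
assert text.encode("ascii") == text.encode("utf-8")

# A non-ASCII character breaks the equivalence: UTF-8 uses a multi-byte
# sequence, while strict ASCII encoding cannot represent it at all.
print("café".encode("utf-8"))  # b'caf\xc3\xa9'
try:
    "café".encode("ascii")
except UnicodeEncodeError:
    print("'é' is not representable in ASCII")
```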


To verify that your output file is indeed only composed of ASCII characters, use the following:

# This should return $false; '\P{IsBasicLatin}' matches any NON-ASCII character.
(Get-Content -Raw File/Path/to/processed.txt) -cmatch '\P{IsBasicLatin}'

For an explanation of this test, especially with respect to needing to use -cmatch, the case-sensitive variant of the -match operator, see this answer.
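For comparison, the same test can be sketched in Python (illustration only; Python's `re` module has no `\P{IsBasicLatin}` block escape, so the explicit code-point range `[^\x00-\x7F]` stands in for "any non-ASCII character"):

```python
import re

# Matches any character outside U+0000..U+007F, i.e. any non-ASCII character;
# equivalent in intent to PowerShell's case-sensitive '\P{IsBasicLatin}' test.
non_ascii = re.compile(r"[^\x00-\x7F]")

assert non_ascii.search("cafe") is None       # pure ASCII: no match
assert non_ascii.search("café") is not None   # 'é' is non-ASCII: match
```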


A complete example:

# Write a string that contains non-ASCII characters to a
# file with -Encoding Ascii
# The resulting file will contain 1 line, with content 'caf?'
# That is, the "é" character was "lossily" transliterated to (ASCII) "?"
'café' | Out-File -Encoding Ascii temp.txt

# Examining the file for non-ASCII characters now indicates that
# there are none, i.e., $false is returned.
(Get-Content -Raw temp.txt) -cmatch '\P{IsBasicLatin}'
mklement0
  • Interesting! Thank you! That update did return `processed.txt` as false this time around. Out of curiosity, I ran `(Get-Content -Raw File/Path/to/file.txt) -cmatch '\P{IsBasicLatin}'` on the original file (`file.txt`), and that also returned false. Why would this be? – Crimp Nov 02 '22 at 12:25
  • @Crimp, that implies that the original file contained no non-ASCII characters either (though, at least in principle, it could use a different character encoding, such as UTF-16LE ("Unicode"), which PowerShell automatically supports if such a file starts with a BOM). Btw, it's both `i` and `k` characters that cause problems with `-match`; you can find an explanation in [this answer](https://stackoverflow.com/a/63023639/45375). – mklement0 Nov 02 '22 at 13:28