
Is there a command to determine the encoding of a file in Windows?
For example, to find out that the encoding of a file A.txt is UTF-16.

Optimus Prime
    Possible duplicate of [Get encoding of a file in Windows](http://stackoverflow.com/questions/3710374/get-encoding-of-a-file-in-windows) – Johan Willfred Mar 31 '17 at 08:45
  • Possible duplicate of [How to check if a .txt file is in ASCII or UTF-8 format in Windows environment?](http://stackoverflow.com/questions/6947749/how-to-check-if-a-txt-file-is-in-ascii-or-utf-8-format-in-windows-environment) – STLDev Mar 31 '17 at 08:48
  • At least read my question. – Optimus Prime Mar 31 '17 at 08:50
  • I am afraid there is no such command... You can however detect whether or not a file contains NULL-bytes in the first line, in which case it is most likely not ASCII-/ANSI-encoded: `(for /F usebackq^ delims^=^ eol^= %L in ("textfile.txt") do @rem/) || echo NULL-bytes detected!` – aschipfl Mar 31 '17 at 11:08
  • You have to know what the encoding is by some communication, specification, documentation or convention. If you are getting the file using something like curl or wget, you can look for the Content-Type response header. In general, no. My system has about 140 different character encodings and depending on the sample file, many, many of them might not error out when decoding the file. It would never be the case that just one would not error out. Or, you can guess. – Tom Blodget Apr 01 '17 at 13:49
  • Possible duplicate of [finding encoding type using batch script](https://stackoverflow.com/q/16235837) – aschipfl Jan 14 '22 at 22:34

1 Answer


In the Windows command prompt (cmd), there is no command I know of that is capable of determining how a text file is encoded.

Nevertheless, I wrote a small batch file that checks a few conditions and thus determines whether a given text file is ASCII-/ANSI-encoded or Unicode-encoded (UTF-8 or UTF-16, Little Endian or Big Endian). First, it checks whether the first (non-empty) line contains zero-bytes, which is an indication that the file is not ASCII-/ANSI-encoded. Next, it checks whether the first few bytes constitute the Byte Order Mark (BOM) for UTF-8/UTF-16. Since the BOM is optional for Unicode-encoded files, its absence is not a clear sign of an ASCII-/ANSI-encoded file.
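As a quick manual cross-check (assuming `certutil` is available, which should be the case on current Windows versions), you can dump the first bytes of the file in hexadecimal and compare them against the known BOMs, namely EF BB BF for UTF-8, FF FE for UTF-16 Little Endian and FE FF for UTF-16 Big Endian; the output file name A.hex is just an arbitrary choice:

rem // Write a hexadecimal listing of the file to a temporary file and display it:
certutil -encodehex "A.txt" "A.hex" && type "A.hex"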

So here is the code, featuring a lot of explanatory remarks (rem); I hope it helps:

@echo off
setlocal EnableExtensions DisableDelayedExpansion

rem // Define constants here:
set "_FILE=%~1" & rem // (provide file via the first command line argument)

rem // Check whether a dedicated file is given (so no wild-cards):
2> nul >&2 (< "%_FILE%" set /P ="" & ver) || (
    rem // The file does not exist:
    >&2 echo The file could not be found, hence there is no encoding!
    exit /B 255
)

rem // Determine the file size:
set "SIZE=" & for %%F in ("%_FILE%") do set "SIZE=%%~zF"
if not defined SIZE (
    rem // The file does not exist:
    >&2 echo The file could not be found, hence there is no encoding!
    exit /B 255
)
if %SIZE% EQU 0 (
    rem // The file is empty:
    >&2 echo The file is empty, hence encoding cannot be determined!
    exit /B 1
)

rem // Store current code page to be able to restore it finally:
for /F "tokens=2 delims=:" %%C in ('chcp') do set /A "$CP=%%C"
rem /* Change to code page 437 (original IBM PC or DOS code page) temporarily;
rem    this is necessary for extended characters not to be converted: */
> nul chcp 437

rem // Attempt to read first line from file; this fails if zero-bytes occur:
(
    rem /* The loop does not iterate over an empty file or one with empty lines only;
    rem    therefore, the behaviour is the same as when zero-bytes occur: */
    for /F usebackq^ delims^=^ eol^= %%L in ("%_FILE%") do (
        rem // Abort reading file after first non-empty line:
        goto :NEXT
    )
) || (
    rem /* The `for /F` loop returns a non-zero exit code in case the file is empty,
    rem    contains empty lines only or the first non-empty line contains zero-bytes;
    rem    to determine whether there are zero-bytes, let `find` process the file,
    rem    which removes zero-bytes or converts them to line-breaks, so `for /F` can
    rem    read the file;
    rem    however, `find` would read the whole file, hence do that only for small
rem    ones and skip it for large ones, as those most likely contain zero-bytes: */
    if %SIZE% LEQ 8192 (
        (
            rem // In case the file contains line-breaks only, the loop does not iterate:
            for /F delims^=^ eol^= %%L in ('^< "%_FILE%" find /V ""') do (
                rem // Abort reading file after first non-empty line:
                goto :ZERO
            )
        ) || (
            rem /* The loop did not iterate, so the file contains line-breaks only;
            rem    restore the initial code page prior to termination: */
            > nul chcp %$CP%
            >&2 echo The file holds only empty lines, hence encoding cannot be determined!
            exit /B 1
        )
    )
)

rem // This point is reached in case the file contains zero-bytes:
:ZERO
rem // Restore the initial code page prior to termination:
> nul chcp %$CP%
>&2 echo NULL-bytes detected in first line, so file is non-ASCII/ANSI!
exit /B 2

rem // This point is reached in case the file does not contain any zero-bytes:
:NEXT
rem /* Build Byte Order Marks (BOMs) for UTF-16-encoded text (Little Endian and Big Endian)
rem    and for UTF-8-encoded text: */
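rem /* `forfiles` replaces hexadecimal codes like `0xFF` in its /C command string with the
rem    respective raw bytes, so the variables set below receive the literal BOM characters
rem    without having to embed extended characters directly in this batch file: */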
for /F "tokens=1-3" %%A in ('
    forfiles /P "%~dp0." /M "%~nx0" /C "cmd /C echo 0xFF0xFE 0xFE0xFF 0xEF0xBB0xBF"
') do set "$LE=%%A" & set "$BE=%%B" & set "$U8=%%C"

rem /* Reset line string variable, then store first line string (1023 bytes at most);
rem    in contrast to `for /F`, this does not skip over blank lines: */
< "%_FILE%" (set "LINE=" & set /P LINE="")
rem // Check whether the first line of the file begins with any of the BOMs:
if not "%LINE:~,2%"=="%$LE%" if not "%LINE:~,2%"=="%$BE%" if not "%LINE:~,3%"=="%$U8%" goto :CONT
rem /* One of the BOMs has been encountered, hence the file is Unicode-encoded;
rem    restore the initial code page prior to termination: */
> nul chcp %$CP%
>&2 echo BOM encountered in first line, so file is non-ASCII/ANSI!
exit /B 4

rem // This point is reached in case the file does not appear as Unicode-encoded:
:CONT
rem // Restore the initial code page prior to termination:
> nul chcp %$CP%
echo The file appears to be an ASCII-/ANSI-encoded text.

endlocal
exit /B 0
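Here is a minimal usage sketch, assuming the script has been saved as CheckEncoding.bat (an arbitrary name) and that A.txt from the question is the file to examine; the result is reported on the console and also via the exit code (255 = file not found, 1 = empty file or empty lines only, 2 = zero-bytes detected, 4 = BOM detected, 0 = apparently ASCII-/ANSI-encoded):

rem // Example invocation (CheckEncoding.bat is just an assumed file name):
call CheckEncoding.bat "A.txt"
rem // Optionally evaluate the exit code returned by the script:
echo Exit code: %ErrorLevel%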
aschipfl
  • This is excellent example code. +1 However, I am not comfortable saying that if the file is not Unicode encoded that it must be ASCII/ANSI. The result of `chcp` is not a reliable indicator for any given file. The files could be ISO-8859-1, codepage 1254, codepage 949 (Korean), EUC, EBCDIC, etc. – lit Apr 02 '17 at 01:07
  • Thanks, @lit! The code page is changed by `chcp` temporarily in order to correctly build the BOMs... – aschipfl Apr 02 '17 at 09:43
  • That's pretty clever getting `forfiles` to write the binary values. – lit Apr 02 '17 at 19:26
  • Thanks, @lit, that's the only way I know of to get extended characters without having to embed them into the batch file or to use temporary files... – aschipfl Apr 02 '17 at 19:34