I have .txt and .java files and I don't know how to determine the encoding of the files (Unicode, UTF-8, ISO-8859, …). Is there a program that can determine or display a file's encoding?
possible duplicate of [How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8, and ASCII](http://stackoverflow.com/questions/4198804/how-to-reliably-guess-the-encoding-between-macroman-cp1252-latin1-utf-8-and-a) – tchrist Nov 23 '10 at 11:19
7 Answers
If you're on Linux, try file -i filename.txt.
$ file -i vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
For reference, here is my environment:
$ which file
/usr/bin/file
$ file --version
file-5.09
magic file from /etc/magic:/usr/share/misc/magic
Some file versions (e.g. file-5.04 on OS X/macOS) have slightly different command-line switches:
$ file -I vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
$ file --mime vol34.tex
vol34.tex: text/x-tex; charset=us-ascii
Also, have a look here.
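If you want to call file from a script, a minimal sketch could look like the following. It assumes the file utility is on the PATH and prints the same "name: type; charset=..." line shown above; the helper names parse_charset and guess_charset are made up for this illustration.

```python
import subprocess
from typing import Optional


def parse_charset(file_output: str) -> Optional[str]:
    """Extract the charset from a line such as
    'vol34.tex: text/x-tex; charset=us-ascii'."""
    # rpartition picks the last 'charset=' in case the file name contains one
    _, found, rest = file_output.rpartition("charset=")
    rest = rest.strip()
    if not found or not rest:
        return None
    return rest.split()[0]


def guess_charset(path: str) -> Optional[str]:
    """Run `file --mime` on path and return the reported charset.
    Assumes the `file` command is installed; the result is only a guess."""
    result = subprocess.run(
        ["file", "--mime", path],
        capture_output=True, text=True, check=True,
    )
    return parse_charset(result.stdout)


print(parse_charset("vol34.tex: text/x-tex; charset=us-ascii"))  # us-ascii
```

On macOS versions where --mime is not accepted, the argument list would need to be adjusted to the switch your file version supports.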
Open the file with Notepad++; the encoding name is shown in the bottom-right corner of the window. From the Encoding menu you can change the encoding and save the file.

You can't reliably detect the encoding of a text file. What you can do is make an educated guess by searching for non-ASCII characters and checking whether, under a given encoding, they decode to characters that make sense in the languages you expect.
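As a rough sketch of that kind of educated guess, the function below decodes the bytes with a candidate encoding and reports what fraction of the resulting non-ASCII characters belong to a set you expect for the language. The function name, the German character set, and the sample text are all assumptions for illustration; a real heuristic would need far more than this.

```python
def plausibility(data: bytes, encoding: str, expected_chars: str) -> float:
    """Fraction of decoded non-ASCII characters found in expected_chars.
    Returns 0.0 if the bytes cannot be decoded with this encoding,
    and 1.0 if there are no non-ASCII characters to contradict it."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return 0.0
    non_ascii = [ch for ch in text if ord(ch) > 127]
    if not non_ascii:
        return 1.0
    hits = sum(1 for ch in non_ascii if ch in expected_chars)
    return hits / len(non_ascii)


GERMAN = "äöüÄÖÜß"  # hypothetical "expected" characters for German text
sample = "Müller kauft Käse".encode("latin-1")

for enc in ("utf-8", "latin-1", "cp437"):
    print(enc.ljust(10), plausibility(sample, enc, GERMAN))
```

A high score does not prove the encoding is right; several 8-bit encodings can score equally well on the same bytes.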

See this question and the selected answer. There’s no sure-fire way of doing it. At most, you can rule things out. The UTF encodings you’re unlikely to get false positives on, but the 8-bit encodings are tough, especially if you don’t know the starting language. No tool out there currently handles all the common 8-bit encodings from Macs, Windows, Unix, but the selected answer provides an algorithmic approach that should work adequately for a certain subset of encodings.
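To illustrate "ruling things out", here is a small sketch that keeps only the candidate encodings under which the bytes decode without error. The function name and the candidate list are assumptions chosen for the example, not a recommended set.

```python
def surviving_encodings(data: bytes, candidates):
    """Return the candidates under which data decodes without error."""
    survivors = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid sequences
        except UnicodeDecodeError:
            continue  # ruled out
        survivors.append(name)
    return survivors


sample = "naïve café".encode("cp1252")
candidates = ["ascii", "utf-8", "cp1252", "latin-1", "mac_roman"]
print(surviving_encodings(sample, candidates))
```

ASCII and UTF-8 are eliminated for this sample, but the 8-bit encodings all survive, because they accept almost any byte sequence. That is why strict decoding alone cannot choose between them.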
A text file has no header that records its encoding. You can try the Linux/Unix command file, which tries to guess the encoding:
file -i unreadablefile.txt
or on some systems
file -I unreadablefile.txt
But that often gives you text/plain; charset=iso-8859-1 although the file is unreadable (cryptic glyphs).
This is what I did to find the correct encoding for an unreadable file and then convert it to UTF-8, after installing iconv. First I tried all encodings, using grep to display a line that contained the string www. (a website address):
for ENCODING in $(iconv -l); do echo -n "$ENCODING "; iconv -f $ENCODING -t utf-8 unreadablefile.txt 2>/dev/null| grep 'www'; done | less
This command line prints each tested encoding followed by the transcoded line.
Some encodings produced readable, consistent results (one language at a time). I tried a few of them manually, for example:
ENCODING=WINDOWS-936; iconv -f $ENCODING -t utf-8 unreadablefile.txt -o test_with_${ENCODING}.txt
In my case it was a Chinese Windows encoding, and the file is now readable (if you know Chinese).
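The same try-everything idea can be sketched in Python for systems without iconv. This is only an approximation of the shell loop above: it decodes strictly and skips encodings that fail, and the function name, candidate list, and sample text are invented for the illustration. Python's gbk codec is used here as the counterpart of Windows code page 936.

```python
def lines_with_marker(data: bytes, candidates, marker: str = "www"):
    """Decode data with each candidate encoding and return, per encoding,
    the first decoded line that contains marker."""
    results = {}
    for name in candidates:
        try:
            text = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # invalid for this encoding, or unknown to Python
        hits = [line for line in text.splitlines() if marker in line]
        if hits:
            results[name] = hits[0]
    return results


original = "访问 www.example.com 网站"
data = original.encode("gbk")  # stand-in for the unreadable file's bytes

for name, line in lines_with_marker(data, ["utf-8", "latin-1", "gbk"]).items():
    print(name.ljust(10), line)
```

As with the iconv loop, the output still has to be judged by eye: a wrong 8-bit encoding decodes without error and simply shows garbage around the marker.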

Does there exist any program to determine the file encoding or to see the encoding?
This question is 10 years old as I write this, and the answer is still "No", at least not reliably. Unfortunately, there has not been much improvement. My recent experience suggests the file -I command is very much "hit-or-miss". For example, when checking a text file on macOS 10.15.6:
% file -I somefile.asc
somefile.asc: application/octet-stream; charset=binary
somefile.asc was a text file. All characters in it were encoded in UTF-16 Little Endian. How did I know this? I used BBEdit, a competent text editor. Determining the encoding used in a file is certainly a tough problem, but...?
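One cheap check that sometimes catches UTF-16 files is to look for a byte order mark (BOM) at the start. The sketch below is an assumption-laden illustration: it only helps when the file actually begins with a BOM (nothing above says whether somefile.asc did), and the function name sniff_bom is made up.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


with_bom = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")
print(sniff_bom(with_bom))                    # utf-16-le
print(sniff_bom("hello".encode("utf-16-le")))  # None (no BOM present)
```

To use it on a file, read the first few bytes in binary mode and pass them in. A result of None says nothing either way, since many UTF-16 and most UTF-8 files carry no BOM.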
If you are using Python, the chardet package is a good option. For example:
from chardet.universaldetector import UniversalDetector

files = ['a-1.txt', 'a-2.txt']
detector = UniversalDetector()
for filename in files:
    print(filename.ljust(20), end='')
    detector.reset()
    # Feed raw bytes line by line until the detector is confident enough
    with open(filename, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(detector.result)
gives me as a result:
a-1.txt {'encoding': 'Windows-1252', 'confidence': 0.7255358182877111, 'language': ''}
a-2.txt {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}