56

I have .txt and .java files and I don't know how to determine the encoding of the files (Unicode, UTF-8, ISO-8859, …). Is there any program to determine or display the encoding of a file?

erik
Ballon
  • possible duplicate of [How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8, and ASCII](http://stackoverflow.com/questions/4198804/how-to-reliably-guess-the-encoding-between-macroman-cp1252-latin1-utf-8-and-a) – tchrist Nov 23 '10 at 11:19

7 Answers

62

If you're on Linux, try file -i filename.txt.

$ file -i vol34.tex 
vol34.tex: text/x-tex; charset=us-ascii

For reference, here is my environment:

$ which file
/usr/bin/file
$ file --version
file-5.09
magic file from /etc/magic:/usr/share/misc/magic

Some file versions (e.g. file-5.04 on OS X/macOS) have slightly different command-line switches:

$ file -I vol34.tex 
vol34.tex: text/x-tex; charset=us-ascii
$ file --mime vol34.tex
vol34.tex: text/x-tex; charset=us-ascii


bneely
mpenkov
31

Open the file with Notepad++ and you will see the name of the encoding in the lower-right corner. In the Encoding menu you can also change the encoding and save the file.

Ballon
13

You can't reliably detect the encoding of a text file. What you can do is make an educated guess: search for a non-ASCII character and try to determine whether it belongs to a byte sequence that makes sense in the languages you are parsing.
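
For illustration, here is a minimal Python sketch of such an educated guess, assuming the file fits in memory (the filename is a placeholder). It checks for a byte-order mark first, then tests whether the non-ASCII bytes form valid UTF-8:

import codecs

def guess_encoding(path):
    # An educated guess, not reliable detection.
    with open(path, 'rb') as f:
        data = f.read()

    # A byte-order mark is the one unambiguous hint a text file can carry.
    boms = [(codecs.BOM_UTF8, 'utf-8-sig'),
            (codecs.BOM_UTF16_LE, 'utf-16-le'),
            (codecs.BOM_UTF16_BE, 'utf-16-be')]
    for bom, name in boms:
        if data.startswith(bom):
            return name

    if all(b < 0x80 for b in data):
        return 'ascii'  # no non-ASCII bytes at all

    # Non-ASCII bytes that still decode strictly as UTF-8 are very
    # unlikely to be anything other than UTF-8.
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return 'some 8-bit encoding (latin-1, cp1252, ...)'

print(guess_encoding('example.txt'))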

Nikolaus Gradwohl
4

See this question and the selected answer. There’s no sure-fire way of doing it. At most, you can rule things out. You’re unlikely to get false positives on the UTF encodings, but the 8-bit encodings are tough, especially if you don’t know the starting language. No tool out there currently handles all the common 8-bit encodings from Macs, Windows, and Unix, but the selected answer provides an algorithmic approach that should work adequately for a certain subset of encodings.
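
A rough Python illustration of that rule-things-out approach (the candidate list below is my own assumption, not an exhaustive one): strict decoding either succeeds or raises, so every exception eliminates a candidate.

def viable_encodings(data, candidates=('utf-8', 'utf-16', 'cp1252', 'latin-1')):
    # Return the candidates that decode without error. More than one
    # survivor is normal for 8-bit encodings: latin-1, for instance,
    # accepts every possible byte, so it can never be ruled out this way.
    viable = []
    for enc in candidates:
        try:
            data.decode(enc)  # strict decoding raises on invalid bytes
            viable.append(enc)
        except UnicodeDecodeError:
            pass  # ruled out
    return viable

with open('somefile.txt', 'rb') as f:  # placeholder filename
    print(viable_encodings(f.read()))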

tchrist
1

A text file has no header that records its encoding. You can try the Linux/Unix command file, which tries to guess the encoding:

file -i unreadablefile.txt

or on some systems

file -I unreadablefile.txt

But it often reports text/plain; charset=iso-8859-1 even though the file is unreadable (cryptic glyphs).

Here is what I did to find the correct encoding for an unreadable file and then transcode it to UTF-8, after installing iconv. First I tried all encodings, displaying (via grep) any line that contained the word www. (a website address):

for ENCODING in $(iconv -l); do
    echo -n "$ENCODING "
    iconv -f "$ENCODING" -t utf-8 unreadablefile.txt 2>/dev/null | grep 'www'
done | less

This command line shows each tested encoding followed by the translated/transcoded line.

Some lines showed readable and consistent results (one language at a time). I tried some of them manually, for example:

ENCODING=WINDOWS-936; iconv -f $ENCODING -t utf-8 unreadablefile.txt -o test_with_${ENCODING}.txt

In my case it was a Chinese Windows encoding, and the file is now readable (if you know Chinese).
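
The same brute-force idea works in Python if iconv is not at hand: decode the bytes with a list of candidate codecs and keep the ones whose output contains a string you know is in the file. The codec list here is an assumption; Python's gbk codec covers WINDOWS-936.

candidates = ['utf-8', 'utf-16', 'gbk', 'big5', 'shift_jis',
              'iso-8859-1', 'cp1252', 'koi8-r']

with open('unreadablefile.txt', 'rb') as f:
    raw = f.read()

for enc in candidates:
    try:
        text = raw.decode(enc)
    except UnicodeDecodeError:
        continue  # codec rejected the bytes, rule it out
    for line in text.splitlines():
        if 'www' in line:
            print(enc, line)  # inspect by eye, as with iconv | grep
            break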

erik
0

Does there exist any program to determine the file encoding or to see the encoding?

This question is 10 years old as I write this, and the answer is still, "No" - at least not reliably. Unfortunately, there has not been much improvement. My recent experience suggests the file -I command is very much hit-or-miss. For example, when checking a text file on macOS 10.15.6:

% file -I somefile.asc
somefile.asc: application/octet-stream; charset=binary

somefile.asc was a text file. All characters in it were encoded in UTF-16 Little Endian. How did I know this? I used BBEdit - a competent text editor. Determining the encoding used in a file is certainly a tough problem, but...?
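
For what it's worth, UTF-16 LE is one of the easier cases to confirm by hand in Python: mostly-ASCII text encoded as UTF-16 LE typically starts with the FF FE byte-order mark and has a NUL as every second byte. (The filename is the one from above.)

with open('somefile.asc', 'rb') as f:
    head = f.read(64)

print(head.startswith(b'\xff\xfe'))          # UTF-16 LE byte-order mark?
print(head[1::2].count(0) > len(head) // 4)  # NULs in most odd positions?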

0

If you are using Python, the chardet package is a good option. For example:

from chardet.universaldetector import UniversalDetector

files = ['a-1.txt', 'a-2.txt']

detector = UniversalDetector()
for filename in files:
    print(filename.ljust(20), end='')
    detector.reset()                 # reuse the detector between files
    with open(filename, 'rb') as f:  # feed raw bytes, not decoded text
        for line in f:
            detector.feed(line)
            if detector.done:        # stop once confidence is high enough
                break
    detector.close()                 # finalize before reading .result
    print(detector.result)

gives me as a result:

a-1.txt   {'encoding': 'Windows-1252', 'confidence': 0.7255358182877111, 'language': ''}
a-2.txt   {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}