56

I have .txt and .java files and I don't know how to determine the encoding of the files (Unicode, UTF-8, ISO-8859, …). Is there any program to determine or display the encoding of a file?

erik
Ballon
  • possible duplicate of [How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8, and ASCII](http://stackoverflow.com/questions/4198804/how-to-reliably-guess-the-encoding-between-macroman-cp1252-latin1-utf-8-and-a) – tchrist Nov 23 '10 at 11:19

7 Answers

62

If you're on Linux, try file -i filename.txt.

$ file -i vol34.tex 
vol34.tex: text/x-tex; charset=us-ascii

For reference, here is my environment:

$ which file
/usr/bin/file
$ file --version
file-5.09
magic file from /etc/magic:/usr/share/misc/magic

Some file versions (e.g. file-5.04 on OS X/macOS) have slightly different command-line switches:

$ file -I vol34.tex 
vol34.tex: text/x-tex; charset=us-ascii
$ file --mime vol34.tex
vol34.tex: text/x-tex; charset=us-ascii


bneely
mpenkov
31

Open the file with Notepad++ and you will see the name of the encoding in the lower-right corner. In the Encoding menu you can also change the encoding and save the file.

Ballon
13

You can't reliably detect the encoding of a text file. What you can do is make an educated guess: search for a non-ASCII character and try to determine whether it belongs to a byte sequence that makes sense in the languages you are parsing.
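
For illustration, here is a minimal Python sketch of such an educated guess, assuming the file fits in memory (the filename is a placeholder). It checks for a byte-order mark first, then tests whether the non-ASCII bytes form valid UTF-8:

import codecs

def guess_encoding(path):
    # An educated guess, not reliable detection.
    with open(path, 'rb') as f:
        data = f.read()

    # A byte-order mark is the one unambiguous hint a text file can carry.
    boms = [(codecs.BOM_UTF8, 'utf-8-sig'),
            (codecs.BOM_UTF16_LE, 'utf-16-le'),
            (codecs.BOM_UTF16_BE, 'utf-16-be')]
    for bom, name in boms:
        if data.startswith(bom):
            return name

    if all(b < 0x80 for b in data):
        return 'ascii'  # no non-ASCII bytes at all

    # Non-ASCII bytes that still decode strictly as UTF-8 are very
    # unlikely to be anything other than UTF-8.
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return 'some 8-bit encoding (latin-1, cp1252, ...)'

print(guess_encoding('example.txt'))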

Nikolaus Gradwohl
4

See this question and the selected answer. There’s no sure-fire way of doing it. At most, you can rule things out. You’re unlikely to get false positives on the UTF encodings, but the 8-bit encodings are tough, especially if you don’t know the starting language. No tool out there currently handles all the common 8-bit encodings from Macs, Windows, and Unix, but the selected answer provides an algorithmic approach that should work adequately for a certain subset of encodings.
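
A rough Python illustration of that rule-things-out approach (the candidate list below is my own assumption, not an exhaustive one): strict decoding either succeeds or raises, so every exception eliminates a candidate.

def viable_encodings(data, candidates=('utf-8', 'utf-16', 'cp1252', 'latin-1')):
    # Return the candidates that decode without error. More than one
    # survivor is normal for 8-bit encodings: latin-1, for instance,
    # accepts every possible byte, so it can never be ruled out this way.
    viable = []
    for enc in candidates:
        try:
            data.decode(enc)  # strict decoding raises on invalid bytes
            viable.append(enc)
        except UnicodeDecodeError:
            pass  # ruled out
    return viable

with open('somefile.txt', 'rb') as f:  # placeholder filename
    print(viable_encodings(f.read()))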

tchrist
1

A text file has no header that records its encoding. You can try the Linux/Unix command file, which tries to guess the encoding:

file -i unreadablefile.txt

or on some systems

file -I unreadablefile.txt

But it often reports text/plain; charset=iso-8859-1 even though the file is unreadable (cryptic glyphs).

Here is what I did to find the correct encoding for an unreadable file and then transcode it to UTF-8, after installing iconv. First I tried all encodings, displaying (via grep) any line that contained the word www. (a website address):

for ENCODING in $(iconv -l); do
    echo -n "$ENCODING "
    iconv -f "$ENCODING" -t utf-8 unreadablefile.txt 2>/dev/null | grep 'www'
done | less

This command line shows each tested encoding followed by the translated/transcoded line.

Some lines showed readable and consistent results (one language at a time). I tried some of them manually, for example:

ENCODING=WINDOWS-936; iconv -f $ENCODING -t utf-8 unreadablefile.txt -o test_with_${ENCODING}.txt

In my case it was a Chinese Windows encoding, and the file is now readable (if you know Chinese).
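
The same brute-force idea works in Python if iconv is not at hand: decode the bytes with a list of candidate codecs and keep the ones whose output contains a string you know is in the file. The codec list here is an assumption; Python's gbk codec covers WINDOWS-936.

candidates = ['utf-8', 'utf-16', 'gbk', 'big5', 'shift_jis',
              'iso-8859-1', 'cp1252', 'koi8-r']

with open('unreadablefile.txt', 'rb') as f:
    raw = f.read()

for enc in candidates:
    try:
        text = raw.decode(enc)
    except UnicodeDecodeError:
        continue  # codec rejected the bytes, rule it out
    for line in text.splitlines():
        if 'www' in line:
            print(enc, line)  # inspect by eye, as with iconv | grep
            break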

erik
0

Does there exist any program to determine the file encoding or to see the encoding?

This question is 10 years old as I write this, and the answer is still, "No" - at least not reliably. Unfortunately, there has not been much improvement. My recent experience suggests the file -I command is very much hit-or-miss. For example, when checking a text file on macOS 10.15.6:

% file -I somefile.asc
somefile.asc: application/octet-stream; charset=binary

somefile.asc was a text file. All characters in it were encoded in UTF-16 Little Endian. How did I know this? I used BBEdit - a competent text editor. Determining the encoding used in a file is certainly a tough problem, but...?
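
For what it's worth, UTF-16 LE is one of the easier cases to confirm by hand in Python: mostly-ASCII text encoded as UTF-16 LE typically starts with the FF FE byte-order mark and has a NUL as every second byte. (The filename is the one from above.)

with open('somefile.asc', 'rb') as f:
    head = f.read(64)

print(head.startswith(b'\xff\xfe'))          # UTF-16 LE byte-order mark?
print(head[1::2].count(0) > len(head) // 4)  # NULs in most odd positions?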

0

If you are using Python, the chardet package is a good option. For example:

from chardet.universaldetector import UniversalDetector

files = ['a-1.txt', 'a-2.txt']

detector = UniversalDetector()
for filename in files:
    print(filename.ljust(20), end='')
    detector.reset()                 # reuse the detector between files
    with open(filename, 'rb') as f:  # feed raw bytes, not decoded text
        for line in f:
            detector.feed(line)
            if detector.done:        # stop once confidence is high enough
                break
    detector.close()                 # finalize before reading .result
    print(detector.result)

gives me as a result:

a-1.txt   {'encoding': 'Windows-1252', 'confidence': 0.7255358182877111, 'language': ''}
a-2.txt   {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}