418

I need to find the encoding of all files that are placed in a directory.
Is there a way to find the encoding used?

The file command is not able to do this.

The encoding of interest to me is ISO 8859-1.
If the encoding is anything else, I want to move the file to another directory.

Henke
Manglu
  • 1
    If you have an idea of what kind of scripting language you might want to use, tag your question with the name of that language. That might help... – Tyler Apr 30 '09 at 05:35
  • 1
    Or maybe he's just trying to build a shell script? – Shalom Craimer Apr 30 '09 at 05:42
  • 2
    Which would be an answer to “which scripting language”. – bignose Apr 30 '09 at 06:10
  • Sorry, I did not make myself clear. I was looking at building a shell script, as scraimer mentioned. I shall make myself clearer henceforth. Thanks, Manglu – Manglu May 01 '09 at 00:05
  • 15
    Maybe not related to this answer, but a tip in general: when you can describe your entire doubt in one word ("encoding", here), just do `apropos encoding`. It searches the titles and descriptions of all the manpages. When I do this on my machine, I see 3 tools that might help me, judging by their descriptions: `chardet`, `chardet3`, `chardetect3`. Then doing `man chardet` and reading the manpage tells me that `chardet` is just the utility I need. – John Red May 18 '16 at 11:37
  • 2
    The encoding might change when you change the content of a file. E.g. in vi, when you write a simple C program, it's probably `us-ascii`, but after adding a line of Chinese comments, it becomes `utf-8`. `file` can tell the encoding by reading the file content and guessing. – Eric Sep 03 '16 at 17:17
  • I just ran chardetect, chardet3, chardetect3 and uchardet on a file that I explicitly saved as UTF-8. They keep telling me the encoding was "windows-1252 with confidence 0.4641618497109827". file and enca correctly tell me it's UTF-8. Maybe you need to tell chardet what encoding to try, but I wasn't able to figure out how to do that, because the "--help" output of chardet is garbage. – Algoman May 03 '18 at 08:27

19 Answers

554

It sounds like you're looking for enca. It can guess and even convert between encodings. Just look at the man page.

Or, failing that, use file -i (Linux) or file -I (OS X). That will output MIME-type information for the file, which will also include the character-set encoding. I found a man-page for it, too :)
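
For the original task of moving anything that is not ISO 8859-1 into another directory, here is a minimal sketch built on file -b --mime-encoding (the destination directory name is just an example):

mkdir -p ../not-iso-8859-1
for f in *; do
  [ -f "$f" ] || continue
  enc=$(file -b --mime-encoding "$f")
  [ "$enc" != "iso-8859-1" ] && mv "$f" ../not-iso-8859-1/
done

Note that files containing only 7-bit characters are reported as us-ascii, not iso-8859-1, even though ASCII is a subset of ISO 8859-1, so you may want to treat us-ascii as acceptable as well.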

Peter Mortensen
Shalom Craimer
  • That doesn't appear to support 8859-1 (just from a cursory glance at that man page). – paxdiablo Apr 30 '09 at 05:47
  • 1
    According to the man page, it knows about the ISO 8559 set. Perhaps read a little less cursorily :-) – bignose Apr 30 '09 at 06:12
  • 1
    8859-2,4,5,13 and 16, no mention of 8859-1. The glyphs above 0x7f are very different between the -1 and -2 variants. – paxdiablo Apr 30 '09 at 06:21
  • Hi, I work in an AIX environment and enca does not seem to exist in this environment. Thanks, Manglu – Manglu May 01 '09 at 00:42
  • 10
    Enca sounds interesting. Unfortunately, detection seems to be very language-dependent, and the set of supported languages is not very big. Mine (de) is missing :-( Anyway, cool tool. – er4z0r Apr 05 '10 at 12:22
  • It doesn't support the Japanese encodings. Running ``enca --list languages`` shows mostly Russian and East European (plus a few Chinese) encodings. Oh well! It's back to Google then. – GuruM Jun 19 '12 at 14:11
  • 1
    Good post on tools like [enca, enconv, convmv](http://osmanov-dev-notes.blogspot.in/2010/07/how-to-handle-filename-encodings-in.html?showComment=1340093042160#c8103931008301575520) – GuruM Jun 19 '12 at 14:18
  • 9
    `enca` appears to be completely useless for analyzing a file written in English, but if you happen to be looking at something in Estonian, it might solve all your problems. Very helpful tool, that... – cbmanica Apr 16 '13 at 20:36
  • 1
    Use `file -b --mime-encoding YOURFILE` to print encoding only – Ivan Chau Sep 04 '13 at 08:02
  • 1
    I use this command to test if it is utf-8: `if file --brief --mime-encoding YOURFILE | grep -q "utf-8"; then echo "I am utf-8 encoded"; else echo "I am different"; fi` Thank you @IvanChau – Petr Újezdský Jul 30 '15 at 14:24
  • my index.php is 100% utf-8 but `file -i` shows `us-ascii` – vladkras Nov 21 '16 at 11:42
  • 9
    @vladkras if there are no non-ascii chars in your utf-8 file, then it's indistinguishable from ascii :) – vadipp Mar 02 '17 at 10:36
  • @vadipp even if I create new utf-8 file with `'abc'` inside it shows `myfile.txt: text/plain; charset=us-ascii` – vladkras Mar 02 '17 at 12:20
  • 1
    @vladkras that's what I said: `a`, `b` and `c` are all ascii chars, and their utf-8 representation is the same as ascii. If you are curious, go read how utf-8 encoding works. – vadipp Mar 02 '17 at 17:17
  • @vadipp I missed the double `no` in your comment. Anyway, PhpStorm detects such files as non-UTF-8, while `file -i` shows ASCII. – vladkras Mar 03 '17 at 15:46
113
file -bi <file name>

If you'd like to do this for a bunch of files:

for f in `find | egrep -v Eliminate`; do echo "$f" ' -- ' `file -bi "$f"` ; done
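
If the file names may contain spaces or other special characters, a null-delimited variant is more robust (a sketch; `Eliminate` again stands for whatever pattern you want to exclude):

find . -type f ! -path '*Eliminate*' -print0 | while IFS= read -r -d '' f; do
  echo "$f -- $(file -bi "$f")"
done
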
madu
  • However, if the file is an XML file with the attribute encoding='iso-8859-1' in the XML declaration, the file command will say it's an ISO file, even if the true encoding is UTF-8... – Per Sep 11 '12 at 07:37
  • 6
    Why do you use the -b argument? If you just do file -i * it outputs the guessed charset for every file. – Dr. Hans-Peter Störr Jun 26 '13 at 10:28
  • 4
    I was curious about the -b argument too. The man page says it means "brief" `Do not prepend filenames to output lines` – craq Jan 12 '16 at 16:47
  • 6
    There's no need to parse file output, `file -b --mime-encoding` outputs just the charset encoding – jesjimher Apr 18 '18 at 12:49
  • 2
    all I get is "regular file" as output when executing this – Robert Sinclair Dec 28 '19 at 08:22
62

uchardet - An encoding detector library ported from Mozilla.

Usage:

~> uchardet file.java
UTF-8

Various Linux distributions (Debian, Ubuntu, openSUSE, Arch Linux, etc.) provide binaries.
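
To run it over many files at once, a small sketch (assuming uchardet is on your PATH):

for f in *.java; do printf '%s: ' "$f"; uchardet "$f"; done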

qwert2003
  • 2
    Thanks! I'm not delighted about yet more packages, yet `sudo apt-get install uchardet` is so easy that I decided not to worry about it... – sage Mar 09 '16 at 21:33
  • As I just said in a comment above: uchardet falsely tells me the encoding of a file was "windows-1252", although I explicitly saved that file as UTF-8. uchardet doesn't even say "with confidence 0.4641618497109827" which would at least give you a hint that it's telling you complete nonsense. file, enca and encguess worked correctly. – Algoman May 03 '18 at 08:40
  • `uchardet` has a big advantage over `file` and `enca`, in that it analyses the whole file (just tried with a 20GiB file) as opposed to only the beginning. – tuxayo Jan 20 '20 at 02:06
32

On Debian you can also use encguess:

$ encguess test.txt
test.txt  US-ASCII

As it is a Perl script, it can be installed on most systems by installing Perl, or dropped in as a standalone script if Perl is already installed.

$ dpkg -S /usr/bin/encguess
perl: /usr/bin/encguess
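
According to its man page, encguess also accepts a -s switch with a comma-separated list of suspect encodings to try, for example:

encguess -s ISO-8859-1,UTF-8,UTF-16 *.txt
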
not2qubit
  • 2
    I installed `uchardet` in Ubuntu and it told me that my file was `WINDOWS-1252`. I know this was wrong because I saved it as UTF-16 with Kate, to test. However, `encguess` guess correctly, and it was pre-installed in Ubuntu 19.04. – Nagev Jun 11 '19 at 11:32
  • 1
    Excellent, works perfectly. I'll add one little tip: in Ubuntu/Debian, encguess is inside the perl package. If you have this package installed and it doesn't work, try `/usr/bin/encguess`. – NetVicious Jun 18 '21 at 09:56
  • 1
    `encguess` is also available via `git-bash` on `windows` as well – MD. Mohiuddin Ahmed Aug 12 '21 at 10:00
11

Here is an example script using file -I and iconv which works on Mac OS X.

For your question, you need to use mv instead of iconv:

#!/bin/bash
# 2016-02-08
# Check the encoding and convert files
for f in *.java
do
  # Quote "$f" everywhere so file names with spaces survive
  encoding=`file -I "$f" | cut -f 2 -d";" | cut -f 2 -d=`
  case $encoding in
    iso-8859-1)
      iconv -f iso8859-1 -t utf-8 "$f" > "$f.utf8"
      mv "$f.utf8" "$f"
      ;;
  esac
done
Peter Mortensen
Wolfgang Fahl
  • 11
    `file -b --mime-encoding` outputs just the charset, so you can avoid all pipe processing – jesjimher Apr 18 '18 at 12:50
  • 1
    Thx. As pointed out on MacOS this won't work: file -b --mime-encoding Usage: file [-bchikLNnprsvz0] [-e test] [-f namefile] [-F separator] [-m magicfiles] [-M magicfiles] file... file -C -m magicfiles Try `file --help' for more information. – Wolfgang Fahl Apr 19 '18 at 05:07
11

To convert encoding from ISO 8859-1 to ASCII:

iconv -f ISO_8859-1 -t ASCII filename.txt
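
Note that characters above 0x7f have no ASCII equivalent, so a plain conversion stops with an error when it hits one. With GNU iconv you can transliterate them to approximations, or drop them, instead (a sketch):

iconv -f ISO_8859-1 -t ASCII//TRANSLIT filename.txt
iconv -f ISO_8859-1 -t ASCII//IGNORE filename.txt
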
Peter Mortensen
fimbulwinter
7

With this command:

for f in `find .`; do echo `file -i "$f"`; done

you can list all files in a directory and its subdirectories, together with the corresponding encoding.

If files have a space in the name, use:

IFS=$'\n'
for f in `find .`; do echo `file -i "$f"`; done

Remember that it changes how your current Bash session splits words on whitespace; if that is a concern, save the old value first and restore it afterwards, as in the sketch below.
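
A minimal sketch of that save-and-restore approach:

OLDIFS=$IFS
IFS=$'\n'
for f in `find .`; do echo `file -i "$f"`; done
IFS=$OLDIFS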

Peter Mortensen
danilo
  • hello, the script fails when filename has space, anyway to fix that? – jerry Mar 06 '21 at 09:52
  • 1
    yes, you should use IFS (Internal Field Separator ) type `IFS=$'\n'` before using the script: https://askubuntu.com/a/344418/734218 – danilo Mar 06 '21 at 14:45
  • Clean and simple, thanks a lot! – SRG Feb 02 '23 at 15:38
  • See https://mywiki.wooledge.org/BashFAQ/020 for a more thorough exposition of how to work with file names with spaces. This answer would benefit from refactoring it for robustness; `find . -exec file -i {} +` – tripleee Mar 24 '23 at 17:43
  • Note that using `find ... -exec` can eat error codes on you, so if you plan to do anything complex per result found, make sure you put in `|| { error ; handling ; }` that lets you notice fatal errors in your -exec script body. I prefer piping the results of find into a while loop, so I can better handle errors / use `set -o errexit` / `set -e`. – Ajax Jul 10 '23 at 23:30
7

It is really hard to determine if a file is ISO 8859-1. If you have a text with only 7-bit characters, it could be ISO 8859-1, but you don't know. If you have 8-bit characters, then the upper-region characters exist in other encodings as well. Therefore you would have to use a dictionary to get a better guess of which word is meant, and determine from there which letter it must be. Finally, if you detect that the file might be UTF-8, then you can be sure it is not ISO 8859-1.

Detecting an encoding is one of the hardest things to do, because if nothing tells you the encoding explicitly, you can never know for sure.
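
Following the UTF-8 point above, one cheap and reliable check is to use iconv itself as a UTF-8 validator (a sketch; note that a pure-ASCII file passes this test too):

if iconv -f UTF-8 -t UTF-8 file.txt > /dev/null 2>&1; then
  echo "valid UTF-8, so most likely not ISO 8859-1"
else
  echo "not valid UTF-8; could be ISO 8859-1 or another 8-bit encoding"
fi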

Peter Mortensen
Norbert Hartl
  • It may help to try brute force. The following command will try to convert from all encoding formats with names that start with WIN or ISO into UTF-8. Then one would need to manually check the output, searching for a clue to the right encoding. Of course, you can change the filtered formats, replacing ISO or WIN with something appropriate, or remove the filter by removing the grep command. for i in $(iconv -l | tail -n +2 | grep "\(^ISO\|^WIN\)" | sed -e 's/\/\///'); do echo $i; iconv -f $i -t UTF8 santos ; done; – ndvo Jan 16 '20 at 21:21
6

With Python, you can use the chardet module.
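
A command-line sketch (assuming the module has been installed, e.g. with pip install chardet):

python3 -c 'import sys, chardet; print(chardet.detect(open(sys.argv[1], "rb").read()))' myfile.txt

This prints a dictionary with the guessed encoding and a confidence value. The package also installs a chardetect command-line tool that does the same thing.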

Peter Mortensen
fccoelho
  • chardet reports "None", chardet3 chokes on the first line of the file in the **exact** same way that my python script does. – Joels Elf May 30 '16 at 02:37
4

In PHP you can check it like below:

Specifying the encoding list explicitly:

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"

More accurate "mb_list_encodings":

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"

In the first example, you can see that I used a list of encodings (checked in list order) that might match. To get a more accurate result, you can use all possible encodings via mb_list_encodings().

Note the mb_* functions require php-mbstring:

apt-get install php-mbstring
Peter Mortensen
Mohamed23gharbi
  • `php7.4-cli` is also needed to have `php` on command-line. [Try this](https://stackoverflow.com/questions/68427225/attempting-to-run-php-returns-php-symbol-lookup-error-php-undefined-symbol-p#comment123791737_68427284) if you still have the pcre2_set_depth_limit_8 error – Pablo Bianchi Sep 23 '22 at 05:56
  • @PabloBianchi Yes, the PHP CLI is required, and I supposed it would already be installed in order to run php -r. But with only PHP installed you will not be able to use mb_detect_encoding without installing php-mbstring. Anyway, you are right: as the question is generic, I'll edit the answer to add the PHP install. – Mohamed23gharbi Sep 23 '22 at 09:51
3

This is not something you can do in a foolproof way. One possibility would be to examine every character in the file to ensure that it doesn't contain any characters in the ranges 0x00 - 0x1f or 0x7f - 0x9f, but, as I said, this may be true for any number of encodings, including at least one other variant of ISO 8859.

Another possibility is to look for specific words in the file in all of the languages supported and see if you can find them.

So, for example, find the equivalent of the English "and", "but", "to", "of" and so on in all the supported languages of ISO 8859-1 and see if they have a large number of occurrences within the file.

I'm not talking about literal translation such as:

English   French
-------   ------
of        de, du
and       et
the       le, la, les

although that's possible. I'm talking about common words in the target language (for all I know, Icelandic has no word for "and" - you'd probably have to use their word for "fish" [sorry that's a little stereotypical. I didn't mean any offense, just illustrating a point]).
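
As a rough sketch of that idea, counting hits for a few common French words (the word list here is only illustrative):

grep -o -w -i -E 'de|du|et|le|la|les' file.txt | wc -l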

Peter Mortensen
paxdiablo
2

I know you're interested in a more general answer, but what's good in ASCII is usually good in other encodings. Here is a Python one-liner to determine if standard input is ASCII (tested on Python 3; Python 2 handles bytes differently, so don't count on it there):

python -c 'from sys import exit,stdin;exit() if 128>max(c for l in open(stdin.fileno(),"rb") for c in l) else exit("Not ASCII")' < myfile.txt
wkschwartz
2

If you're talking about XML files (ISO-8859-1), the XML declaration inside them specifies the encoding: <?xml version="1.0" encoding="ISO-8859-1" ?>. So, you can use regular expressions (e.g., with Perl) to check every file for such a specification.
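
A sketch of that check without Perl, listing the XML files in the current directory whose declaration claims ISO 8859-1:

grep -il 'encoding="iso-8859-1"' *.xml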

More information can be found here: How to Determine Text File Encoding.

Peter Mortensen
evgeny9
  • well that line could be copy-n-pasted by someone who doesn't know what encoding he's using. – Algoman May 03 '18 at 08:38
  • Word of caution, nothing about the declaration at the top guarantees the file ACTUALLY is encoded that way. If you really, really care about the encoding you need to validate it yourself. – Jazzepi Aug 19 '19 at 13:24
1

I am using the following script to

  1. Find all files that match FILTER with SRC_ENCODING
  2. Create a backup of them
  3. Convert them to DST_ENCODING
  4. (optional) Remove the backups

 

#!/bin/bash -xe

SRC_ENCODING="iso-8859-1"
DST_ENCODING="utf-8"
FILTER="*.java"

echo "Finding all files that match the filter $FILTER and the encoding $SRC_ENCODING"
# Pipe into a while-read loop instead of word-splitting a variable,
# so file names with spaces survive
find . -iname "$FILTER" -exec file -i {} \; | grep "$SRC_ENCODING" | grep -Eo '^.*\.java' |
while IFS= read -r FILE ; do
    ORIGINAL_FILE="$FILE.$SRC_ENCODING.bkp"
    echo "Backing up the original file to $ORIGINAL_FILE"
    mv "$FILE" "$ORIGINAL_FILE"

    echo "Converting $FILE from $SRC_ENCODING to $DST_ENCODING"
    iconv -f "$SRC_ENCODING" -t "$DST_ENCODING" "$ORIGINAL_FILE" -o "$FILE"
done

echo "Deleting the backups"
find . -iname "*.$SRC_ENCODING.bkp" -exec rm {} \;
Peter Mortensen
Matyas
1

I was working on a project that required cross-platform support, and I encountered many problems related to file encodings.

I made this script to convert everything to UTF-8:

#!/bin/bash
## Retrieve the encoding of files and convert them
for f in `find "$1" -regextype posix-egrep -regex ".*\.(cpp|h)$"`; do
  echo "file: $f"
  ## Read the entire file and get the encoding
  bytes_to_scan=$(wc -c < "$f")
  encoding=`file -b --mime-encoding -P bytes=$bytes_to_scan "$f"`
  case $encoding in
    iso-8859-1 | euc-kr)
      ## Convert from whichever encoding was detected
      iconv -f "$encoding" -t utf-8 "$f" > "$f.utf8"
      mv "$f.utf8" "$f"
      ;;
  esac
done

I used a hack to read the entire file and estimate the file encoding using file -b --mime-encoding -P bytes=$bytes_to_scan $f

Teocci
0

In Cygwin, this looks like it works for me:

find -type f -name "<FILENAME_GLOB>" | while read <VAR>; do (file -i "$<VAR>"); done

Example:

find -type f -name "*.txt" | while read file; do (file -i "$file"); done

You could pipe that to AWK and create an iconv command to convert everything to UTF-8, from any source encoding supported by iconv.

Example:

find -type f -name "*.txt" | while read file; do (file -i "$file"); done | awk -F[:=] '{print "iconv -f "$3" -t utf8 \""$1"\" > \""$1"_utf8\""}' | bash
Peter Mortensen
skeetastax
0

You can extract the encoding of a single file with the file command. I have a sample.html file with:

$ file sample.html 

sample.html: HTML document, UTF-8 Unicode text, with very long lines

$ file -b sample.html

HTML document, UTF-8 Unicode text, with very long lines

$ file -bi sample.html

text/html; charset=utf-8

$ file -bi sample.html  | awk -F'=' '{print $2 }'

utf-8

Daniel Faure
0

The file command is not able to do this.

– Both yes and no. The following will do the job, but isn't completely reliable: 1

file -i * | grep -v iso-8859-1

It returns the non-ISO-8859-1 encoded files in the current directory – the ones you want to move.
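
To actually move those files, a sketch (assuming file names without colons or newlines, since the output of file is split on the colon; the destination directory is just an example, and remember that pure-ASCII files are reported as us-ascii and would be moved too):

mkdir -p ../non-iso-8859-1
file -i * | grep -v iso-8859-1 | cut -d: -f1 | while IFS= read -r f; do
  mv "$f" ../non-iso-8859-1/
done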


1 There is a caveat, which has to do with the file command not being reliable. In short, as long as each file is smaller than 64 kB (< 63 KiB), my solution here should be fine. But for files larger than 64 kB, you cannot trust it. There is some probability (maybe small but still positive), that my solution falsely reports some non-ASCII files as being pure ASCII.
The risk increases if you have very few non-ASCII characters in "large" files.
To reproduce, the command
dd if=/dev/zero bs=64000 count=1 | tr '\0' 'a' | fold >/tmp/demo64k; echo $'\xff' >>/tmp/demo64k && file -i /tmp/demo64k
creates the file /tmp/demo64k, which has the non-ASCII character ÿ as its last character.
The file command correctly identifies /tmp/demo64k as an ISO-8859-1 encoded file.
By contrast, the command
dd if=/dev/zero bs=65000 count=1 | tr '\0' 'a' | fold >/tmp/demo65k; echo $'\xff' >>/tmp/demo65k && file -i /tmp/demo65k
creates the file /tmp/demo65k, which also has the non-ASCII character ÿ as its last character.
But this time, the file command falsely identifies /tmp/demo65k as an ASCII-encoded file.
Credit goes to this comment for pointing this out to me. Read the comments below this post if you want more details!

Henke
  • 1
    `file` is not reliable; it only examines the beginning of the file. It will often identify files as ASCII if they only contain a few non-ASCII characters, and then of course cannot reliably identify the actual encoding between various legacy 8-bit encodings. – tripleee Mar 24 '23 at 17:40
  • @tripleee, *It will often identify files as ASCII if they only contain a few non-ASCII characters* I find that super interesting. Could you give me an example file, containing at least *one* non-ASCII character, such that the `file` command identifies it as a pure ASCII file? – Henke Mar 27 '23 at 11:04
  • 1
    `dd if=/dev/zero bs=63336 count=2 | tr '\0' 'a' | fold >/tmp/test; printf '\xff' >>/tmp/test`. With `count=1` I can't repro so the boundary is somewhere between 64k+1 and 128k+1. – tripleee Mar 27 '23 at 11:12
  • Wow. Thanks! Yes, I can confirm. For `$ dd if=/dev/zero bs=64000 count=1 | tr '\0' 'a' | fold >/tmp/test; echo $'\xff' >>/tmp/test`, the command `$ file -i /tmp/test` *correctly* reports *iso-8859-1*. But for `$ dd if=/dev/zero bs=65000 count=1 | tr '\0' 'a' | fold >/tmp/test; echo $'\xff' >>/tmp/test`, it ***falsely*** reports *us-ascii*. This is good to know. Thanks! – Henke Mar 27 '23 at 12:41
  • 1
    Oh, I see now that I misspelled 65536 (-: So the actual boundary is apparently slightly less than 64k. – tripleee Mar 27 '23 at 12:43
  • If I allow myself to play devil's advocate: it's actually slightly below 64 *KiB*, but still above 64 kB. :-) But never mind the details. I'm happy you pointed this out! Do you have any good (online) references about the `file` command being unreliable? – Henke Mar 27 '23 at 12:52
  • 1
    Not really, just practical experience, and a basic understanding of how it works. – tripleee Mar 27 '23 at 14:41
-3

With Perl, use Encode::Detect.
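
A minimal one-liner sketch, assuming the module is installed from CPAN (Encode::Detect::Detector exposes the underlying detect function, which takes raw bytes and returns the guessed charset name, or undef if it cannot tell):

perl -MEncode::Detect::Detector -0777 -ne 'print +(Encode::Detect::Detector::detect($_) // "unknown"), "\n"' myfile.txt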

manu_v