
I have a bunch of text files with different encodings, and I want to convert all of them into UTF-8. Since there are about 1000 files, I can't do it manually. I know that there are commands in Linux that convert files from one encoding to another, but my question is: how can I automatically detect the current encoding of a file? Clearly, I'm looking for a command (say FindEncoding($File)) to do this:

for file in *; do
    encoding=$(FindEncoding "$file")
    uconv -f "$encoding" -t utf-8 "$file"
done
Hakim

1 Answer


I usually do something like this:

for f in *.txt; do
    encoding=$(file -i "$f" | sed "s/.*charset=\(.*\)$/\1/")
    recode "$encoding..utf-8" "$f"
done
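If recode is not installed, iconv (shipped with glibc) can perform the same conversion. A sketch under that assumption, with the caveat that iconv writes to standard output rather than editing the file in place:

```shell
# Sketch using iconv instead of recode (assumption: GNU file and
# iconv are available). iconv writes to stdout, so the result goes
# through a temporary file before replacing the original.
for f in *.txt; do
    encoding=$(file -bi "$f" | sed 's/.*charset=//')
    iconv -f "$encoding" -t utf-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
```

The temporary-file step matters: redirecting iconv straight back onto "$f" would truncate the input before it is read.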

Note that recode overwrites the file in place when changing the character encoding. If the text files cannot be identified by their extension, their mime type can be determined with file -bi "$f" | cut -d ';' -f 1.
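For instance, the loop can be restricted to plain-text files regardless of extension. A sketch, assuming GNU file's -bi output has the form mime/type; charset=...:

```shell
# Process every file, but skip anything whose mime type is not
# text/plain (binaries, images, etc.).
for f in *; do
    mime=$(file -bi "$f" | cut -d ';' -f 1)
    [ "$mime" = "text/plain" ] || continue
    echo "text file: $f"   # conversion would go here
done
```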

It is also probably a good idea to avoid unnecessary re-encoding by checking for UTF-8 first:

if [ "$encoding" != "utf-8" ]; then
    recode "$encoding..utf-8" "$f"
fi

After this treatment, some files may still be reported with a us-ascii encoding. The reason is that ASCII is a subset of UTF-8: a file containing only ASCII characters is already valid UTF-8, and its detected encoding only switches to utf-8 once characters outside ASCII are introduced.
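This can be observed directly with file. A sketch, assuming GNU file's -bi output:

```shell
# The same file is reported as us-ascii until a non-ASCII
# character appears, then as utf-8.
printf 'hello\n' > demo.txt
file -bi demo.txt                 # text/plain; charset=us-ascii
printf '\303\251\n' >> demo.txt   # append "é" encoded in UTF-8
file -bi demo.txt                 # text/plain; charset=utf-8
rm demo.txt
```

So us-ascii results after conversion are harmless; those files need no further processing.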

J. Katzwinkel