Linux shell: conditional conversion of character encoding, multiple text files

Question

The situation: I have a bunch of text files (.csv, to be precise), around 20000 that differ in character encoding: file -i *.csv gives me charset=us-ascii for most, but some are utf-16le.

The goal: I want them all to be encoded the same way, us-ascii here. I am thinking of a one-liner that checks the encoding of each file in the directory and, if it is utf-16le, converts it to us-ascii.

I only started to learn bash programming a few days ago, so this one still escapes me. Is something like this possible: running file -i on each file (did that), capturing the output, checking what encoding is given and, if it is not us-ascii, converting the file?

Thanks for helping me understand how to do that!

flaschenpost · Answer 1 · 2013-05-13T06:18:22.950

The other solutions don't take the mixture of encodings into account. That calls for something along the lines of:

for F in *.csv; do
    # file -b --mime-encoding prints only the encoding name, e.g. utf-16le
    if [ "$(file -b --mime-encoding "$F")" = "utf-16le" ]; then
        iconv -f UTF-16 -t US-ASCII "$F" > "u.$F"
    fi
done

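What makes it easier is that the first 128 Unicode code points are identical to us-ascii, so a utf-16 file that contains only those characters converts to us-ascii without loss. The byte layout differs, though (utf-16 uses two bytes per character), so the conversion should only be applied to files that file actually reports as utf-16.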
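
As a quick illustration of how the two encodings relate at the byte level, here is a minimal sketch, assuming iconv and od are installed. The helper name hex_bytes is made up for this example and is not part of any tool.

```shell
#!/bin/bash
# hex_bytes TEXT ENCODING
# Hypothetical helper: print the bytes of an ASCII string, re-encoded
# in ENCODING, as a continuous hex string.
hex_bytes() {
    printf '%s' "$1" | iconv -f US-ASCII -t "$2" | od -An -tx1 | tr -d ' \n'
}

hex_bytes 'A' US-ASCII; echo    # prints 41
hex_bytes 'A' UTF-16LE; echo    # prints 4100
```

The character value is the same in both cases, but utf-16le stores it in two bytes, which is why a file has to be identified correctly before it is fed to iconv.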

score 1 · Answer 2 · edited May 23 '17 at 12:15

Please try the following command:

iconv -f FROM-ENCODING -t TO-ENCODING *.csv

and replace FROM-ENCODING and TO-ENCODING with appropriate values.

You can use the following script, or something similar for your needs.

for file in *.csv
do
    iconv -f FROM-ENCODING -t TO-ENCODING "$file" > "$file.new"
done

You can also use the recode command.

recode FROM-ENCODING..TO-ENCODING file.csv

Finally, see the question "Best way to convert text files between character sets?" if you are interested in learning more about iconv and/or recode.

answered May 12 '13 at 21:03

Bill

Parsing the output of ls is harmful, use globbing. – Adrian Frühwirth May 12 '13 at 22:12
@AdrianFrühwirth Yes, when filenames have spaces, this can be a problem....thanks. – Bill May 12 '13 at 22:15
You also need to quote your variables, otherwise it doesn't fix anything ;-) – Adrian Frühwirth May 12 '13 at 22:18
I only see $file, but if there are others feel free to quote them as well. Always quote, especially when dealing with filenames. – Adrian Frühwirth May 12 '13 at 22:24
Sorry, but you are wrong. Try yourself with a file that contains spaces. – Adrian Frühwirth May 12 '13 at 22:29
@AdrianFrühwirth Thanks a lot for helping to improve the answer :) – Bill May 12 '13 at 22:35

rzymek · Answer 3 · 2013-05-13T07:28:27.610

This will convert any non-us-ascii encoded *.csv files to us-ascii:

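#!/bin/bash
for f in *.csv; do
    # Extract the charset that file reports for this file, e.g. "utf-16le"
    charset=$(file -i "$f" | grep -o 'charset=.*' | cut -d= -f2)
    if [ "$charset" != "us-ascii" ]; then
        echo "$f $charset -> us-ascii"
        # Write to a temporary file and replace the original only on success
        iconv -f "$charset" -t us-ascii < "$f" > "$f.tmp" \
            && mv "$f.tmp" "$f"
    fi
done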
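
A variant of the same idea, wrapped in functions so it can be pointed at a directory, might look like the sketch below. It is not a definitive implementation. The function names convert_if_utf16 and convert_all_csv are made up for this example. It assumes that file supports -b --mime-encoding and that the UTF-16 files start with a byte order mark, so that iconv -f UTF-16 can pick the byte order itself.

```shell
#!/bin/bash
# Convert one file in place, but only if file(1) reports it as UTF-16.
# Hypothetical helper name; assumes the file starts with a byte order mark.
convert_if_utf16() {
    local f=$1 enc
    enc=$(file -b --mime-encoding "$f") || return 1
    case $enc in
        utf-16le|utf-16be)
            # Write to a temporary file; replace the original only on success
            if iconv -f UTF-16 -t US-ASCII "$f" > "$f.tmp"; then
                mv "$f.tmp" "$f"
            else
                rm -f "$f.tmp"
                return 1
            fi
            ;;
    esac
}

# Apply the conversion to every .csv file in a directory (default: current).
convert_all_csv() {
    local dir=${1:-.} f
    for f in "$dir"/*.csv; do
        [ -e "$f" ] || continue    # the glob matched nothing
        convert_if_utf16 "$f" || echo "could not convert $f" >&2
    done
}
```

Calling convert_all_csv with no argument processes the current directory. Files reported as us-ascii are left untouched, and a file that contains characters outside us-ascii makes iconv fail, in which case the original is kept.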

edited May 13 '13 at 07:28

answered May 12 '13 at 21:17

Please quote your variables to account for spaces in filenames. – Adrian Frühwirth May 12 '13 at 22:36
