How to remove all of the diacritics from a file?

Question

I have a file containing many vowels with diacritics. I need to make these replacements:

Replace ā, á, ǎ, and à with a.
Replace ē, é, ě, and è with e.
Replace ī, í, ǐ, and ì with i.
Replace ō, ó, ǒ, and ò with o.
Replace ū, ú, ǔ, and ù with u.
Replace ǖ, ǘ, ǚ, and ǜ with ü.
Replace Ā, Á, Ǎ, and À with A.
Replace Ē, É, Ě, and È with E.
Replace Ī, Í, Ǐ, and Ì with I.
Replace Ō, Ó, Ǒ, and Ò with O.
Replace Ū, Ú, Ǔ, and Ù with U.
Replace Ǖ, Ǘ, Ǚ, and Ǜ with Ü.

I know I can replace them one at a time with this:

sed -i 's/ā/a/g' ./file.txt

Is there a more efficient way to replace all of these?

sed is possibly not the best tool for this job; iconv is probably better. See: http://stackoverflow.com/questions/8562354/remove-unicode-characters-from-textfiles-sed-other-bash-shell-methods – Wooble, Apr 18 '12 at 10:26

score 79 · Accepted Answer · edited Sep 21 '20 at 10:07

If you check the man page of the tool iconv:

//TRANSLIT
When the string "//TRANSLIT" is appended to --to-code, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.

So we could do:

kent$  cat test1
    Replace ā, á, ǎ, and à with a.
    Replace ē, é, ě, and è with e.
    Replace ī, í, ǐ, and ì with i.
    Replace ō, ó, ǒ, and ò with o.
    Replace ū, ú, ǔ, and ù with u.
    Replace ǖ, ǘ, ǚ, and ǜ with ü.
    Replace Ā, Á, Ǎ, and À with A.
    Replace Ē, É, Ě, and È with E.
    Replace Ī, Í, Ǐ, and Ì with I.
    Replace Ō, Ó, Ǒ, and Ò with O.
    Replace Ū, Ú, Ǔ, and Ù with U.
    Replace Ǖ, Ǘ, Ǚ, and Ǜ with U.


kent$  iconv -f utf8 -t ascii//TRANSLIT test1
    Replace a, a, a, and a with a.
    Replace e, e, e, and e with e.
    Replace i, i, i, and i with i.
    Replace o, o, o, and o with o.
    Replace u, u, u, and u with u.
    Replace u, u, u, and u with u.
    Replace A, A, A, and A with A.
    Replace E, E, E, and E with E.
    Replace I, I, I, and I with I.
    Replace O, O, O, and O with O.
    Replace U, U, U, and U with U.
    Replace U, U, U, and U with U.

edited Sep 21 '20 at 10:07 · beardhatcode
answered Apr 18 '12 at 10:35 · Kent

4

This works well, except I only want the marks to disappear from the ü, but not the umlaut. – Village Apr 18 '12 at 11:07
Kent, I wanted to add a direct link for "the" man page for `iconv` -- but none of the ones I found contained that particular quote. Would you like to add where you got it from? – Jongware May 22 '15 at 10:04
1

From `man iconv`; the answer also mentions the iconv man page. My current version is `iconv (GNU libc) 2.21`, but the answer was posted three years ago and I don't know which version I had then. @Jongware – Kent May 22 '15 at 10:36
19

`echo 'á' | iconv -f utf8 -t ascii//TRANSLIT` gives me `'a` instead of `a` on macOS default iconv (GNU libiconv 1.11) – nloveladyallen Dec 05 '17 at 23:30
A side note on this answer: Check the character set of the target file when you get the _iconv: illegal input sequence at position ..._ error. Suppose you export a CSV file from Microsoft Excel, run `file -i test2.csv` and see `charset=iso-8859-1`, then use `-f iso-8859-1` instead of `-f utf8`. – Culip Jul 02 '20 at 10:01
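
Following up on the charset note above, here is a minimal sketch of normalizing a file to UTF-8 before any transliteration. The helper name `to_utf8` is hypothetical. It assumes your `file` supports `-b --mime-encoding` and reports a name that `iconv -f` accepts, which is not guaranteed (a result such as `unknown-8bit` or `binary` would be rejected).

```sh
#!/bin/sh
# to_utf8 FILE [CHARSET]  (hypothetical helper, not part of iconv)
# Prints FILE re-encoded as UTF-8.
# If CHARSET is omitted, file(1) is asked to guess the encoding.
to_utf8() {
    # Assumption: file(1) understands -b --mime-encoding on this system.
    charset=${2:-$(file -b --mime-encoding "$1")}
    iconv -f "$charset" -t utf-8 "$1"
}
```

Something like `to_utf8 test2.csv iso-8859-1 | iconv -f utf8 -t ascii//TRANSLIT` would then feed the answer's command with known-good UTF-8 input.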

score 19 · Answer 2 · answered Apr 18 '12 at 13:30

This might work for you:

sed -i 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/' file

answered Apr 18 '12 at 13:30 · potong

Interestingly, if you are on a Mac you will have to add the -e flag to the command line. More info: http://stackoverflow.com/questions/16745988/sed-command-works-fine-on-ubuntu-but-not-mac – Mr Washington Sep 15 '16 at 14:31
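
On the macOS point: BSD sed expects a backup suffix right after `-i`, so a GNU-style `sed -i '...' file` is parsed differently there. One hedged way to sidestep the difference is to avoid `-i` altogether and write through a temporary file. The helper name `sed_in_place` is hypothetical, and `mktemp` is assumed to be available.

```sh
#!/bin/sh
# sed_in_place FILE SED_ARGS...  (hypothetical helper)
# Runs sed with SED_ARGS on FILE and writes the result back to FILE
# without relying on the non-portable -i option.
sed_in_place() {
    file=$1
    shift
    tmp=$(mktemp) || return 1
    if sed "$@" "$file" > "$tmp"; then
        # Overwrite the contents only, so FILE keeps its permissions.
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        # sed failed: leave FILE untouched.
        rm -f "$tmp"
        return 1
    fi
}
```

Usage would look like `sed_in_place ./file.txt 'y/āáǎà/aaaa/'`, run in a UTF-8 locale so that sed treats each accented letter as one character.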
2

macosx: `sed -e 'y/āáǎàçēéěèīíǐìōóǒòūúǔùǖǘǚǜüĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛÜ/aaaaceeeeiiiioooouuuuuuuuuAAAAEEEEIIIIOOOOUUUUUUUUU/' file` Note: for my need, I didn't keep the ü character. – leontalbot Apr 23 '18 at 15:14
1

The advantage with "sed" is it's almost everywhere. Just an improved version: `-e 'y/āáǎàēéěèīíǐìïōóǒòöūúǔùǖǘǚǜüĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛÜÇçÑñ/aaaaeeeeiiiiiooooouuuuuuuuuAAAAEEEEIIIIOOOOUUUUUUUUUCcNn/'` – ATorras Mar 18 '21 at 19:32
1

Added the circumflex accent, used in French. Like `ê`. `'y/āáǎàâēéěèêīíǐìïîōóǒòöôūúǔùǖǘǚǜüûĀÁǍÀĒÉĚÈÊĪÍǏÌÎŌÓǑÒÔŪÚǓÙǕǗǙǛÜÛÇçÑñ/aaaaaeeeeeiiiiiioooooouuuuuuuuuuAAAAEEEEEIIIIIOOOOOUUUUUUUUUUCcNn/'` – Nic3500 Dec 08 '22 at 18:48
1

Added the trema accent to @Nic3500's version, used in French. Like ë.
'y/āáǎàâēéěèêëīíǐìïîōóǒòöôūúǔùǖǘǚǜüûĀÁǍÀĒÉĚÈÊËĪÍǏÌÎŌÓǑÒÔŪÚǓÙǕǗǙǛÜÛÇçÑñ/aaaaaeeeeeeiiiiiioooooouuuuuuuuuuAAAAEEEEEEIIIIIOOOOOUUUUUUUUUUCcNn/' – FrViPofm May 02 '23 at 08:44
Hmm... reorganized (with Ñ and Û added so that both strings line up character for character):
`string1="āáǎàâäçēéěèêëīíǐìîïñōóǒòôöūúǔùûǖǘǚǜĀÁǍÀÂÄÇĒÉĚÈÊËĪÍǏÌÎÏÑŌÓǑÒÔÖŪÚǓÙÛǕǗǙǛ"`
`string2="aaaaaaceeeeeeiiiiiinoooooouuuuuuuuuAAAAAACEEEEEEIIIIIINOOOOOOUUUUUUUUU"`
Test:
`echo "$string1" | sed -e "y/$string1/$string2/"` – FrViPofm May 02 '23 at 09:14

score 14 · Answer 3 · edited Sep 03 '13 at 06:32

I like iconv, as it handles all accent variations:

cat non-ascii.txt | iconv -f utf8 -t ascii//TRANSLIT//IGNORE > ascii.txt

edited Sep 03 '13 at 06:32
answered Sep 02 '13 at 15:56 · Fedir RYKHTIK

2

This converted `Ángel` into `'angel` for me. :( – Heath Borders Aug 27 '21 at 15:53
Same for me, but I still prefer this solution over the others, provided the leftover non-letter characters are stripped afterwards, for example by adding a sed command such as "s/[^a-zA-Z]//g". The pipeline becomes: cat non-ascii.txt | iconv -f utf8 -t ascii//TRANSLIT//IGNORE | sed "s/[^a-zA-Z]//g" > ascii.txt – Djeefther Souza Feb 16 '22 at 05:08

score 2 · Answer 4 · answered Apr 18 '12 at 10:27

This is what the tr(1) command is for. For example:

tr 'āáǎàēéěèīíǐì...' 'aaaaeeeeiii...' <infile >outfile

You may have to check/change your LANG environment variable to match the character set being used.

answered Apr 18 '12 at 10:27 · ktf

score 2 · Answer 5 · answered Nov 22 '19 at 08:58

#!/bin/bash
INPUT="$1"
declare -a acc
declare -a noa
acc=('$' 'Ã¨' 'Ãª' 'Ã©' 'À' 'Á' 'Â' 'Ã' 'Ä' 'Å' 'Æ' 'Ç' 'È' 'É' 'Ê' 'Ë' 'Ì' 'Í' 'Î' 'Ï' 'Ð' 'Ñ' 'Ò' 'Ó' 'Ô' 'Õ' 'Ö' 'Ø' 'Ù' 'Ú' 'Û' 'Ü' 'Ý' 'ß' 'à' 'á' 'â' 'ã' 'ä' 'å' 'æ' 'ç' 'è' 'é' 'ê' 'ë' 'ì' 'í' 'î' 'ï' 'ñ' 'ò' 'ó' 'ô' 'õ' 'ö' 'ø' 'ù' 'ú' 'û' 'ü' 'ý' 'ÿ' 'Ā' 'ā' 'Ă' 'ă' 'Ą' 'ą' 'Ć' 'ć' 'Ĉ' 'ĉ' 'Ċ' 'ċ' 'Č' 'č' 'Ď' 'ď' 'Đ' 'đ' 'Ē' 'ē' 'Ĕ' 'ĕ' 'Ė' 'ė' 'Ę' 'ę' 'Ě' 'ě' 'Ĝ' 'ĝ' 'Ğ' 'ğ' 'Ġ' 'ġ' 'Ģ' 'ģ' 'Ĥ' 'ĥ' 'Ħ' 'ħ' 'Ĩ' 'ĩ' 'Ī' 'ī' 'Ĭ' 'ĭ' 'Į' 'į' 'İ' 'ı' 'Ĳ' 'ĳ' 'Ĵ' 'ĵ' 'Ķ' 'ķ' 'Ĺ' 'ĺ' 'Ļ' 'ļ' 'Ľ' 'ľ' 'Ŀ' 'ŀ' 'Ł' 'ł' 'Ń' 'ń' 'Ņ' 'ņ' 'Ň' 'ň' 'ŉ' 'Ō' 'ō' 'Ŏ' 'ŏ' 'Ő' 'ő' 'Œ' 'œ' 'Ŕ' 'ŕ' 'Ŗ' 'ŗ' 'Ř' 'ř' 'Ś' 'ś' 'Ŝ' 'ŝ' 'Ş' 'ş' 'Š' 'š' 'Ţ' 'ţ' 'Ť' 'ť' 'Ŧ' 'ŧ' 'Ũ' 'ũ' 'Ū' 'ū' 'Ŭ' 'ŭ' 'Ů' 'ů' 'Ű' 'ű' 'Ų' 'ų' 'Ŵ' 'ŵ' 'Ŷ' 'ŷ' 'Ÿ' 'Ź' 'ź' 'Ż' 'ż' 'Ž' 'ž' 'ſ' 'ƒ' 'Ơ' 'ơ' 'Ư' 'ư' 'Ǎ' 'ǎ' 'Ǐ' 'ǐ' 'Ǒ' 'ǒ' 'Ǔ' 'ǔ' 'Ǖ' 'ǖ' 'Ǘ' 'ǘ' 'Ǚ' 'ǚ' 'Ǜ' 'ǜ' 'Ǻ' 'ǻ' 'Ǽ' 'ǽ' 'Ǿ' 'ǿ');
noa=('S' 'e' 'e' 'e' 'A' 'A' 'A' 'A' 'A' 'A' 'AE' 'C' 'E' 'E' 'E' 'E' 'I' 'I' 'I' 'I' 'D' 'N' 'O' 'O' 'O' 'O' 'O' 'O' 'U' 'U' 'U' 'U' 'Y' 's' 'a' 'a' 'a' 'a' 'a' 'a' 'ae' 'c' 'e' 'e' 'e' 'e' 'i' 'i' 'i' 'i' 'n' 'o' 'o' 'o' 'o' 'o' 'o' 'u' 'u' 'u' 'u' 'y' 'y' 'A' 'a' 'A' 'a' 'A' 'a' 'C' 'c' 'C' 'c' 'C' 'c' 'C' 'c' 'D' 'd' 'D' 'd' 'E' 'e' 'E' 'e' 'E' 'e' 'E' 'e' 'E' 'e' 'G' 'g' 'G' 'g' 'G' 'g' 'G' 'g' 'H' 'h' 'H' 'h' 'I' 'i' 'I' 'i' 'I' 'i' 'I' 'i' 'I' 'i' 'IJ' 'ij' 'J' 'j' 'K' 'k' 'L' 'l' 'L' 'l' 'L' 'l' 'L' 'l' 'l' 'l' 'N' 'n' 'N' 'n' 'N' 'n' 'n' 'O' 'o' 'O' 'o' 'O' 'o' 'OE' 'oe' 'R' 'r' 'R' 'r' 'R' 'r' 'S' 's' 'S' 's' 'S' 's' 'S' 's' 'T' 't' 'T' 't' 'T' 't' 'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'W' 'w' 'Y' 'y' 'Y' 'Z' 'z' 'Z' 'z' 'Z' 'z' 's' 'f' 'O' 'o' 'U' 'u' 'A' 'a' 'I' 'i' 'O' 'o' 'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'U' 'u' 'A' 'a' 'AE' 'ae' 'O' 'o');

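# Walk the input one character at a time. This needs a UTF-8 locale so that
# ${INPUT:$i:1} yields whole characters rather than single bytes.
OUTPUT=""
i=0
length=${#INPUT}
while [[ $i -lt $length ]]; do
    char=${INPUT:$i:1}
    j=0
    for letter in "${acc[@]}"; do
        if [[ "$letter" == "$char" ]]; then
            char="${noa[$j]}"   # replace with the unaccented equivalent
            break               # no later entry can match once replaced
        fi
        ((j++))
    done
    ((i++))
    OUTPUT=$OUTPUT$char
done
# Quote the result so that whitespace in the input is preserved.
echo "$OUTPUT"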

How can this read from a file? Thanks – jat Dec 08 '21 at 15:40

score 1 · Answer 6 · answered Apr 18 '12 at 10:36

You can use something like this:

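  sed -e 's/[àâ]/a/g;s/[ọõ]/o/g;s/[íì]/i/g;s/[êệ]/e/g'

Just add more characters inside the [...] as needed. Do not separate them with commas: a comma inside the brackets is matched literally and would be replaced too.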
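
For the twelve rules in the question, the same idea can be spelled out with one `s` command per target letter. This is only a sketch: the function name `strip_tones` is made up, it uses alternation (`-E`) instead of bracket expressions so that matching does not depend on the locale treating each accented letter as a single character, and it assumes UTF-8 input with precomposed characters (separate combining marks would pass through unchanged).

```sh
#!/bin/sh
# strip_tones [FILE...]  (hypothetical helper)
# Removes the tone marks listed in the question and keeps the umlaut on ü/Ü.
# Reads standard input when no FILE is given.
strip_tones() {
    sed -E \
        -e 's/ā|á|ǎ|à/a/g' \
        -e 's/ē|é|ě|è/e/g' \
        -e 's/ī|í|ǐ|ì/i/g' \
        -e 's/ō|ó|ǒ|ò/o/g' \
        -e 's/ū|ú|ǔ|ù/u/g' \
        -e 's/ǖ|ǘ|ǚ|ǜ/ü/g' \
        -e 's/Ā|Á|Ǎ|À/A/g' \
        -e 's/Ē|É|Ě|È/E/g' \
        -e 's/Ī|Í|Ǐ|Ì/I/g' \
        -e 's/Ō|Ó|Ǒ|Ò/O/g' \
        -e 's/Ū|Ú|Ǔ|Ù/U/g' \
        -e 's/Ǖ|Ǘ|Ǚ|Ǜ/Ü/g' \
        "$@"
}
```

For example, `strip_tones file.txt > stripped.txt` writes a cleaned copy and leaves the original untouched.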

answered Apr 18 '12 at 10:36 · hungnv

score 1 · Answer 7 · answered Jun 29 '16 at 16:05

If, like me, you need to replace accents only in specific places in your file, you can do that with this kind of regex:

echo '{"doNotReplaceKey":"bábögêjírù","replaceValueKey":"bábögêjírù","anotherNotReplaceKey":"bábögêjírù"}' \
    | sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[áâàãä]/replaceValueKey":"\1a/g;ta' \
    | sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[éêèë]/replaceValueKey":"\1e/g;ta'  \
    | sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[íîìï]/replaceValueKey":"\1i/g;ta'  \
    | sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[óôòõö]/replaceValueKey":"\1o/g;ta' \
    | sed -e ':a;s/replaceValueKey":"\([a-zA-Z0-9 -_]*\)[úûùü]/replaceValueKey":"\1u/g;ta'

Output

{"doNotReplaceKey":"bábögêjírù","replaceValueKey":"babogejiru","anotherNotReplaceKey":"bábögêjírù"}

score 1 · Answer 8 · answered Jul 09 '16 at 21:57

You can use man iso_8859_1 (or your char set) or od -bc to identify the octal representation of the diacritic. Then use gawk to do the replacing.

{ gsub(/\344/, "a"); print $0 }

This replaces ä with a.

answered Jul 09 '16 at 21:57 · Rich Traube

score 0 · Answer 9 · answered Dec 02 '13 at 16:23

This may not work simply because your locale is not set.

Set LC_ALL to a suitable locale, for example:

export LC_ALL=en_US.iso88591

Note that the full list of locales is available through:

locale -a

answered Dec 02 '13 at 16:23 · Bruno

score 0 · Answer 10 · edited Jun 13 '22 at 01:12

If you want to know which solution is the fastest:

Text Transliteration: using tr : 5.3 MB/s

Text Transliteration: using sed: 70.3 MB/s

Text Transliteration: using iconv: 35.2 MB/s

So the sed 'y/[diacritics]/[transliterated]/' command is the fastest by far!

(code on github.com/pforret/bash_benchmarks )

edited Jun 13 '22 at 01:12 · devstuff
answered Mar 20 '22 at 21:46 · Peter Forret
