Removing diacritical marks from a Greek text in an automatic way

Question

I have a decompiled stardict dictionary in the form of a tab file

κακός <tab> bad

where <tab> signifies a tabulation.

Unfortunately, the way the words are defined requires the query to include all diacritical marks. So if I want to search for ζῷον, I need to have all the iotas and circumflexes correct.

Thus I'd like to convert the whole file so that the keyword has its diacritics removed, while the original spelling is kept in the definition. So the line would become

κακος <tab> <h3>κακός</h3> <br/> bad

I know I could read the file line by line in bash, as described here [1]

while read line           
do           
    command           
done <file

But is there any way to automate the conversion of each line? I heard about iconv [2] but didn't manage to achieve the desired conversion with it. I'd prefer a bash script.

Besides, is there an automatic way of transliterating Greek, e.g. using the method Perseus has?

Perseus' way of doing it

/edit: Maybe we could use the Unicode codes? We can notice that U+1F0x, U+1F8x for x < 8, etc. are all variants of the letter α. This would reduce the amount of manual work. I'd accept a C++ solution as well.

[1] http://en.kioskea.net/faq/1757-how-to-read-a-file-line-by-line
[2] How to remove all of the diacritics from a file?

Who is Perseus? I mean, I know who he *is*, but is he enough of a demigod to have a certain method attributed to him? — Jongware, May 22 '15 at 10:01
I'm not sure what you mean by "didn't make use of it". Are you saying you couldn't get `iconv` to do the conversion you need, OR do you mean that for some reason, you don't have `iconv` installed (and can't get it installed)? (Please update your question with this info rather than responding here and I'll delete this). Good luck. — shellter, May 22 '15 at 10:47
Regarding transliterating the Greek: that image is intended to help the user type in Greek on that site using similar glyphs, not always similar sounds. Those are poor transliterations. e.g. β is most often transliterated as v. ψ is ps. φ is ph, etc. — James Webster, May 22 '15 at 10:57

score 2 · Answer 1 · edited May 23 '17 at 12:22

I'm not so familiar with Ancient Greek as I am with Modern Greek (which only really uses two diacritics)

However I went through the vowels and found out which combined with diacritics. This gave me the following list:

ἆἂᾶὰάἀἄ 
ἒὲέἐἔ 
ἦἢῆὴήἠἤ 
ἶἲῖὶίἰἴ 
ὂὸόὀὄ 
ὖὒῦὺύὐὔ 
ὦὢῶὼώὠὤ

I saved this list as a file (test.txt) and passed it to this sed command:

cat test.txt | sed -e 's/[ἆἂᾶὰάἀἄ]/α/g;s/[ἒὲέἐἔ]/ε/g;s/[ἦἢῆὴήἠἤ]/η/g;s/[ἶἲῖὶίἰἴ]/ι/g;s/[ὂὸόὀὄ]/ο/g;s/[ὖὒῦὺύὐὔ]/υ/g;s/[ὦὢῶὼώὠὤ]/ω/g'

(Credit to hungnv)

It's a simple sed. It takes each of the options and replaces it with the unmarked character. The result of the above command is:

ααααααα
εεεεε
ηηηηηηη
ιιιιιιι
οοοοο
υυυυυυυ
ωωωωωωω
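The character lists above were typed by hand, so they can miss combinations. As a rough sketch (in Python rather than sed, and not part of the original answer), the lists could instead be generated from Unicode's own decomposition data. The names `base_letter` and `variants_by_base` are made up for illustration, and the scan is limited to the Greek and Greek Extended blocks:

```python
import unicodedata
from collections import defaultdict

def base_letter(ch):
    """Return the first character of the canonical decomposition (NFD) of ch."""
    return unicodedata.normalize("NFD", ch)[0]

def variants_by_base(bases="αεηιουω"):
    """Group precomposed characters from the Greek (U+0370..U+03FF) and
    Greek Extended (U+1F00..U+1FFF) blocks under their unmarked base letter.
    Only the letters listed in `bases` are collected."""
    groups = defaultdict(list)
    code_points = list(range(0x0370, 0x0400)) + list(range(0x1F00, 0x2000))
    for cp in code_points:
        ch = chr(cp)
        base = base_letter(ch)
        # Keep ch only if it decomposes to one of the wanted base letters
        # and is not that plain letter itself.
        if base in bases and base != ch:
            groups[base].append(ch)
    return groups

groups = variants_by_base()
for base, chars in groups.items():
    # Each printed line could be pasted into a sed bracket expression.
    print(base, "".join(chars))
```

Passing the capital vowels as `bases` would collect the upper-case variants in the same way.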

Regarding transliterating the Greek: the image from your post is intended to help the user type in Greek on the site you took it from using similar glyphs, not always similar sounds. Those are poor transliterations. e.g. β is most often transliterated as v. ψ is ps. φ is ph, etc.

answered May 22 '15 at 11:16 by James Webster · edited May 23 '17 at 12:22 by Community

James Webster: And what about ὧ? There are two spiritus, 3 accents and the subscript iota. For each possibly long vowel (all but ο and ε) we have 3*4*2 - 1 = 23 options. I'd rather not define this all by hand. The transliteration is meant to give 1:1 transliteration (one Latin char per one Greek char). Besides in Ancient Greek β was pronounced as b ;) Hence barbarian (βάρβαρος) and not varvarian :) But this is only an example, the key point is in __automatic__ – marmistrz May 22 '15 at 12:58
I was just using the polytonic keyboard and pressing key combinations. If the combination made a letter.. I kept it. – James Webster May 22 '15 at 13:22
Regarding the different options.. you would only need to input these combinations once. It seems easier than creating a program to create the combinations for you. Even if all 6 vowels had 23 options, that's only 138 characters you have to type. – James Webster May 22 '15 at 13:26
No, 2 times more. Since there are capitals too. – marmistrz May 25 '15 at 15:52
That's still fewer characters than you have to type to write a program to generate those characters.. Maybe. This would make a good question on PGC – James Webster May 25 '15 at 16:28
Can you give me an example for alpha and omicron which diacritics are valid please? – James Webster May 25 '15 at 16:30
I guess that replacing based on Unicode code points would be more efficient to code. But I don't really know how to do it in a sensible manner. http://unicode.org/charts/PDF/U1F00.pdf – marmistrz May 25 '15 at 17:50
This misses quite a few. – Lance Mar 13 '20 at 06:11

score 2 · Accepted Answer · answered May 26 '15 at 09:38

You can remove diacritics from a string relatively easily using Perl:

$_=NFKD($_);s/\p{InDiacriticals}//g;

for example:

$ echo 'ὦὢῶὼώὠὤ ᾪ' | perl -CS -MUnicode::Normalize -pne '$_=NFKD($_);s/\p{InDiacriticals}//g'
ωωωωωωω Ω

This works as follows:

-CS enables UTF-8 on Perl's standard streams (stdin, stdout and stderr)
-MUnicode::Normalize loads the Unicode::Normalize module, which provides NFKD()
-e executes the script given on the command line; -n loops over the input line by line; -p does the same loop and also prints each line after the script has run, so -n is redundant when -p is given
NFKD() converts the line to Unicode Normalization Form KD (compatibility decomposition); this means that each accented character is split into its base letter followed by separate combining marks, which makes the marks easy to remove in the next step
s/\p{InDiacriticals}//g removes every character in the Unicode block Combining Diacritical Marks (U+0300 to U+036F)
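
For readers who would rather not use Perl, the same two steps can be sketched in Python with the standard `unicodedata` module. This is only an illustration, not the code from the answer; the function name `strip_diacritics` is made up here, and the code-point range is an assumption about what `\p{InDiacriticals}` matches (the Combining Diacritical Marks block):

```python
import unicodedata

def strip_diacritics(text):
    """Decompose text with NFKD, then drop combining diacritical marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Assumption: mirror \p{InDiacriticals} by removing U+0300..U+036F only.
    return "".join(ch for ch in decomposed if not 0x0300 <= ord(ch) <= 0x036F)

print(strip_diacritics("ὦὢῶὼώὠὤ ᾪ"))
```

Filtering on `unicodedata.combining(ch)` instead of a fixed range would also catch combining marks that live outside that block, at the cost of no longer matching the Perl one-liner exactly.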

This should in fact work for removing diacritics etc for all scripts/languages that have good Unicode support, not just Greek.
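
Applied to the tab file from the question, the idea might look like the following Python sketch. It assumes one `headword<tab>definition` entry per line and UTF-8 files; `convert_line`, `convert_file` and the file paths are hypothetical names, and the output layout is copied from the example line in the question:

```python
import unicodedata

def strip_diacritics(text):
    """NFKD-decompose text and drop marks in U+0300..U+036F."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not 0x0300 <= ord(ch) <= 0x036F)

def convert_line(line):
    """Rewrite 'headword<TAB>definition' so that the key has no diacritics
    and the original headword is kept as an <h3> heading in the definition."""
    line = line.rstrip("\n")
    headword, tab, definition = line.partition("\t")
    if not tab:
        # No tab on this line: leave it unchanged.
        return line
    plain = strip_diacritics(headword)
    return f"{plain}\t<h3>{headword}</h3> <br/> {definition}"

def convert_file(src_path, dst_path):
    """Convert a whole tab file; both paths are supplied by the caller."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(convert_line(line) + "\n")

print(convert_line("κακός\tbad"))
```

Only the part before the first tab is stripped, so Greek text inside the definition keeps its accents.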
