How to generate a diacritized vowel table automatically?

Question

I want a table of vowels with diacritics, but I don't want to search symbol tables manually.

Is it possible to generate this table by crossing a list of vowels with a list of diacritics in one of the following languages: Java, PHP, Wolfram Mathematica, .NET languages and so on?

I need Unicode characters as output.

Java Solution

I found that there is a special Unicode feature for this, Unicode normalization: http://en.wikipedia.org/wiki/Unicode_normalization

Java has supported it since 1.6 through java.text.Normalizer: http://docs.oracle.com/javase/6/docs/api/java/text/Normalizer.html

So, the sample code is:

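import java.text.Normalizer;
import javax.swing.JOptionPane;

public class DiacritizedVowels {
    public static void main(String[] args) {
        String vowels = "aeiou";
        // Combining macron, acute, grave, and caron.
        char[] diacritics = {'\u0304', '\u0301', '\u0300', '\u030C'};
        StringBuilder sb = new StringBuilder();

        for (int v = 0; v < vowels.length(); ++v) {
            for (int d = 0; d < diacritics.length; ++d) {
                // Base vowel followed by a combining mark.
                sb.append(vowels.charAt(v));
                sb.append(diacritics[d]);
                sb.append(' ');
            }
            // End each row with the plain vowel.
            sb.append(vowels.charAt(v));
            sb.append('\n');
        }

        // NFC composes each vowel + mark pair into a precomposed character.
        String ans = Normalizer.normalize(sb.toString(), Normalizer.Form.NFC);

        JOptionPane.showMessageDialog(null, ans);
    }
}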

That is, we put a combining diacritic after each vowel and then apply NFC normalization to the string, which merges each pair into a single precomposed character where one exists.
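Not every base letter and combining mark has a precomposed form, so it can help to check what the normalizer actually did. The following is a minimal sketch, not part of the original solution; the class and method names (ComposeDemo, compose, hasPrecomposedForm) are made up for illustration. It only relies on java.text.Normalizer and String.codePointCount.

```java
import java.text.Normalizer;

class ComposeDemo {
    // Returns the NFC form of a base letter followed by a combining mark.
    static String compose(char base, char mark) {
        return Normalizer.normalize("" + base + mark, Normalizer.Form.NFC);
    }

    // True when NFC merged the pair into a single precomposed code point.
    static boolean hasPrecomposedForm(char base, char mark) {
        String s = compose(base, mark);
        return s.codePointCount(0, s.length()) == 1;
    }

    public static void main(String[] args) {
        // a + combining macron composes to U+0101
        System.out.println(hasPrecomposedForm('a', '\u0304')); // prints true
        // q + combining acute has no precomposed character
        System.out.println(hasPrecomposedForm('q', '\u0301')); // prints false
    }
}
```

When hasPrecomposedForm returns false, the string still renders as a letter with a mark in most fonts; it simply stays two code points long.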

You can try to extract the information from http://unicode.org/Public/UNIDATA/NamesList.txt I assume you only want Roman letters. Anything matching LATIN SMALL|CAPITAL LETTER A|E|I|O|U WITH should be relevant. I do not know how robust this is and if you want things like ø. Also, beware that Mathematica doesn't properly support Unicode outside the Basic Multilingual Plane: http://stackoverflow.com/questions/5597013/reading-an-utf-8-encoded-text-file-in-mathematica — Szabolcs, Jan 08 '12 at 12:46
Also, what about things like æ? Do you consider it a vowel (it certainly is in Norwegian) or not? — Szabolcs, Jan 08 '12 at 12:49
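The name-matching idea from these comments can also be tried in Java without downloading NamesList.txt, because Character.getName (available since Java 7) exposes the character names bundled with the JDK. This is only a sketch under that assumption: the class and method names are hypothetical, and the result depends on the Unicode version of the running JDK, so it may differ from the current NamesList.txt.

```java
import java.util.regex.Pattern;

class NamedVowels {
    // Same name pattern as suggested in the comment above.
    static final Pattern NAME =
        Pattern.compile("LATIN (SMALL|CAPITAL) LETTER [AEIOU] WITH .*");

    static boolean isNamedVowelWithMark(int codePoint) {
        String name = Character.getName(codePoint); // null if unassigned
        return name != null && NAME.matcher(name).matches();
    }

    // Collects all matching characters in the Basic Multilingual Plane.
    static String collect() {
        StringBuilder sb = new StringBuilder();
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (isNamedVowelWithMark(cp)) {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(collect());
    }
}
```

Because the pattern requires a single vowel letter followed by " WITH ", it keeps ø (LATIN SMALL LETTER O WITH STROKE) but skips æ (LATIN SMALL LETTER AE), which matches the caveats raised in the comments.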

score 4 · Accepted Answer · edited Jan 13 '12 at 00:50

To be honest, I haven't completely deciphered what Szabolcs' code is doing, but in this particular case the following seems to produce the same result in Mathematica using slightly less code:

data = Import["http://unicode.org/Public/UNIDATA/NamesList.txt", "Lines"];

codes = Cases[data, 
 b_String /; StringMatchQ[
  b, ___ ~~ "LATIN " ~~ "CAPITAL" | "SMALL" ~~ " LETTER " ~~ 
   "A" | "E" | "I" | "O" | "U" ~~ " WITH " ~~ ___] :> 
    FromDigits[StringTake[b, 4], 16], Infinity];

FromCharacterCode[codes]

which produces

"ÀÁÂÃÄÅÈÉÊËÌÍÎÏÒÓÔÕÖØÙÚÛÜàáâãäåèéêëìíîïòóôõöøùúûüĀāĂăĄąĒēĔĕĖėĘęĚěĨĩĪīĬ\
ĭĮįİŌōŎŏŐőŨũŪūŬŭŮůŰűŲųƗƟƠơƯưǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǞǟǠǡǪǫǬǭǺǻǾǿȀȁȂȃȄȅȆȇȈȉȊȋȌȍ\
ȎȏȔȕȖȗȦȧȨȩȪȫȬȭȮȯȰȱȺɆɇɨᶏᶒᶖᶙḀḁḔḕḖḗḘḙḚḛḜḝḬḭḮḯṌṍṎṏṐṑṒṓṲṳṴṵṶṷṸṹṺṻẚẠạẢảẤấẦầẨ\
ẩẪẫẬậẮắẰằẲẳẴẵẶặẸẹẺẻẼẽẾếỀềỂểỄễỆệỈỉỊịỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợỤụỦủỨứỪừỬửỮ\
ữỰựⱥⱸⱺꝊꝋꝌꝍ"

Szabolcs · Answer 2 · 2012-01-08T13:25:43.713

I dug up some old Mathematica code I had, and I'm going to paste it here. You can use it any way you please. Bugs are to be expected!

uninames = 
  StringSplit[
   Import["http://unicode.org/Public/UNIDATA/NamesList.txt", "Text"], 
   "\n"];

uniNameList = ({ToExpression["16^^" <> First[#]], 
      StringJoin@Riffle[Rest[#], "\n"]} & /@ 
    DeleteCases[
     Flatten /@ 
       Split[StringSplit[#, "\t" .., All] & /@ Take[uninames, All], 
        First[#2] === "" &] /. "" -> Sequence[], 
     x_ /; StringTake[First[x], 1] === "@"]);

uniRangeList = {FromDigits[#1, 16], 
     FromDigits[#3, 16], #2} & @@@ (Rest /@ 
     Select[StringSplit[#, "\t"] & /@ uninames, First[#] == "@@" &]);

Clear[unicodeName]
Set[unicodeName[#1], #2] & @@@ uniNameList;
Set[unicodeName[n_Integer /; #1 <= n <= #2], #3] & @@@ uniRangeList;
unicodeName[s_String /; StringLength[s] === 1] := 
 unicodeName[First@ToCharacterCode[s]]
unicodeName[_] := ""

Now we can do either

vowelCodes = Select[
  uniNameList[[All, 1]], 
  StringMatchQ[unicodeName[#], 
    "LATIN " ~~ "SMALL" | "CAPITAL" ~~ " LETTER " ~~ 
     "A" | "E" | "I" | "O" | "U" ~~ " WITH" ~~ ___] &
  ]

(which doesn't include things like æ), or

vowelCodes = Select[
  uniNameList[[All, 1]], 
  StringMatchQ[unicodeName[#], 
    "LATIN " ~~ "SMALL" | "CAPITAL" ~~ " LETTER " ~~ 
     "A" | "E" | "I" | "O" | "U" ~~ ___] &
  ]

(manual filtering is needed in this case to get rid of things like ESH, ʃ)

Then you can do FromCharacterCode /@ vowelCodes, but the default font might not show all characters.

The first approach gives me

"ÀÁÂÃÄÅÈÉÊËÌÍÎÏÒÓÔÕÖØÙÚÛÜàáâãäåèéêëìíîïòóôõöøùúûüĀāĂăĄąĒēĔĕĖėĘęĚěĨĩĪīĬ\
ĭĮįİŌōŎŏŐőŨũŪūŬŭŮůŰűŲųƗƟƠơƯưǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǞǟǠǡǪǫǬǭǺǻǾǿȀȁȂȃȄȅȆȇȈȉȊȋȌȍ\
ȎȏȔȕȖȗȦȧȨȩȪȫȬȭȮȯȰȱȺɆɇɨᶏᶒᶖᶙḀḁḔḕḖḗḘḙḚḛḜḝḬḭḮḯṌṍṎṏṐṑṒṓṲṳṴṵṶṷṸṹṺṻẚẠạẢảẤấẦầẨ\
ẩẪẫẬậẮắẰằẲẳẴẵẶặẸẹẺẻẼẽẾếỀềỂểỄễỆệỈỉỊịỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợỤụỦủỨứỪừỬửỮ\
ữỰựⱥⱸⱺꝊꝋꝌꝍ"

Please note that this filtering by Unicode names is not robust, and the table might easily be missing some vowels (e.g. I can't seem to find a dotless i, ı, in the output above).
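One alternative to name matching, sketched here in Java like the solution in the question, is to ask the normalizer for the canonical decomposition (NFD) and check whether it starts with a plain vowel. This is an illustration of a different trade-off rather than a more robust method: the class and method names are hypothetical, and characters without a canonical decomposition, such as ø and dotless ı, are not detected this way either.

```java
import java.text.Normalizer;

class DecomposedVowels {
    // True when the character canonically decomposes into a plain
    // vowel followed by one or more combining marks.
    static boolean isVowelWithMark(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return nfd.length() > 1 && "AEIOUaeiou".indexOf(nfd.charAt(0)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isVowelWithMark(0x00E0)); // a with grave: prints true
        System.out.println(isVowelWithMark(0x01D6)); // u with diaeresis and macron: prints true
        System.out.println(isVowelWithMark(0x00F8)); // o with stroke: prints false
        System.out.println(isVowelWithMark(0x0131)); // dotless i: prints false
    }
}
```

Combining the two checks (name pattern or decomposition) would give a slightly wider net, but some manual review of the resulting table is probably still needed.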
