Export unidecode database of ASCII equivalents of international characters

Question

How can I export the data from the unidecode Python module for use in another language?

This module converts Unicode characters to Latin (ASCII) characters, roughly preserving their phonetic meaning, like this:

kožušček => kozuscek
北亰 => Bei Jing
Москва => Moskva

This is useful, for example, for creating URLs for international web pages. There are ports to other languages, such as UnidecodeSharp, but they are not of very good quality.

score 0 · Answer 1 · answered Jun 05 '15 at 09:47

Here is a Python program, unidecode_sqlite.py, that exports the unidecode data to an SQLite database, which can be read from every major language:

#!/usr/bin/env python

'''Export unidecode data to SQLite'''

from __future__ import print_function, unicode_literals

import inspect
import os, sys, re
import sqlite3
import unidecode, unicodedata

def unidecode_sqlite(filename):
    '''Export unidecode data to filename'''

    if os.path.exists(filename):
        raise RuntimeError('File exists: %s' % filename)

    conn = sqlite3.connect(filename)
    conn.execute(
        '''create table if not exists unidecode (
            c text primary key,
            category text not null,
            ascii text not null
        )'''
    )

    unidecode_path = os.path.dirname(inspect.getfile(unidecode))

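    # Python 2 compatibility: unichr exists only on Python 2.
    # (Checking dir(__builtins__) fails when this file is imported as a
    # module, because __builtins__ is then a dict.)
    try:
        unichr_ = unichr
    except NameError:
        unichr_ = chr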

    # The data files are named xNNN.py, where NNN is the hexadecimal
    # number of a 256-character section.
    for entry in sorted(os.listdir(unidecode_path)):
        if not os.path.isfile(os.path.join(unidecode_path, entry)):
            continue
        filename_match = re.match(
            r'^x([0-9a-f]{3})\.py$',
            entry,
            re.IGNORECASE
        )
        if not filename_match:
            continue
        section = filename_match.group(1)
        section_start = int(section, 16) * 0x100
        for char_position in range(0x100):
            character = unichr_(section_start+char_position)
            unidecoded_character = unidecode.unidecode(character)
            if unidecoded_character is None or unidecoded_character == '[?]':
                continue
            conn.execute(
                '''insert into unidecode (c, category, ascii)
                    values (?,?,?)''',
                (
                    character,
                    unicodedata.category(character),
                    unidecoded_character
                )
            )
    conn.commit()
    conn.execute('vacuum')
    conn.close()

if __name__ == "__main__":
    if len(sys.argv) != 2:
        # A usage error goes to stderr and exits with a non-zero status
        print('USAGE: %s FILE' % sys.argv[0], file=sys.stderr)
        sys.exit(2)

    try:
        unidecode_sqlite(sys.argv[1])
    except (OSError, RuntimeError) as error:
        print('ERROR: %s' % error, file=sys.stderr)
        sys.exit(1)

Run it as follows on any computer with Python (2 or 3; I have not tested it on Windows). It creates a file of about 1.3 MB:

virtualenv venv
venv/bin/pip install unidecode
venv/bin/python unidecode_sqlite.py unidecode.sqlite
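
Reading the exported file back could look roughly like the sketch below. It is shown in Python 3 only to keep one language on this page; the same select statement should work from any language with an SQLite binding. The helper name transliterate is hypothetical, and the in-memory table with three hand-entered sample rows is an assumption that stands in for the real unidecode.sqlite so that the sketch runs on its own.

```python
import sqlite3


def transliterate(conn, text):
    '''Replace each character of text with the ascii column of its row in
    the unidecode table; characters without a row are dropped.
    (Hypothetical helper, not part of unidecode.)'''
    pieces = []
    for character in text:
        if ord(character) < 0x80:
            # Plain ASCII needs no lookup.
            pieces.append(character)
            continue
        row = conn.execute(
            'select ascii from unidecode where c = ?', (character,)
        ).fetchone()
        pieces.append(row[0] if row else '')
    return ''.join(pieces)


# Stand-in for sqlite3.connect('unidecode.sqlite'): same schema as the
# export script, filled with a few sample rows instead of the real data.
conn = sqlite3.connect(':memory:')
conn.execute(
    '''create table unidecode (
        c text primary key,
        category text not null,
        ascii text not null
    )'''
)
conn.executemany(
    'insert into unidecode (c, category, ascii) values (?,?,?)',
    [('ž', 'Ll', 'z'), ('š', 'Ll', 's'), ('č', 'Ll', 'c')],
)

print(transliterate(conn, 'kožušček'))  # prints: kozuscek
```

With the real file, replace the in-memory setup with sqlite3.connect('unidecode.sqlite'). Looking up one character at a time keeps the sketch short; a port that converts a lot of text would probably load the whole table into a dictionary once instead.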

Note that unidecode is licensed under GPL, which may preclude the use of the exported data in a lot of applications. The original Perl module is under the Perl Artistic License. And the actual data could probably be best gathered from the relevant Unicode publications, if possible, to avoid any licensing problems. — Joey, Jun 05 '15 at 09:48
@Joey I don't use unidecode's code, only its output. IANAL, but I don't think a program's output is covered by the GPL. — Tometzky, Jun 05 '15 at 09:53
You're essentially dumping all its data, which is less program output and more converting data files. IANAL, but that is an area I'd be careful with. — Joey, Jun 05 '15 at 10:06
